2024/05/01 00:23:33 technology

Before starting the analysis, let us first think about the following interview questions:
1. What is the index data structure of InnoDB? Why use this data structure?
2. What is the difference between a clustered index and an ordinary index?
3. What is a table return operation? Does it have any effect on indexing?

The growth process of the B+ tree behind a MySQL index is shown in the figure below:

How the B+ index tree grows

2.1 Data query without an index

The data page is the smallest unit of data management in MySQL. Since we want to study how an index makes data queries efficient, we must first understand how the data is stored. From the previous article, we learned that the structure of a data page is roughly like this:

Each row of the table is stored in the data area of a data page. Within the data area, the rows are connected by pointers into a singly linked list, as shown in the following figure:

At the same time, the data pages themselves are connected to one another as a doubly linked list, as shown in the following figure:

(1) Data query when there is no index
With this preliminary picture of the data page and its internal structure, we can look at what happens when we query a certain row in a table. The data pages are initially stored on disk, and a table usually corresponds to multiple data pages. When querying, the data pages are loaded from disk into the InnoDB buffer pool one after another, and the rows of each cached page are traversed one by one along the page's singly linked list. If the row is not found, the query follows the doubly linked list between data pages, loads the next data page from disk into the buffer pool, and traverses it in the same way.

As you can see, this query method is rather clumsy. If the row you want happens to be the last row of the last data page of the table, every data page has to be scanned, and the linked list inside each page has to be traversed as well. The overall effect is a linked-list traversal with O(n) time complexity, so query performance is poor.
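The scan described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Row`, `Page`, `build_pages` and `full_scan` names are hypothetical, pages live in memory rather than on disk, and the buffer pool is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    pk: int
    value: str
    next: Optional["Row"] = None   # singly linked list inside a page


@dataclass
class Page:
    first_row: Optional[Row] = None
    next: Optional["Page"] = None  # doubly linked list between pages
    prev: Optional["Page"] = None


def build_pages(chunks):
    """Build linked pages from lists of (pk, value) pairs (hypothetical helper)."""
    head = tail = None
    for chunk in chunks:
        page, last = Page(), None
        for pk, value in chunk:
            row = Row(pk, value)
            if last is None:
                page.first_row = row
            else:
                last.next = row
            last = row
        if tail is None:
            head = page
        else:
            tail.next, page.prev = page, tail
        tail = page
    return head


def full_scan(head, pk):
    """Walk every page and every row in order; return (row, rows_visited)."""
    visited = 0
    page = head
    while page is not None:
        row = page.first_row
        while row is not None:
            visited += 1
            if row.pk == pk:
                return row, visited
            row = row.next
        page = page.next
    return None, visited


head = build_pages([[(1, "a"), (2, "b"), (3, "c")], [(4, "d"), (5, "e"), (6, "f")]])
row, visited = full_scan(head, 6)
print(row.value, visited)
```

Looking up the last key touches every row in both pages, which is the O(n) behaviour the text describes.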

(2) Optimizing query efficiency within a data page: slots
Let us first turn our attention to queries within a single data page. Suppose we already know which data page holds the row. How can we quickly find the row we want inside that page? From the previous analysis, the most naive way is to traverse the singly linked list in the data page node by node, which is obviously inefficient. But if we could narrow the search range with a directory, just as when flipping through a book, query efficiency would improve. Based on this idea, the InnoDB storage engine uses slots to organize the rows in a data page. The slot information is stored in the page directory of the data page.

Simply put, slots divide the rows in a data page into groups. For each group, the address of the row with the largest primary key value in the group is recorded as the slot. In this way, each slot in the page directory works like a table-of-contents entry that marks the location of one group of rows, as shown in the following figure:

Now that the page directory holds the slot information, querying a row in the data page becomes simple. For example, to query the row with primary key 4, we use binary search over the slots to locate slot 2 in the page directory with O(log n) time complexity. Because the slots are adjacent, slot 1 can be found from slot 2, and we then start traversing group 2 from the row after the end of slot 1. Since each group contains only a few rows, a simple traversal of this small range quickly finds the row with primary key 4. The time complexity drops from the previous O(n) to O(log n), which is a considerable improvement. If the query is not by primary key, however, the slots cannot be used, and the singly linked list in the data page has to be traversed row by row to find the row you want.
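A minimal sketch of the slot idea, assuming a page is just a Python list of `(pk, value)` tuples sorted by primary key. The names `build_directory` and `find_in_page` and the group size of 3 are assumptions made for illustration, not InnoDB's actual layout.

```python
import bisect


def build_directory(rows, group_size=3):
    """Group sorted rows; each slot records the largest key and last position of a group."""
    slot_ends = [min(start + group_size, len(rows)) - 1
                 for start in range(0, len(rows), group_size)]
    slot_keys = [rows[i][0] for i in slot_ends]
    return slot_keys, slot_ends


def find_in_page(rows, slot_keys, slot_ends, pk):
    """Binary-search the slots, then walk only the rows of the matching group."""
    s = bisect.bisect_left(slot_keys, pk)  # first slot whose largest key >= pk
    if s == len(slot_keys):
        return None                        # pk is larger than every key in the page
    start = slot_ends[s - 1] + 1 if s > 0 else 0
    for i in range(start, slot_ends[s] + 1):
        if rows[i][0] == pk:
            return rows[i]
    return None


rows = [(k, f"row{k}") for k in range(1, 9)]
slot_keys, slot_ends = build_directory(rows)
print(slot_keys)                                      # largest key of each group
print(find_in_page(rows, slot_keys, slot_ends, 4))    # found in the second group
```

With eight rows and groups of three, key 4 falls in the second group, so only that group is walked after the binary search over the slots.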

2.2 The eve of indexing - page splitting

Here is a short digression on page splitting, which the indexing mechanism described later depends on. A data page is 16KB in size by default. When a data page holds enough rows to fill it, a new data page is created to continue writing rows. Without an index this is unproblematic, but once we want to create an index on the table, the data spread over multiple data pages must satisfy an ordering constraint.

If the primary key value of a row in the newly created data page is smaller than a primary key value in the previous data page, that situation is not allowed, as shown in the following figure:

In the situation shown in the figure, the primary keys across data pages are out of order. The index mechanism relies on the primary keys increasing from one data page to the next, so a page split occurs at this point.

The purpose of page splitting is clear: it adjusts the order of rows across data pages so that every primary key value in a later data page is greater than every primary key value in the page before it. Within a page, of course, the rows remain in increasing order along the singly linked list. The page splitting process is shown in the figure below:

We can see that page splitting mainly adjusts the order of rows between data pages so that primary key values are stored in order across pages. Only with such ordered data does efficient querying become possible. However, page splits involve moving data, so frequent page splits cause performance loss. This reminds us that it is important to reduce the probability of page splits. When designing a table, prefer an auto-increment primary key over a custom primary key whose order is hard to guarantee. An auto-increment primary key largely avoids out-of-order primary keys between data pages and so reduces the probability of page splits.
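The cost difference between ordered and out-of-order primary keys can be illustrated with a toy model. This is a sketch under stated assumptions: pages are Python lists with a hypothetical capacity of 4 rows, `insert_keys` is an invented helper, and "rows moved" simply counts the rows relocated when a full page has to be split in the middle. It is not how InnoDB implements splits internally.

```python
import bisect

PAGE_CAPACITY = 4  # hypothetical tiny capacity, for illustration only


def insert_keys(keys, capacity=PAGE_CAPACITY):
    """Insert keys one by one into globally ordered pages; return (pages, rows_moved)."""
    pages = [[]]
    rows_moved = 0
    for key in keys:
        # Pick the last page whose smallest key is <= key (or the first page).
        mins = [page[0] for page in pages if page]
        idx = max(bisect.bisect_right(mins, key) - 1, 0)
        page = pages[idx]
        bisect.insort(page, key)
        if len(page) <= capacity:
            continue
        if idx == len(pages) - 1 and page[-1] == key:
            # Key is the largest so far: open a fresh page, no existing row moves.
            pages.append([page.pop()])
        else:
            # Key landed in the middle of a full page: split and move the upper half.
            half = len(page) // 2
            pages.insert(idx + 1, page[half:])
            del page[half:]
            rows_moved += len(pages[idx + 1])
    return pages, rows_moved


print(insert_keys(range(1, 11)))          # auto-increment style keys
print(insert_keys([10, 20, 30, 40, 25]))  # a late, out-of-order key
```

In this model, increasing keys only ever append new pages, while a key that falls into the middle of a full page forces existing rows to be moved to keep the pages in order.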

2.3 From the primary key directory to the index page

Querying a row of data means locating, at the physical level, which row in which data page holds it. The problem of locating a row inside a data page was addressed earlier by the slots. What remains is how to locate the right data page among a large number of data pages. This is the goal of indexing.

(1) Primary key directory
The starting point is a primary key directory, in which each record pairs a data page number with the smallest primary key value in that data page, as shown in the following figure:

With this directory, looking up a row no longer requires scanning all the data in one data page and then moving on to the next. We can take the id, use binary search on the primary key directory to locate the specific data page, then locate the slot inside that data page and traverse the group of rows belonging to that slot to find the row.
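As a rough sketch of the directory lookup, assume the directory is a sorted list of `(smallest primary key, page number)` pairs. The page numbers and the `locate_page` helper below are made up for illustration.

```python
import bisect

# Hypothetical directory: (smallest primary key on the page, page number).
directory = [(1, 20), (5, 28), (9, 59)]


def locate_page(directory, pk):
    """Binary-search the directory for the page whose key range contains pk."""
    min_keys = [min_key for min_key, _ in directory]
    i = bisect.bisect_right(min_keys, pk) - 1  # last entry with min_key <= pk
    return directory[max(i, 0)][1]


print(locate_page(directory, 6))  # falls between 5 and 9, so the second page
```

The function only answers "which page", and the search inside that page would then proceed through the page directory slots described earlier in the text.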

(2) Index page
Now there is a problem: each table has many data pages, so the primary key directory will hold a lot of data and may not fit. The InnoDB designers reasoned that directory data is also data, so why not store it in data pages as well? The information in the primary key directory is therefore moved into data pages, and these pages are called index pages, as shown in the following figure:

This shows that data pages do not only store the rows of a table. Because the capacity of the primary key directory is limited, we moved its information into data pages to form index pages, but the same problem appears again. A data page is only 16KB, so the capacity of an index page is also limited. What happens when it is not enough?

To solve the problem of insufficient index page capacity, the index pages are reorganized into another level. First, the entries that exceed the capacity are put into a new index page, and then another layer of index pages is added on top, as shown in the following figure:

From the figure above we can see that the new upper-level index page 35 does not store the minimum primary key of each data page. Instead, it stores the minimum primary key of each index page below it. By analogy, if the capacity of index page 35 is not enough either, another level is added. The final result looks like the following:

As you can see, the structure built from layer upon layer of index pages is what we usually call an index tree, and in MySQL this tree is called a B+ tree index. A tree like this lends itself naturally to binary search. To query a row, we start from the root node and use binary search at each level, locating the data page with O(log n) time complexity. Inside the data page we again use binary search to locate the slot, and a short traversal of that slot's group finds the row. Compared with the scenario without an index, this is much faster.
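The descent from the root to a data page can be sketched as follows. The `IndexPage` and `DataPage` classes and the `search` function are hypothetical, and the lookup inside a data page is simplified to a dictionary rather than slots, so this only illustrates the level-by-level binary search.

```python
import bisect
from dataclasses import dataclass


@dataclass
class DataPage:
    rows: dict       # primary key -> row (in-page search simplified to a dict)


@dataclass
class IndexPage:
    min_keys: list   # smallest primary key of each child, in increasing order
    children: list   # child pages: IndexPage or DataPage


def search(root, pk):
    """Descend from the root; return (row, pages_read)."""
    node, pages_read = root, 1
    while isinstance(node, IndexPage):
        i = max(bisect.bisect_right(node.min_keys, pk) - 1, 0)
        node = node.children[i]
        pages_read += 1
    return node.rows.get(pk), pages_read


leaves = [DataPage({1: "a", 2: "b"}), DataPage({3: "c", 4: "d"}),
          DataPage({5: "e", 6: "f"}), DataPage({7: "g", 8: "h"})]
left = IndexPage([1, 3], leaves[:2])
right = IndexPage([5, 7], leaves[2:])
root = IndexPage([1, 5], [left, right])
print(search(root, 6))
```

In this three-level toy tree, any key is reached by reading three pages (root, one index page, one data page), regardless of how many data pages exist at the bottom level.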

Clustered index, ordinary index and covering index

There are some common index terms that we need to distinguish. First, a clustered index is exactly the kind of tree we saw above: its leaf nodes are data pages, and these data pages store the complete data of every row in the table. When a B+ tree uses the data pages holding complete rows as its leaf nodes, we call that index tree a clustered index. If an index tree does not use the data pages as its leaf nodes, it is called a secondary index or an ordinary index.
The biggest difference between a clustered index and an ordinary index is that the leaf nodes of the clustered index store the complete row, while the leaf nodes of a secondary index store only part of the fields: the indexed fields and the primary key value.
In fact, a covering index is not itself an index but a way of querying data. For example, suppose we create an index on the field name and then execute a query such as: select name from table where name like '张%'. We can read the matching name values directly from the B+ tree of the name index and return them. In other words, the field we want is already in the index, so we can pick it straight out of the tree with an efficient binary search. This way of querying is called a covering index.
By contrast, if the query is changed to: select * from table where name like '张%', it is no longer a covering index, because after finding the matching names in the index tree you must also use the id value to return to the table (the clustered index) and fetch all the fields.
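The difference between a covering query and a query that returns to the table can be sketched with two in-memory structures. Everything here is an assumption for illustration: the clustered index is a dict keyed by id, the secondary index is a sorted list of `(name, id)` pairs, the sample rows use ASCII names, and `select_name` and `select_all` are hypothetical stand-ins for the two SQL statements.

```python
import bisect

# Clustered index: primary key -> complete row.
clustered_index = {
    1: {"id": 1, "name": "Zhang San", "age": 30},
    2: {"id": 2, "name": "Li Si", "age": 25},
    3: {"id": 3, "name": "Zhang Wei", "age": 41},
    4: {"id": 4, "name": "Wang Wu", "age": 36},
}

# Secondary index on name: only (name, primary key), kept in sorted order.
name_index = sorted((row["name"], pk) for pk, row in clustered_index.items())


def scan_name_index(prefix):
    """Yield (name, pk) entries whose name starts with prefix, using binary search."""
    start = bisect.bisect_left(name_index, (prefix,))
    for name, pk in name_index[start:]:
        if not name.startswith(prefix):
            break
        yield name, pk


def select_name(prefix):
    """Covering query: the answer comes from the secondary index alone."""
    return [name for name, _ in scan_name_index(prefix)]


def select_all(prefix):
    """Non-covering query: each match needs a lookup in the clustered index."""
    return [clustered_index[pk] for _, pk in scan_name_index(prefix)]


print(select_name("Zhang"))
print(select_all("Zhang"))
```

`select_name` never touches `clustered_index`, whereas `select_all` performs one extra lookup per matching entry, which is the table return step.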

Analysis of the advantages and disadvantages of indexes

The advantage of an index is, of course, efficient querying. The index turns the O(n) cost of traversing linked lists into O(log n). However, the disadvantages are also clear. From a time perspective, an index requires the primary keys to grow in order, and unordered primary keys cause frequent page splits that hurt efficiency. In addition, every insert, delete, and update on the table must also maintain the indexes, and this maintenance is another source of performance loss. From a space perspective, index data takes up storage space just as the actual data does. So although indexes improve query efficiency, the system also has to bear the cost they bring. From this point of view, more indexes are not always better.

Design a good index in three dimensions

Next, we optimize index design from the following three dimensions:

(1) From a time perspective
Use an auto-increment primary key wherever possible, so that the primary keys of rows in newly added data pages keep increasing. This avoids frequent, unnecessary page splits, which would otherwise cost performance and slow down queries.

In addition, it is important to select an appropriate field as the index field. Choose a field with a large cardinality, that is, a field that can take many distinct values. This lets binary search in the B+ tree work at full strength. If the cardinality of the indexed field is small, the search may degenerate into a linear scan with O(n) time complexity.
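One rough way to compare candidate fields is the ratio of distinct values to total rows. The `selectivity` helper and the sample columns below are hypothetical and only illustrate the idea of cardinality.

```python
def selectivity(values):
    """Fraction of distinct values in a column sample; closer to 1.0 means higher cardinality."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


gender = ["M", "F"] * 500   # only two distinct values in 1000 rows
user_id = range(1000)       # every value is distinct
print(selectivity(gender), selectivity(user_id))
```

A field like the sample `gender` column narrows a search to roughly half the table at best, while a field with all-distinct values lets the tree narrow the search to a single row.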

(2) From a space perspective
Because index data itself takes up space, choose a field with a short length as the index field where possible, so that the whole B+ tree does not take up so much space. If you have to index a long field, that is still possible: you can index only a prefix of the field. Such an index is called a prefix index. However, it mainly helps prefix-style fuzzy queries and is generally not suitable for order by or group by, because the index holds only part of each value.
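When picking a prefix length, one common heuristic is to find the shortest prefix that distinguishes values almost as well as the full field. The sketch below assumes that heuristic. The helper names, the 0.9 threshold, and the sample strings are all invented for illustration.

```python
def prefix_selectivity(values, length):
    """Fraction of distinct prefixes of the given length."""
    return len({v[:length] for v in values}) / len(values)


def shortest_good_prefix(values, threshold=0.9):
    """Smallest length whose selectivity reaches threshold * full-field selectivity."""
    full = len(set(values)) / len(values)
    longest = max(len(v) for v in values)
    for length in range(1, longest + 1):
        if prefix_selectivity(values, length) >= threshold * full:
            return length
    return longest


emails = ["alice@example.com", "alina@example.com",
          "bob@example.com", "bobby@example.com"]
print(prefix_selectivity(emails, 2))
print(shortest_good_prefix(emails))
```

For this sample, a two-character prefix cannot tell the values apart, while a four-character prefix already distinguishes all of them, so a longer prefix would only cost space.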

(3) From the perspective of scope of use
The purpose of designing an index is, of course, to have queries actually use it. When designing an index, try to make it usable by the where, group by, and order by clauses of your statements.
