MySQL添加简单索引，表容量增长超预期的场景

数据运维 2024-02-21 捡田螺的小男孩手机阅读

数据库设计其实很重要，除了常规的规则，一些细节方面的设计，如果存在偏差，特定场景下，就可能对系统的运行性能产生影响。技术社群的这篇文章《技术译文 | 为什么 MySQL 添加一个简单索引后表大小增长远超预期？》就介绍了一个增加简单索引导致表容量增长远超预期的场景，值得学习和借鉴，在数据库设计时，需要格外关注。

仅保留必要的索引以降低写入性能和磁盘空间开销是一种众所周知的好习惯。MySQL 官方文档中简要提到了这个简单的规则[1]

然而，在某些情况下，添加新索引的开销可能远远超出预期！最近，我一直在分析一个这样的客户案例，它鼓励我分享这样的例子，因为它肯定会让许多开发人员甚至 DBA 感到惊讶！

让我们以这个非常具体的表为例，它只有三列。主键是在其中两个上创建的，

mysql > show create table t1G *************************** 1. row *************************** Table: t1 Create Table: CREATE TABLE `t1` ( `a` varchar(32) NOT NULL, `b` int unsigned NOT NULL, `c` varchar(32) NOT NULL, PRIMARY KEY (`a`,`c`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci 1 row in set (0.00 sec)

使用此模式，具有 5M 随机行的表具有以下统计信息（请注意）：Index_lenght

mysql > show table status like 't1'G *************************** 1. row *************************** Name: t1 Engine: InnoDB Version: 10 Row_format: Dynamic Rows: 4969944 Avg_row_length: 104 Data_length: 521125888 Max_data_length: 0 Index_length: 0 Data_free: 2097152 Auto_increment: NULL Create_time: 2024-01-22 22:39:52 Update_time: NULL Check_time: NULL Collation: utf8mb4_0900_ai_ci Checksum: NULL Create_options: Comment: 1 row in set (0.01 sec)

磁盘上的大小（表已优化以排除碎片），

$ ls -lh db1/t1.ibd -rw-r----- 1 przemek przemek 508M Jan 22 22:40 db1/t1.ibd

当我们的查询在 WHERE 子句中使用列 b 时，很自然地我们必须通过向该列添加索引来优化此类查询，以避免这种不良执行（即全表扫描），

mysql > EXPLAIN select * from t1 where b=10G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t1 partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 4976864 filtered: 0.00 Extra: Using where 1 row in set, 1 warning (0.00 sec)

因此，我们添加索引，

mysql > alter table t1 add key(b); Query OK, 0 rows affected (38.68 sec) Records: 0 Duplicates: 0 Warnings: 0

而且现在查询速度非常快。然而，表大小增加了 400 MB！

$ ls -lh db1/t1.ibd -rw-r----- 1 przemek przemek 908M Jan 22 23:11 db1/t1.ibd

您可能会问 — 在一个小的 INT 列上添加索引怎么可能实现如此巨大的增长？表大小增加了 79% ，完全出乎意料，因为我们只索引了表中最小的列！

如果我告诉你我实际上预计它会增长得更多，你会感到惊讶吗？原因是二级索引将主键列附加到其记录中。让我在这里引用文档[2]：

在 InnoDB 中，辅助索引中的每条记录都包含该行的主键列，以及为辅助索引指定的列。InnoDB 使用此主键值来搜索聚集索引中的行。

如果主键较长，则二级索引会占用更多空间，因此主键较短是有利的。

因此，在这种表情况下，新索引将包含所有三列，从而有效地复制所有表值！在更改之前，只有聚集（主）索引[3]，它保存整行数据（包括其定义的列和其他列）。b 列上的新索引包括该列以及定义为主键的列，因此再次，这个特定表案例中的所有三列。因此，我预计表的大小会增加一倍。让我们研究一下为什么这种情况没有发生以及为什么表空间文件没有增长到 ~1GB。

在 innodb_ruby[4]工具的帮助下，我在 ALTER 之前检查了索引统计信息，

$ innodb_space -f db1/t1.ibd space-indexes id name root fseg fseg_id used allocated fill_factor 314 4 internal 3 132 159 83.02% 314 4 leaf 4 27680 31638 87.46%

要找出 id=314 的索引是什么，我们可以使用以下查询，

mysql > select SPACE,INDEX_ID,i.NAME as index_name, t.NAME as table_name,FILE_SIZE from information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLESPACES t USING(space) WHERE t.NAME='db1/t1'G *************************** 1. row *************************** SPACE: 15 INDEX_ID: 314 index_name: PRIMARY table_name: db1/t1 FILE_SIZE: 532676608 1 row in set (0.00 sec)

因此表中大约有 32k 已分配页。该工具可以让您了解有关 InnoDB 表空间的许多详细信息，例如，每页使用情况是多少，即，

$ innodb_space -f db1/t1.ibd space-index-pages-summary|head -10 page index level data free records 4 314 2 9371 6819 131 5 314 0 15105 1055 189 6 314 0 15101 1063 183 7 314 1 15060 1088 214 8 314 0 15058 1104 184 9 314 0 15121 1041 184 10 314 0 15118 1044 184 11 314 0 15063 1101 180 12 314 0 15072 1092 180

添加二级索引后，我们可以看到更多关于新索引与主键对比的细节，

mysql > show table status like 't1'G *************************** 1. row *************************** Name: t1 Engine: InnoDB Version: 10 Row_format: Dynamic Rows: 4981016 Avg_row_length: 104 Data_length: 521125888 Max_data_length: 0 Index_length: 413122560 Data_free: 4194304 Auto_increment: NULL Create_time: 2024-01-22 23:11:19 Update_time: NULL Check_time: NULL Collation: utf8mb4_0900_ai_ci Checksum: NULL Create_options: Comment: 1 row in set (0.00 sec)

$ innodb_space -f db1/t1.ibd space-indexes id name root fseg fseg_id used allocated fill_factor 314 4 internal 3 132 159 83.02% 314 4 leaf 4 27680 31648 87.46% 315 39 internal 5 104 159 65.41% 315 39 leaf 6 21914 25056 87.46%

索引 315 是 b 列的次要索引，

mysql > select SPACE,INDEX_ID,i.NAME as index_name, t.NAME as table_name,FILE_SIZE from information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLESPACES t USING(space) WHERE t.NAME='db1/t1'G *************************** 1. row *************************** SPACE: 15 INDEX_ID: 314 index_name: PRIMARY table_name: db1/t1 FILE_SIZE: 952107008 *************************** 2. row *************************** SPACE: 15 INDEX_ID: 315 index_name: b table_name: db1/t1 FILE_SIZE: 952107008 2 rows in set (0.01 sec)

我们可以看到它分配的总页数较少，约为 25k，而主键为 32k。那么为什么聚集索引需要更多的页来保存相同的数据值呢？索引摘要显示了两者之间的差异，

$ innodb_space -f db1/t1.ibd space-index-pages-summary page index level data free records (...) 6327 314 0 15057 1107 181 6328 314 0 15138 1026 180 6329 314 0 15128 1038 178 6330 314 0 15134 1032 178 6331 314 0 15135 1031 179 6332 314 0 15071 1095 179 6333 314 0 15092 1072 180 6334 314 0 15072 1094 179 6335 314 0 15115 1051 179 6336 314 1 15064 1086 211 6337 314 1 15098 1052 210 6338 314 1 15112 1038 211 6339 314 1 15119 1029 212 6340 315 0 16072 70 227 6341 315 0 16135 5 228 6342 315 1 16132 16 215 6343 315 0 16118 22 228 6344 315 0 16097 45 226 6345 315 0 16074 66 229 6346 315 0 16093 47 229 6347 315 0 16091 49 228 6348 315 0 16073 69 227 6349 315 0 16092 48 230 6350 315 0 16095 45 228 6351 315 0 16133 9 227 6352 315 0 16126 14 230

innodb_ruby 工具让我们看到，与聚集索引（id 314）相比，二级索引（id 315）能够在一页上存储更多记录。后者每页留下更多可用空间。这解释了为什么重复值并没有完全导致重复表空间大小。该工具允许我们使用 GNUplot 很好地说明这一点，

复制主键值的一个重要副作用是列 b 的索引在我们的表案例中是覆盖索引[5]！

这就是为什么我们可以在 extra info[6] 中看到使用索引，即使索引仅在一列上，

mysql > EXPLAIN select * from t1 where b=10G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t1 partitions: NULL type: ref possible_keys: b key: b key_len: 4 ref: const rows: 1 filtered: 100.00 Extra: Using index 1 row in set, 1 warning (0.00 sec)

由此得出的结论是，如果可能的话，我们希望主键更小。让我们相应地修改这个表，

mysql > alter table t1 drop primary key, add column id int primary key auto_increment; Query OK, 0 rows affected (16.66 sec) Records: 0 Duplicates: 0 Warnings: 0

这将表大小从 908M 减少到 608M，

$ ls -lh db1/t1.ibd -rw-r----- 1 przemek przemek 608M Jan 23 19:12 db1/t1.ibd

（磁盘空间被自动回收，因为更改主键会有效地重新创建表。）

但是，如果其他查询通过列 a 和 c 进行过滤，则没有多大意义，因为我们需要添加另一个索引来满足它们，由于这些列的大小很大，这会进一步放大整体大小，

mysql > alter table t1 add key a_c(a,c); Query OK, 0 rows affected (45.33 sec) Records: 0 Duplicates: 0 Warnings: 0

$ ls -lh db1/t1.ibd -rw-r----- 1 przemek przemek 1004M Jan 23 19:16 db1/t1.ibd

另一方面，此架构更改将有利于其他操作，例如 ROW 复制（简单 INT PK 查找与复合 varchar 查找）、校验和等。修改后的表定义如下所示，

mysql > show create table t1G *************************** 1. row *************************** Table: t1 Create Table: CREATE TABLE `t1` ( `a` varchar(32) NOT NULL, `b` int unsigned NOT NULL, `c` varchar(32) NOT NULL, `id` int NOT NULL AUTO_INCREMENT, PRIMARY KEY (`id`), KEY `b` (`b`), KEY `a_c` (`a`,`c`) ) ENGINE=InnoDB AUTO_INCREMENT=5001368 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci 1 row in set (0.00 sec)

新的索引统计信息反映了优化的主键如何使相同的二级索引（id=323
）大小开销更小 – 4.7k 页而不是 25k 页，

mysql > select SPACE,INDEX_ID,i.NAME as index_name, t.NAME as table_name,FILE_SIZE from information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLESPACES t USING(space) WHERE t.NAME='db1/t1'G *************************** 1. row *************************** SPACE: 19 INDEX_ID: 322 index_name: PRIMARY table_name: db1/t1 FILE_SIZE: 1052770304 *************************** 2. row *************************** SPACE: 19 INDEX_ID: 323 index_name: b table_name: db1/t1 FILE_SIZE: 1052770304 *************************** 3. row *************************** SPACE: 19 INDEX_ID: 324 index_name: a_c table_name: db1/t1 FILE_SIZE: 1052770304 3 rows in set (0.00 sec)

$ innodb_space -f db1/t1.ibd space-indexes id name root fseg fseg_id used allocated fill_factor 322 4 internal 3 27 27 100.00% 322 4 leaf 4 29003 33184 87.40% 323 5 internal 5 6 6 100.00% 323 5 leaf 6 4157 4768 87.19% 324 32354 internal 7 104 159 65.41% 324 32354 leaf 8 21913 25056 87.46%

上面的例子有点不典型，但我的目标是提醒人们注意 InnoDB 引擎在索引方面经常被遗忘的行为。主键越大，添加二级索引的开销成比例地越大！磁盘使用情况的差异可能很大，因此在设计表时请记住这一点。

出于同样的原因，使用 UUID 作为主键的常见做法会损害性能和存储数据的总体成本。

参考资料[1]
optimization-indexes: https://dev.mysql.com/doc/refman/8.0/en/optimization-indexes.html
[2]
innodb-index-types: https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html
[3]
innodb-index-types: https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html
[4]
innodb ruby: https://github.com/jeremycole/innodb_ruby
[5]
glos_covering_index: https://dev.mysql.com/doc/refman/8.0/en/glossary.html#glos_covering_index
[6]
explain-extra-information: https://dev.mysql.com/doc/refman/8.0/en/explain-output.html#explain-extra-information

如果您认为这篇文章有些帮助，还请不吝点下文章末尾的"点赞"和"在看"，或者直接转发pyq，

近期更新的文章：《"sql_require_primary_key"参数为ON如何修改主键？》
《公众号1400篇文章分类和索引》《公众号1400篇文章数据统计》《中国足球现状》《SQL统计腾讯会议参会信息技巧》
近期的热文：《推荐一篇Oracle RAC Cache Fusion的经典论文》
《"红警"游戏开源代码带给我们的震撼》
文章分类和索引：《公众号1400篇文章分类和索引》