NL连接一定是小表驱动大表效率高吗

数据运维 2023-08-15 三掌柜手机阅读

GreatSQL社区原创内容未经授权不得随意使用，转载请联系小编并注明来源。
GreatSQL是MySQL的国产分支版本，使用上与MySQL一致。
作者： JennyYu
文章来源：GreatSQL社区原创

前言

两表使用nest loop（以下简称NL）方式进行连接，小表驱动大表效率高，这似乎是大家的共识，但事实上这是有条件的，并不总是成立。这主要看大表扫描关联字段索引后返回多少数据量，是否需要回表，如果大表关联后返回大量数据，然后再回表，这个代价就会很高，大表处于被驱动表的位置可能就不是最佳选择了。

实验举例

使用benchmarksql压测的两个表bmsql_warehouse与bmsql_order_line来测试，初始化10仓数据。

mysql> show create table bmsql_warehouse\G *************************** 1. row *************************** Table: bmsql_warehouse Create Table: CREATE TABLE `bmsql_warehouse` ( `w_id` int NOT NULL, `w_ytd` decimal(12,2) DEFAULT NULL, `w_tax` decimal(4,4) DEFAULT NULL, `w_name` varchar(10) DEFAULT NULL, `w_street_1` varchar(20) DEFAULT NULL, `w_street_2` varchar(20) DEFAULT NULL, `w_city` varchar(20) DEFAULT NULL, `w_state` char(2) DEFAULT NULL, `w_zip` char(9) DEFAULT NULL, PRIMARY KEY (`w_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

mysql> show create table bmsql_order_line\G *************************** 1. row *************************** Table: bmsql_order_line Create Table: CREATE TABLE `bmsql_order_line` ( `ol_w_id` int NOT NULL, `ol_d_id` int NOT NULL, `ol_o_id` int NOT NULL, `ol_number` int NOT NULL, `ol_i_id` int NOT NULL, `ol_delivery_d` timestamp NULL DEFAULT NULL, `ol_amount` decimal(6,2) DEFAULT NULL, `ol_supply_w_id` int DEFAULT NULL, `ol_quantity` int DEFAULT NULL, `ol_dist_info` char(24) DEFAULT NULL, PRIMARY KEY (`ol_w_id`,`ol_d_id`,`ol_o_id`,`ol_number`), KEY `ol_stock_fkey` (`ol_supply_w_id`,`ol_i_id`), KEY `ol_d_id` (`ol_d_id`), CONSTRAINT `ol_order_fkey` FOREIGN KEY (`ol_w_id`, `ol_d_id`, `ol_o_id`) REFERENCES `bmsql_oorder` (`o_w_id`, `o_d_id`, `o_id`), CONSTRAINT `ol_stock_fkey` FOREIGN KEY (`ol_supply_w_id`, `ol_i_id`) REFERENCES `bmsql_stock` (`s_w_id`, `s_i_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

查看如下sql的执行计划与效率：

select * from bmsql_order_line a join bmsql_warehouse b on a.ol_d_id=b.w_id where a.ol_dist_info like 'a%' and b.w_ytd =300000.00; mysql> explain analyze select * from bmsql_order_line a join bmsql_warehouse b on a.ol_d_id=b.w_id -> where a.ol_dist_info like 'a%' and b.w_ytd =300000.00; +--------------------------------------------------------------------+ | EXPLAIN | +--------------------------------------------------------------------+ | -> Nested loop inner join (cost=396352.21 rows=323755) (actual time=11.542..19705.922 rows=115207 loops=1) -> Filter: (b.w_ytd = 300000.00) (cost=1.15 rows=9) (actual time=0.780..0.893 rows=10 loops=1) -> Table scan on b (cost=1.15 rows=9) (actual time=0.743..0.810 rows=10 loops=1) -> Filter: (a.ol_dist_info like 'a%') (cost=12059.95 rows=35973) (actual time=1.401..1969.304 rows=11521 loops=10) -> Index lookup on a using ol_d_id (ol_d_id=b.w_id) (cost=12059.95 rows=323788) (actual time=1.388..1833.176 rows=300209 loops=10) |+--------------------------------------------------------------------+ 1 row in set (20.31 sec)

从上面的执行计划看出，优化器选择小表b表驱动大表a，b表返回10条记录，属于小表，a表为被驱动表，每次关联使用二级索引ol_d_id，扫描索引320209行，回表过滤后剩余11521行记录，属于大表，最终结果集返回115207行数据。使用此计划耗时20秒左右。

使用hint改变表的连接顺序

mysqlb> explain analyze select /*+ join_order(a,b) */ * from bmsql_order_line a join bmsql_warehouse b on a.ol_d_id=b.w_id where a.ol_dist_info like 'a%' and b.w_ytd =300000.00; +---------------------------------------------------------------------------+ | EXPLAIN | +---------------------------------------------------------------------------+ | -> Nested loop inner join (cost=408609.87 rows=323755) (actual time=1.374..4696.931 rows=115207 loops=1) -> Filter: (a.ol_dist_info like 'a%') (cost=295295.55 rows=323755) (actual time=1.036..4614.585 rows=115207 loops=1) -> Table scan on a (cost=295295.55 rows=2914088) (actual time=0.937..4275.678 rows=3002091 loops=1) -> Filter: (b.w_ytd = 300000.00) (cost=0.25 rows=1) (actual time=0.000..0.000 rows=1 loops=115207) -> Single-row index lookup on b using PRIMARY (w_id=a.ol_d_id) (cost=0.25 rows=1) (actual time=0.000..0.000 rows=1 loops=115207) +----------------------------------------------------------------------------+ 1 row in set (4.79 sec)

从上面的执行计划看出，改变连接顺序后，大表a驱动小表b，此计划执行耗时4秒左右，相比小表b驱动大表a，时间上节省了近80%。由此可见，并不总是小表驱动大表效率高。

其实这属于两表关联，返回大量数据的SQL，在MySQL8.0版本可以控制优化器使用 hash join，走 hash join的效率会比NL要高。忽略两表关联字段上的索引，让优化器选择走 hash join。

mysql> explain analyze select * from bmsql_order_line a ignore index(ol_d_id) join bmsql_warehouse b ignore index(primary) on a.ol_d_id=b.w_id where a.ol_dist_info like 'a%' and b.w_ytd =300000.00; +----------------------------------------------------------------------------------------------------+ | EXPLAIN | +-------------------------------------------------------------------------------------------+ | -> Inner hash join (a.ol_d_id = b.w_id) (cost=295489.08 rows=3997) (actual time=0.428..3586.047 rows=115207 loops=1) -> Filter: (a.ol_dist_info like 'a%') (cost=29634.41 rows=35973) (actual time=0.155..3549.633 rows=115207 loops=1) -> Table scan on a (cost=29634.41 rows=2914088) (actual time=0.133..2747.262 rows=3002091 loops=1) -> Hash -> Filter: (b.w_ytd = 300000.00) (cost=1.15 rows=9) (actual time=0.129..0.156 rows=10 loops=1) -> Table scan on b (cost=1.15 rows=9) (actual time=0.123..0.147 rows=10 loops=1) | +----------------------------------------------------------------------------------+ 1 row in set (3.67 sec)

此处注意：虽然官方文档上说可以使用BNL与NO_BNL的hint来启用与禁用 hash join，但是在关联字段上有索引的情况下，优化器不会评估 hash join的代价，也就不会选择 hash join，NO_BNL能够禁用 hash join，但是BNL并不能严格让优化器选择 hash join。

如果大表的关联字段使用索引覆盖，不需要回表的情况下执行效率如何呢？

看下面的SQL的执行计划，SQL中变换大表a的关联字段。

mysql> explain analyze select * from bmsql_order_line a join bmsql_warehouse b on a.ol_w_id=b.w_id where a.ol_dist_info like 'a%' and b.w_ytd =300000.00; +--------------------------------------------------------------------------------------------+ | EXPLAIN | +--------------------------------------------------------------------------------------------+ | -> Nested loop inner join (cost=494.86 rows=544) (actual time=0.868..4154.968 rows=115207 loops=1) -> Filter: (b.w_ytd = 300000.00) (cost=1.15 rows=9) (actual time=0.387..0.476 rows=10 loops=1) -> Table scan on b (cost=1.15 rows=9) (actual time=0.363..0.417 rows=10 loops=1) -> Filter: (a.ol_dist_info like 'a%') (cost=1.15 rows=60) (actual time=0.119..414.532 rows=11521 loops=10) -> Index lookup on a using PRIMARY (ol_w_id=b.w_id) (cost=1.15 rows=544) (actual time=0.109..385.753 rows=300209 loops=10) | +-------------------------------------------------------------------------------------------+ 1 row in set (4.23 sec)

从上面的执行计划看出，优化器依然选择小表b驱动大表a，大表作为被驱动表，使用主键进行扫描，不需要回表，在此例子中小表驱动大表与大表驱动小表的执行耗时是差不多的，哪种方式效率高主要看大表过滤后的数据量占全表的百分比，不同的数据量可能就需要选择不同的方式。

总结

MySQL8.0 有两种连接方式，选择NL还是 hash join，要看两表关联后返回少量数据还是大量数据，一般情况下，少量数据 NL 优于 hash join，大量数据，hash join 优于 NL。

如果只能选择NL连接（低于MySQL8.0的版本），那么在NL 情况下，是小表驱动大表快还是大表驱动小表快，看大表关联使用的索引是否形成索引覆盖，及关联后返回的数据量。

大表关联使用二级索引，关联后返回大量数据，又需要回表，这种情况下，一般选择大表驱动小表效率高些；关联后返回少量数据，一般选择小表驱动大表效率高些。

大表关联使用索引覆盖，要看大表过滤后的数据量占全表的百分比，不同的数据量可能就需要选择不同的方式。

不要试图去记住这些结论，深入了解表的连接方式与扫描方式，理解SQL的执行过程，一切都会变得顺理成章，我们的人脑会对SQL选择哪种执行计划执行效率高有一个清晰的判断，如果优化器做出错误的决策，可以尝试使用各种优化方式干涉优化器的决策。

NL连接一定是小表驱动大表效率高吗

前言

实验举例

总结

MySQL的Replace into 与Insert into on duplicate key update真正的不同之处

MSSQL批量替换语句 在SQL SERVER中批量替换字符串的

如何将带格式的文本存进数据库？ (带格式文本存进数据库)

怎么查看MYSQL是否已激活

MySQL服务器进程CPU占用100%的解决方法

MSSQL批量替换语句在SQL SERVER中批量替换字符串的