Flush Locks in GreatSQL - 每日运维


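Preface

A colleague on the operations team ran a dump backup command and the database froze, with many sessions stuck in Waiting for table flush. That incident prompted me to study how flush locks work in GreatSQL. If you are curious, follow along.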

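Symptoms of a flush lock

The main symptom of a flush lock problem is that the database grinds to a halt: every new query that needs some or all of the tables stops and waits for the flush lock. The signals to look for are: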

1. New queries have the state Waiting for table flush. This may affect all new queries, or only new queries that access specific tables.

2. The number of database connections grows, and new connections may eventually fail because the connection limit is exhausted.

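3. At least one query has been running longer than the oldest flush lock request, that is, it started before the flush was issued.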

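4. The process list may contain a flush tables statement, or that statement may already have timed out (exceeding lock_wait_timeout) or been cancelled (the session was interrupted with Ctrl+C or killed).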
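The four signals above can be checked with a few read-only statements. This is a minimal sketch, not a complete health check; it assumes the sys schema is installed and that the account is allowed to read sys.session.

```sql
-- Signal 1: how many running statements are in each state,
-- and how long the oldest one in that state has been running (seconds).
SELECT state,
       COUNT(*)  AS sessions,
       MAX(time) AS longest_s
FROM sys.session
WHERE command = 'Query'
GROUP BY state
ORDER BY sessions DESC;

-- Signal 2: current connection count compared with the configured limit.
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL VARIABLES LIKE 'max_connections';

-- Signal 4: the timeout after which a waiting flush statement gives up.
SHOW GLOBAL VARIABLES LIKE 'lock_wait_timeout';
```

A large count for Waiting for table flush together with Threads_connected approaching max_connections matches the pattern described here.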

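Reproducing a flush lock

This experiment uses GreatSQL version 8.0.32-25 GreatSQL (GPL).

Open four connections. The first runs a slow query. The second runs the flush tables statement. The third runs a fast query against the table used by the slow query in the first connection. The fourth queries and inserts into a different table: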

Connection 1> select count(*), sleep(100) from t1;
Connection 2> flush tables; (or: flush tables with read lock;)
Connection 3> select count(*) from t1;
Connection 4> select count(*) from t2; insert into t2 values(5,'a');
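The article does not show how t1 and t2 are defined. To try the four statements on a test instance, a setup along the following lines should be enough. The column names and types are assumptions chosen only so that the two-value insert into t2 works; the test database name is taken from the prompt shown in the output.

```sql
-- Hypothetical schema: the original table definitions are not given.
CREATE DATABASE IF NOT EXISTS test;
USE test;

CREATE TABLE IF NOT EXISTS t1 (
  id INT PRIMARY KEY,
  c1 VARCHAR(10)
);

CREATE TABLE IF NOT EXISTS t2 (
  id INT PRIMARY KEY,
  c1 VARCHAR(10)
);

-- One row each is enough for the experiment; IGNORE makes the script re-runnable.
INSERT IGNORE INTO t1 VALUES (1, 'a');
INSERT IGNORE INTO t2 VALUES (1, 'a');
```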

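Diagnosing and resolving flush lock contention

1. The flush tables experiment

When Connection 2 runs flush tables, Connection 3 is blocked while Connection 4 completes successfully.

Use the sys.session view to see what each session is doing; show processlist can be used as well. The view's default output is sorted by execution time in descending order, which makes problems such as flush lock contention easy to spot.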

[root@GreatSQL][test]>select thd_id,conn_id,state,current_statement,statement_latency from sys.session where command='Query';
+--------+---------+-------------------------+-------------------------------------------------------------------+-------------------+
| thd_id | conn_id | state                   | current_statement                                                 | statement_latency |
+--------+---------+-------------------------+-------------------------------------------------------------------+-------------------+
|    116 |      61 | User sleep              | select count(*),sleep(100) from t1                                | 10.81 s           |
|    117 |      62 | Waiting for table flush | flush tables                                                      | 8.15 s            |
|    109 |      57 | Waiting for table flush | select count(*) from t1                                           | 3.91 s            |
|    118 |      63 | NULL                    | select thd_id,conn_id,state,cu ... .session where command='Query' | 71.49 ms          |
+--------+---------+-------------------------+-------------------------------------------------------------------+-------------------+
4 rows in set (0.07 sec)

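The output shows that the flush tables session is in the Waiting for table flush state and that a longer-running query precedes it. That query is what prevents flush tables from completing. The third query is also in Waiting for table flush, which shows that the flush tables statement is in turn blocking other queries.

Connection 4 is unaffected, which shows that flush tables does not affect reads and writes on other tables.

When waiting for the flush lock becomes a problem, it means that one or more queries are preventing the flush tables statement from obtaining the flush lock. Because flush tables needs an exclusive lock, it in turn blocks later sessions that request shared or exclusive locks on the affected tables.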
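The blocking chain described above suggests a simple filter: list the statements that have been running at least as long as the oldest session in Waiting for table flush. The query below is a sketch under that assumption. It only produces candidates, since a long-running statement may touch unrelated tables, and the time column is reported in whole seconds.

```sql
-- Candidate blockers: statements older than the oldest flush waiter.
SELECT s.conn_id,
       s.user,
       s.db,
       s.state,
       s.time AS running_s,
       s.current_statement
FROM sys.session AS s
WHERE s.command = 'Query'
  -- exclude the waiters themselves
  AND (s.state IS NULL OR s.state <> 'Waiting for table flush')
  -- keep only statements at least as old as the oldest waiter;
  -- if nobody is waiting the subquery is NULL and no rows are returned
  AND s.time >= (SELECT MAX(w.time)
                 FROM sys.session AS w
                 WHERE w.state = 'Waiting for table flush')
  -- exclude this diagnostic session
  AND s.conn_id <> CONNECTION_ID()
ORDER BY s.time DESC;
```

Run against the sample output in this section, the filter would keep the select count(*),sleep(100) session, because it has been running longer than the flush tables session that is waiting behind it.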

Interrupt the flush tables session manually with Ctrl+C, then query the sessions again.

[root@GreatSQL][test]>select thd_id,conn_id,state,current_statement,statement_latency from sys.session where command='Query';
+--------+---------+-------------------------+-------------------------------------------------------------------+-------------------+
| thd_id | conn_id | state                   | current_statement                                                 | statement_latency |
+--------+---------+-------------------------+-------------------------------------------------------------------+-------------------+
|    116 |      61 | User sleep              | select count(*),sleep(100) from t1                                | 20.10 s           |
|    109 |      57 | Waiting for table flush | select count(*) from t1                                           | 13.19 s           |
|    118 |      63 | NULL                    | select thd_id,conn_id,state,cu ... .session where command='Query' | 68.14 ms          |
+--------+---------+-------------------------+-------------------------------------------------------------------+-------------------+
3 rows in set (0.07 sec)

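The output shows that the query blocked by flush tables is still blocked. At this point the way to resolve the problem is to end the slow query that was blocking the flush tables session in the first place.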
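Ending the slow query can be done with the KILL statement. The sketch below uses conn_id 61 only because that is the connection running the sleep(100) query in the sample output; on another system the id has to be read from sys.session first. Note that KILL expects the connection id (conn_id), not the thread id (thd_id).

```sql
-- Terminate only the running statement and keep the connection open.
KILL QUERY 61;

-- Alternative: terminate the whole connection.
-- KILL CONNECTION 61;

-- Afterwards, confirm that nothing is left in 'Waiting for table flush'.
SELECT conn_id, state, current_statement, statement_latency
FROM sys.session
WHERE state = 'Waiting for table flush';
```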

2. The flush table with read lock experiment

When Connection 2 runs flush table with read lock, Connection 3 is blocked. In Connection 4 the select succeeds, but the insert is blocked.

Use the sys.session view to see what each session is doing:

[root@GreatSQL][test]>select thd_id,conn_id,state,current_statement,statement_latency from sys.session where command='Query';
+--------+---------+------------------------------+-------------------------------------------------------------------+-------------------+
| thd_id | conn_id | state                        | current_statement                                                 | statement_latency |
+--------+---------+------------------------------+-------------------------------------------------------------------+-------------------+
|    116 |      61 | User sleep                   | select count(*),sleep(100) from t1                                | 52.74 s           |
|    117 |      62 | Waiting for table flush      | flush table with read lock                                        | 26.36 s           |
|    109 |      57 | Waiting for table flush      | select count(*) from t1                                           | 22.00 s           |
|    124 |      69 | Waiting for global read lock | insert into t2 values(8,'b')                                      | 6.01 s            |
|    118 |      63 | NULL                         | select thd_id,conn_id,state,cu ... .session where command='Query' | 82.90 ms          |
+--------+---------+------------------------------+-------------------------------------------------------------------+-------------------+
5 rows in set (0.09 sec)

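The output shows that the flush table with read lock session is in the Waiting for table flush state, and so is Connection 3, while Connection 4 is in the Waiting for global read lock state.

Interrupt the flush table with read lock session manually with Ctrl+C, then query the sessions again.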

[root@GreatSQL][test]>select thd_id,conn_id,state,current_statement,statement_latency from sys.session where command='Query';
+--------+---------+------------------------------+-------------------------------------------------------------------+-------------------+
| thd_id | conn_id | state                        | current_statement                                                 | statement_latency |
+--------+---------+------------------------------+-------------------------------------------------------------------+-------------------+
|    116 |      61 | User sleep                   | select count(*),sleep(100) from t1                                | 1.37 min          |
|    109 |      57 | Waiting for table flush      | select count(*) from t1                                           | 51.58 s           |
|    124 |      69 | Waiting for global read lock | insert into t2 values(8,'b')                                      | 35.57 s           |
|    118 |      63 | NULL                         | select thd_id,conn_id,state,cu ... .session where command='Query' | 65.26 ms          |
+--------+---------+------------------------------+-------------------------------------------------------------------+-------------------+
4 rows in set (0.06 sec)

Connection 3 and Connection 4 are both still blocked.

After the query in Connection 1 finishes, check the sessions again:

[root@GreatSQL][test]>select thd_id,conn_id,state,current_statement,statement_latency from sys.session where command='Query';
+--------+---------+------------------------------+-------------------------------------------------------------------+-------------------+
| thd_id | conn_id | state                        | current_statement                                                 | statement_latency |
+--------+---------+------------------------------+-------------------------------------------------------------------+-------------------+
|    124 |      69 | Waiting for global read lock | insert into t2 values(8,'b')                                      | 57.44 s           |
|    118 |      63 | executing                    | select thd_id,conn_id,state,cu ... .session where command='Query' | 2.06 ms           |
+--------+---------+------------------------------+-------------------------------------------------------------------+-------------------+
2 rows in set (0.06 sec)

Connection 3 completes successfully, but the insert in Connection 4 is still blocked.

Only after unlock tables is run explicitly does the insert in Connection 4 complete.

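Conclusions:

The two experiments show that, when diagnosing flush lock contention, any session in the Waiting for table flush state means that a table flush has been requested, whether or not a flush tables session is still visible. Any slow query that started before the sessions in Waiting for table flush is a likely cause of the subsequent blocking.

The flush tables with read lock statement has to acquire the global read lock. While it waits for the lock, the symptoms are much the same as for flush tables, with these differences: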

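1. flush tables with read lock blocks writes to all tables, both while it waits for the lock and after it obtains the lock. flush tables holds its lock only while it executes, and it does not block writes to tables other than those used by the long-running query.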

2. flush tables with read lock must be released explicitly with unlock tables, whereas flush tables needs no such step.
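The two differences above can be seen side by side in a short session transcript. This is an illustrative sketch of the statement sequence only, to be run on a test instance; the comments describe the behavior reported in this article.

```sql
-- Plain flush: the lock is held only while the statement runs.
FLUSH TABLES;
-- Once the statement returns, nothing further needs to be released.

-- Flush with read lock: writes to all tables stay blocked after the
-- statement returns, until the same session releases the lock.
FLUSH TABLES WITH READ LOCK;
-- ... read or copy data here while writes are held back ...
UNLOCK TABLES;
```

Closing the session that holds the global read lock also releases it, so a forgotten unlock tables does not survive a disconnect.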

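Why are later queries still blocked after the flush table or flush tables with read lock session has ended?

The cause is a stale table version in the table definition cache (TDC). Both commands close all open tables and push the table version up (refresh_version + 1). Because the long-running query thread still exists, the old table cannot be closed, so every access to it sees an old version and waits for the table cache flush. Pushing refresh_version up is irreversible: even if the session that issued flush table or flush tables with read lock is interrupted, the effect of the table flush it started remains.

This symptom also has little to do with the isolation level. I tested both READ COMMITTED and REPEATABLE READ, and the symptoms were the same.

Besides being issued manually, these two commands are also issued by the mysqldump tool during a backup.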

Which mysqldump options trigger flush tables

With the general log enabled, repeated dump tests showed that the following cases trigger the flush tables command:

1. When --flush-logs and --single-transaction are used together, flush tables and flush tables with read lock are triggered.

2. When --source-data is used without --single-transaction, FLUSH /*!40101 LOCAL */ TABLES and FLUSH TABLES WITH READ LOCK are triggered.

3. When --flush-logs, --single-transaction, and --source-data are all used together, FLUSH /*!40101 LOCAL */ TABLES and FLUSH TABLES WITH READ LOCK are triggered.
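The general log check behind the list above can be sketched as follows. The article does not say how the log was configured, so writing it to the mysql.general_log table is an assumption made here for easy querying. Changing these variables requires administrative privileges, and the general log adds overhead, so this belongs on a test instance.

```sql
-- Remember the current settings so they can be restored afterwards.
SET @old_general_log = @@GLOBAL.general_log;
SET @old_log_output  = @@GLOBAL.log_output;

-- Assumption: log to a table rather than a file.
SET GLOBAL log_output  = 'TABLE';
SET GLOBAL general_log = ON;

-- ... run mysqldump with the option combination under test ...

-- List the FLUSH statements that were received.
-- argument is stored as a binary column, hence the CONVERT.
SELECT event_time,
       thread_id,
       CONVERT(argument USING utf8mb4) AS statement_text
FROM mysql.general_log
WHERE command_type = 'Query'
  AND CONVERT(argument USING utf8mb4) LIKE 'FLUSH%'
ORDER BY event_time;

-- Restore the previous settings.
SET GLOBAL general_log = @old_general_log;
SET GLOBAL log_output  = @old_log_output;
```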

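DBAs should be familiar with the effect of each backup tool option and of each option combination, and should avoid running backups during business peak hours wherever possible.

Conclusion

flush tables closes all open tables and forces every table in use to be closed. A table object that is in use cannot be closed, however, so the flush tables operation is blocked by the SQL requests that are currently running, and SQL requests issued after flush tables are in turn blocked by the flush tables session. Even if the flush tables session is cancelled, the requests issued after it remain blocked. So when many sessions show Waiting for table flush, whether or not a flush tables command is still present, the fix is to find the slow queries that have been running longer than the waiting sessions and kill them.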
