events scheduler导致MGR节点退出详解及修复

数据运维 2023-08-15 张二河手机阅读

问题复现

执行如下命令：

mysql> SET GLOBAL EVENT_SCHEDULER = OFF;Query OK, 0 rows affected (0.01 sec)mysql> CREATE EVENT e1 ON SCHEDULE EVERY 1 SECOND ENDS NOW() + INTERVAL 1 SECOND DO SELECT 1; Query OK, 0 rows affected (0.01 sec)mysql> SET GLOBAL EVENT_SCHEDULER = ON;Query OK, 0 rows affected (0.04 sec)mysql> show events;Empty set (0.01 sec)

正常情况下，这个events会被自动删除。如果在MGR集群上执行上面的命令，则根据编译版本不同表现不同。 Debug版本：直接ASSERT

#0 __GI_raise (sig=...) at raise.c:50#1 0x00007fc873279535 in __GI_abort () at abort.c:79#2 0x00007fc87327940f in __assert_fail_base (fmt=..., assertion=..., file=..., line=..., function=...) at assert.c:92#3 0x00007fc873287102 in __GI___assert_fail (assertion=..., file=..., line=..., function=...) at assert.c:101#4 0x000055e3f8f44bab in Disable_autocommit_guard::~Disable_autocommit_guard (this=..., __in_chrg=...) at thd_raii.h:79#5 0x000055e3f985561f in Event_queue::recalculate_activation_times (this=..., thd=...) at event_queue.cc:390#6 0x000055e3f985b4ab in Event_scheduler::run (this=..., thd=...) at event_scheduler.cc:568#7 0x000055e3f985a372 in event_scheduler_thread (arg=...) at event_scheduler.cc:279#8 0x000055e3faee89df in pfs_spawn_thread (arg=...) at pfs.cc:2899#9 0x00007fc87341efa3 in start_thread (arg=...) at pthread_create.c:486#10 0x00007fc87334feff in clone () at clone.S:95

Release版本：导致MGR集群状态异常，节点退出MGR集群。

2022-06-22T14:32:18.412045Z 15 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'

问题分析

调试Debug版本，很容易发现导致assert的原因在recalculate_activation_times函数中:

recalculate_activation_times {ret = trans_commit_stmt(thd) || trans_commit(thd);}

是由于trans_commit_stmt事务提交失败，导致Disable_autocommit_guard中的事务状态判断失败。 trans_commit_stmt失败的原因是group_replication_trans_before_commit失败。

#0 group_replication_trans_before_commit (param=...) at observer_trans.cc:587#1 0x0000559358f2cb04 in Trans_delegate::before_commit (this=..., thd=..., all=..., trx_cache_log=..., stmt_cache_log=..., cache_log_max_size=..., is_atomic_ddl_arg=...) at rpl_handler.cc:610#2 0x0000559358e8d33d in MYSQL_BIN_LOG::commit (this=..., thd=..., all=...) at binlog.cc:8454#3 0x0000559357909667 in ha_commit_trans (thd=..., all=..., ignore_global_read_lock=...) at handler.cc:1801#4 0x00005593582613c7 in trans_commit_stmt (thd=..., ignore_global_read_lock=...) at transaction.cc:532#5 0x000055935840f56c in Event_queue::recalculate_activation_times (this=..., thd=...) at event_queue.cc:447#6 0x00005593584154ab in Event_scheduler::run (this=..., thd=...) at event_scheduler.cc:568

熟悉MGR代码的知道，后台applier线程才是事件处理的真正线程，用户线程失败肯定是applier线程处理出现问题导致，applier线程出现问题的地方在set_transaction_ctx中，其调用堆栈为：

#0 0x000056386186e909 in set_transaction_ctx (transaction_termination_ctx=...) at rpl_transaction_ctx.cc:107#1 0x00007f8e5220afad in Certification_handler::handle_transaction_id (this=..., pevent=..., cont=...) at certification_handler.cc:308#2 0x00007f8e5220a349 in Certification_handler::handle_event (this=..., pevent=..., cont=...) at certification_handler.cc:127#3 0x00007f8e5220982a in Event_handler::next (this=..., event=..., continuation=...) at pipeline_interfaces.h:716#4 0x00007f8e5220f20e in Event_cataloger::handle_event (this=..., pevent=..., cont=...) at event_cataloger.cc:53#5 0x00007f8e521b7621 in Applier_module::inject_event_into_pipeline (this=..., pevent=..., cont=...) at applier.cc:258#6 0x00007f8e521b801b in Applier_module::apply_data_packet (this=..., data_packet=..., fde_evt=..., cont=...) at applier.cc:388#7 0x00007f8e521b8fba in Applier_module::applier_thread_handle (this=...) at applier.cc:613#8 0x00007f8e521b692e in launch_handler_thread (arg=...) at applier.cc:50#9 0x00005638634d39df in pfs_spawn_thread (arg=...) at pfs.cc:2899#10 0x00007f8e8d37efa3 in start_thread (arg=...) at pthread_create.c:486

set_transaction_ctx代码逻辑如下https://gitee.com/mirrors/mysql-server/blob/8.0/sql/rpl_transaction_ctx.cc#L90：

int set_transaction_ctx( Transaction_termination_ctx transaction_termination_ctx) { DBUG_TRACE; DBUG_PRINT("enter", ("thread_id=%lu, rollback_transaction=%d, " "generated_gtid=%d, sidno=%d, gno=%" PRId64, transaction_termination_ctx.m_thread_id, transaction_termination_ctx.m_rollback_transaction, transaction_termination_ctx.m_generated_gtid, transaction_termination_ctx.m_sidno, transaction_termination_ctx.m_gno)); uint error = ER_NO_SUCH_THREAD; Find_thd_with_id find_thd_with_id(transaction_termination_ctx.m_thread_id); THD_ptr thd_ptr = Global_THD_manager::get_instance()->find_thd(&find_thd_with_id); if (thd_ptr) { error = thd_ptr->get_transaction() ->get_rpl_transaction_ctx() ->set_rpl_transaction_ctx(transaction_termination_ctx); } return error;}

这里搜索的thd_ptr为nullptr，导致了返回了ER_NO_SUCH_THREAD错误。从而导致了认证失败。为什么会导致无法搜索到对应的thd呢？因为提交这个事务的线程是event_scheduler_thread事件调度后台线程，而线程搜索算法是忽略这种后台线程。https://gitee.com/mirrors/mysql-server/blob/8.0/sql/mysqld_thd_manager.cc#L59

bool Find_thd_with_id::operator()(THD *thd) { if (thd->get_command() == COM_DAEMON) return false; return (thd->thread_id() == m_thread_id);}

综上所述，原因可以总结如下：