问题复现
执行如下命令:
mysql> SET GLOBAL EVENT_SCHEDULER = OFF;Query OK, 0 rows affected (0.01 sec)mysql> CREATE EVENT e1 ON SCHEDULE EVERY 1 SECOND ENDS NOW() + INTERVAL 1 SECOND DO SELECT 1; Query OK, 0 rows affected (0.01 sec)mysql> SET GLOBAL EVENT_SCHEDULER = ON;Query OK, 0 rows affected (0.04 sec)mysql> show events;Empty set (0.01 sec)
正常情况下,这个events会被自动删除。 如果在MGR集群上执行上面的命令,则根据编译版本不同表现不同。 Debug版本:直接ASSERT
#0 __GI_raise (sig=...) at raise.c:50#1 0x00007fc873279535 in __GI_abort () at abort.c:79#2 0x00007fc87327940f in __assert_fail_base (fmt=..., assertion=..., file=..., line=..., function=...) at assert.c:92#3 0x00007fc873287102 in __GI___assert_fail (assertion=..., file=..., line=..., function=...) at assert.c:101#4 0x000055e3f8f44bab in Disable_autocommit_guard::~Disable_autocommit_guard (this=..., __in_chrg=...) at thd_raii.h:79#5 0x000055e3f985561f in Event_queue::recalculate_activation_times (this=..., thd=...) at event_queue.cc:390#6 0x000055e3f985b4ab in Event_scheduler::run (this=..., thd=...) at event_scheduler.cc:568#7 0x000055e3f985a372 in event_scheduler_thread (arg=...) at event_scheduler.cc:279#8 0x000055e3faee89df in pfs_spawn_thread (arg=...) at pfs.cc:2899#9 0x00007fc87341efa3 in start_thread (arg=...) at pthread_create.c:486#10 0x00007fc87334feff in clone () at clone.S:95
Release版本:导致MGR集群状态异常,节点退出MGR集群。
2022-06-22T14:32:18.412045Z 15 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
问题分析
调试Debug版本,很容易发现导致assert的原因在recalculate_activation_times
函数中:
recalculate_activation_times {ret = trans_commit_stmt(thd) || trans_commit(thd);}
是由于trans_commit_stmt
事务提交失败,导致Disable_autocommit_guard
中的事务状态判断失败。 trans_commit_stmt
失败的原因是group_replication_trans_before_commit
失败。
#0 group_replication_trans_before_commit (param=...) at observer_trans.cc:587#1 0x0000559358f2cb04 in Trans_delegate::before_commit (this=..., thd=..., all=..., trx_cache_log=..., stmt_cache_log=..., cache_log_max_size=..., is_atomic_ddl_arg=...) at rpl_handler.cc:610#2 0x0000559358e8d33d in MYSQL_BIN_LOG::commit (this=..., thd=..., all=...) at binlog.cc:8454#3 0x0000559357909667 in ha_commit_trans (thd=..., all=..., ignore_global_read_lock=...) at handler.cc:1801#4 0x00005593582613c7 in trans_commit_stmt (thd=..., ignore_global_read_lock=...) at transaction.cc:532#5 0x000055935840f56c in Event_queue::recalculate_activation_times (this=..., thd=...) at event_queue.cc:447#6 0x00005593584154ab in Event_scheduler::run (this=..., thd=...) at event_scheduler.cc:568
熟悉MGR代码的知道,后台applier线程才是事件处理的真正线程,用户线程失败肯定是applier线程处理出现问题导致,applier线程出现问题的地方在set_transaction_ctx
中,其调用堆栈为:
#0 0x000056386186e909 in set_transaction_ctx (transaction_termination_ctx=...) at rpl_transaction_ctx.cc:107#1 0x00007f8e5220afad in Certification_handler::handle_transaction_id (this=..., pevent=..., cont=...) at certification_handler.cc:308#2 0x00007f8e5220a349 in Certification_handler::handle_event (this=..., pevent=..., cont=...) at certification_handler.cc:127#3 0x00007f8e5220982a in Event_handler::next (this=..., event=..., continuation=...) at pipeline_interfaces.h:716#4 0x00007f8e5220f20e in Event_cataloger::handle_event (this=..., pevent=..., cont=...) at event_cataloger.cc:53#5 0x00007f8e521b7621 in Applier_module::inject_event_into_pipeline (this=..., pevent=..., cont=...) at applier.cc:258#6 0x00007f8e521b801b in Applier_module::apply_data_packet (this=..., data_packet=..., fde_evt=..., cont=...) at applier.cc:388#7 0x00007f8e521b8fba in Applier_module::applier_thread_handle (this=...) at applier.cc:613#8 0x00007f8e521b692e in launch_handler_thread (arg=...) at applier.cc:50#9 0x00005638634d39df in pfs_spawn_thread (arg=...) at pfs.cc:2899#10 0x00007f8e8d37efa3 in start_thread (arg=...) at pthread_create.c:486
set_transaction_ctx
代码逻辑如下https://gitee.com/mirrors/mysql-server/blob/8.0/sql/rpl_transaction_ctx.cc#L90:
int set_transaction_ctx( Transaction_termination_ctx transaction_termination_ctx) { DBUG_TRACE; DBUG_PRINT("enter", ("thread_id=%lu, rollback_transaction=%d, " "generated_gtid=%d, sidno=%d, gno=%" PRId64, transaction_termination_ctx.m_thread_id, transaction_termination_ctx.m_rollback_transaction, transaction_termination_ctx.m_generated_gtid, transaction_termination_ctx.m_sidno, transaction_termination_ctx.m_gno)); uint error = ER_NO_SUCH_THREAD; Find_thd_with_id find_thd_with_id(transaction_termination_ctx.m_thread_id); THD_ptr thd_ptr = Global_THD_manager::get_instance()->find_thd(&find_thd_with_id); if (thd_ptr) { error = thd_ptr->get_transaction() ->get_rpl_transaction_ctx() ->set_rpl_transaction_ctx(transaction_termination_ctx); } return error;}
这里搜索的thd_ptr为nullptr,导致了返回了ER_NO_SUCH_THREAD
错误。从而导致了认证失败。 为什么会导致无法搜索到对应的thd呢?因为提交这个事务的线程是event_scheduler_thread
事件调度后台线程,而线程搜索算法是忽略这种后台线程。https://gitee.com/mirrors/mysql-server/blob/8.0/sql/mysqld_thd_manager.cc#L59
bool Find_thd_with_id::operator()(THD *thd) { if (thd->get_command() == COM_DAEMON) return false; return (thd->thread_id() == m_thread_id);}
综上所述,原因可以总结如下:
- event_scheduler_thread 后台事件调度线程触发了COMMIT操作
- COMMIT操作经过MGR的applier模块
- applier模块通过Global_THD_manager无法找到后台THD
- applier模块异常 修复思路: 修改线程搜索算法,支持Global_THD_manager可以搜索daemon thread。
反馈社区
打开bugs.mysql.com
- Report a bug
- 填写Bug相关描述信息
- 给出复现步骤
- 给出修改建议
- 附加可以提交测例文件以及修复patch
https://bugs.mysql.com/bug.php?id=107635
Enjoy GreatSQL 🙂