从库延迟案例分析

从库延迟案例分析

背景介绍

近来一套业务系统,从库一直处于延迟状态,无法追上主库,导致业务风险较大。从资源上看,从库的CPU、IO、网络使用率较低,不存在服务器压力过高导致回放慢的情况;从库开启了并行回放;在从库上执行show processlist看到没有回放线程阻塞,回放一直在持续;解析relay-log日志文件,发现其中并没大事务回放。

过程分析

现象确认

收到运维同事的反馈,有一套从库延迟的非常厉害,提供了show slave status延迟的截图信息

持续观察了一阵show slave status的变化,发现pos点位信息在不停的变化,Seconds_Behind_master也是不停的变化的,总体趋势还在不停的变大。

资源使用

观察了服务器资源使用情况,可以看到占用非常低

观察从库进程情况,基本上只能看到有一个线程在回放工作

并行回放参数说明

在主库设置了binlog_transaction_dependency_tracking=WRITESET

在从库设置了slave_parallel_type=LOGICAL_CLOCKslave_parallel_workers=64

error log日志对比

从error log中取并行回放的日志进行分析

$ grep 010559 100werror3306.log | tail -n 3 2024-01-31T14:07:50.172007+08:00 6806 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'cluster': seconds elapsed = 120; events assigned = 3318582273; worker queues filled over overrun level = 207029; waite d due a Worker queue full = 238; waited due the total size = 0; waited at clock conflicts = 348754579743300 waited (count) when Workers occupied = 34529247 waited when Workers occupied = 76847369713200 2024-01-31T14:09:50.078829+08:00 6806 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'cluster': seconds elapsed = 120; events assigned = 3319256065; worker queues filled over overrun level = 207029; waite d due a Worker queue full = 238; waited due the total size = 0; waited at clock conflicts = 348851330164000 waited (count) when Workers occupied = 34535857 waited when Workers occupied = 76866419841900 2024-01-31T14:11:50.060510+08:00 6806 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'cluster': seconds elapsed = 120; events assigned = 3319894017; worker queues filled over overrun level = 207029; waite d due a Worker queue full = 238; waited due the total size = 0; waited at clock conflicts = 348943740455400 waited (count) when Workers occupied = 34542790 waited when Workers occupied = 76890229805500