背景:
数据库版本:3.2.3.3
数据库架构:2-2-2,指定了primary_zone为其中一个zone,所以主节点是两台server。
前几天从某天开始,每天下午有一个OceanBase数据库的一个主节点cpu都会被被打满,持续了三天,通过processlist没办法定位到问题sql,cpu高了之后导致在执行的sql都变慢,一直没有定位到导致cpu高的sql,业务反馈也没有新上业务。(因为处理问题有些信息没有截图和记录只能文字描述)
问题1定位:
在第三天下午cpu被打爆,导致主zone一个节点被打爆,切主之后,新的主zone的一个server的cpu迅速也被打满,宕掉的server起来之后,调整副本策略,设置两个zone为primary_zone,4台server为主节点,查询process_list这个节点上的sql,抓到问题sql。
统计process_list的信息语句如下
select user,sql_id,avg(retry_cnt),avg(time),count(*),id from __all_virtual_processlist where svr_ip='111.11.10.111' and sql_id <> '' group by user,sql_id order by count(*) desc limit 10;
抓到几个这种sql,这是一个案例,有很多sql但都是这两个表关联,联系业务停下这个业务,cpu恢复,问题sql定位。
SELECT A.SSSID SSSID, A.SSSSSNUM SSSSSNUM, A.NNNTYPR NNNTYPR, A.CITY CITY, B.CCCID CCCID, B.CCCNAME CCCNAME, B.CCCETYPE CCCETYPE, B.CCCDDR CCCDDR FROM USER A, CUSTOWNER B WHERE A.CITY = B.CITY AND B.CERTID = ? AND A.STATUS = 'UCCC' AND A.CCCID = B.CCCID AND A.CCCID = A.USERID AND B.STATUS = 'stTTC' AND A.ACTIVE = 1 AND A.NNNTYPR = 'GSG' AND B.CUSTTYPE = 'pppSDE' ;
看了下都是USER A, CUSTOWNER B这两张表关联的sql,检查执行计划是A关联B走nl,将谓词条件拆出来看下ab表的数据量。
SELECT count(*) from USER A where A.STATUS = 'UCCC' AND A.CCCID = A.USERID AND A.ACTIVE = 1 AND A.NNNTYPR = 'GSG' ;
SELECT count(*) from CUSTOWNER B WHERE B.CERTID = ? AND B.STATUS = 'stTTC' AND B.CUSTTYPE = 'pppSDE' ;
结果b表出来只有几条数据,a表数据量比较大,先把抓到的这些sql绑定执行计划
/*+leading(b,a) use_nl(a)*/
有几个复杂的sql可能单独调整下hint如/*+no_expand leading(b,a) use_nl(a)*/,去explain extended看下结果的hintdata然后绑定下outline。
/*+
LEADING(@"SEL$2" ("ccc.B"@"SEL$2" "ccc.A"@"SEL$2" ))
USE_NL(@"SEL$2" ("ccc.A"@"SEL$2" ))
LEADING(@"SEL$3" ("ccc.B"@"SEL$3" "ccc.A"@"SEL$3" ))
USE_NL(@"SEL$3" ("ccc.A"@"SEL$3" ))
NO_EXPAND(@"SEL$2")
*/
调整完这几个sql之后,需要定位计划走错的原因。
检查下有没有手工收集统计信息,ob每天合并会自动收集存储层面的统计信息,手工收集统计信息之后,就只会使用手工收集的统计信息。
obclient> select * from all_tab_statistics where LAST_ANALYZED is not null and rownum <10;
+-------+----------------------------+----------------+--------------------+-------------------+-----------------------+-------------+----------+--------+--------------+-----------+-----------+-------------+---------------------------+---------------------+-------------------+---------------------+---------------+----------------+---------------------+-----------+-------------+---------------------+--------------+------------+-----------------+-------------+-------+
| OWNER | TABLE_NAME | PARTITION_NAME | PARTITION_POSITION | SUBPARTITION_NAME | SUBPARTITION_POSITION | OBJECT_TYPE | NUM_ROWS | BLOCKS | EMPTY_BLOCKS | AVG_SPACE | CHAIN_CNT | AVG_ROW_LEN | AVG_SPACE_FREELIST_BLOCKS | NUM_FREELIST_BLOCKS | AVG_CACHED_BLOCKS | AVG_CACHE_HIT_RATIO | IM_IMCU_COUNT | IM_BLOCK_COUNT | IM_STAT_UPDATE_TIME | SCAN_RATE | SAMPLE_SIZE | LAST_ANALYZED | GLOBAL_STATS | USER_STATS | STATTYPE_LOCKED | STALE_STATS | SCOPE |
+-------+----------------------------+----------------+--------------------+-------------------+-----------------------+-------------+----------+--------+--------------+-----------+-----------+-------------+---------------------------+---------------------+-------------------+---------------------+---------------+----------------+---------------------+-----------+-------------+---------------------+--------------+------------+-----------------+-------------+-------+
| TBCS | CUSTOWNER | NULL | NULL | NULL | NULL | TABLE | 31534293 | NULL | NULL | NULL | NULL | 892 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2023-06-16 09:39:37 | NO | NO | NULL | NO | NULL |
| TBCS | USER | NULL | NULL | NULL | NULL | TABLE | 23824652 | NULL | NULL | NULL | NULL | 1112 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2023-06-16 09:45:42 | NO | NO | NULL | NO | NULL |
检查发现这两个表有手工收集的统计信息,但是列上没有,检查分区all_part_col_statistics统计信息也没有,询问同事前几天有做手工收集统计信息,收集之后是周末,第三天也就是周一下午cpu开始飙高的。收集语句如下
CALL dbms_stats.gather_table_stats('user', 'tbl1', granularity=>'GLOBAL', method_opt=>'FOR ALL COLUMNS SIZE 128',degree=>'24', estimate_percent=> '100');
这个语句是没错的,但是DBA_TAB_STATS_HISTORY没有查到信息,与研发沟通,说可能有bug,类似表中有乱码等情况收集采样比100%的话有可能会失败,导致后续操作失败。
先删除手工收集的统计信息
call dbms_stats.delete_table_stats(user,table);
问题2定位:
在处理完问题1之后第三天下午该库的主节点一个cpu又打到了70多。检查process_list。
select substr(info, 1, 20) ,avg(time),count(*) from __all_virtual_processlist
where tenant='crm1' and state='ACTIVE' and command<>'Stmt fetch' and svr_ip='111.15.10.122' group by substr(info, 1, 20) order by count(*) desc;
+----------------------+-----------+----------+
| substr(info, 1, 20) | avg(time) | count(*) |
+----------------------+-----------+----------+
| SELECT CCID, CITY , | 0.7576 | 33 |
可以看到类似sql常量不一样,但是有大量的执行。检查下执行效率和执行计划。
select sql_id,
svr_ip,
hit_count,
executions,
avg_exe_usec,
slowest_exe_usec,
plan_id,
type,
last_active_time
from gv$plan_cache_plan_stat
where tenant_id = 1001
and last_active_time > '2023-06-23 14:30:00.000000'
and sql_id='EFE4363A9EC4841EFE8715E80A3109E7'
order by executions desc,hit_count desc limit 10;
+----------------------------------+---------------+-----------+------------+--------------+------------------+----------+------+----------------------------+
| sql_id | svr_ip | hit_count | executions | avg_exe_usec | slowest_exe_usec | plan_id | type | last_active_time |
+----------------------------------+---------------+-----------+------------+--------------+------------------+----------+------+----------------------------+
| EFE4363A9EC4841EFE8715E80A3109E7 | 111.15.10.121 | 374 | 375 | 2179325 | 3547953 | 9889148 | 2 | 2023-06-23 17:36:56.397197 |
| EFE4363A9EC4841EFE8715E80A3109E7 | 111.15.10.122 | 374 | 375 | 2177694 | 3547009 | 28629655 | 1 | 2023-06-23 17:36:56.395365 |
| EFE4363A9EC4841EFE8715E80A3109E7 | 111.15.10.122 | 336 | 337 | 2194577 | 3595186 | 28623540 | 1 | 2023-06-23 17:37:04.149222 |
| EFE4363A9EC4841EFE8715E80A3109E7 | 111.15.10.122 | 4 | 5 | 2066376 | 2275247 | 28588358 | 1 | 2023-06-23 16:09:20.864789 |
| EFE4363A9EC4841EFE8715E80A3109E7 | 111.15.10.122 | 2 | 3 | 1620046 | 1712033 | 28604983 | 1 | 2023-06-23 16:09:55.446330 |
| EFE4363A9EC4841EFE8715E80A3109E7 | 111.15.10.121 | 1 | 2 | 2631324 | 3251847 | 9883622 | 2 | 2023-06-23 16:10:18.512018 |
| EFE4363A9EC4841EFE8715E80A3109E7 | 111.15.10.122 | 1 | 2 | 1820933 | 1848855 | 28618588 | 1 | 2023-06-23 16:10:32.327687 |
+----------------------------------+---------------+-----------+------------+--------------+------------------+----------+------+----------------------------+
7 rows in set (0.25 sec)
select plan_id, operator, name, rows, cost from gv$plan_cache_plan_explain where tenant_id=1001 and ip='111.15.10.122' and port=2882 and plan_id=28629655;
+----------+-----------------+------------------+------+--------+
| plan_id | operator | name | rows | cost |
+----------+-----------------+------------------+------+--------+
| 28629655 | PHY_SORT | NULL | 27 | 437728 |
| 28629655 | PHY_TABLE_SCAN | CCC_AA_OD | 27 | 437702 |
+----------+-----------------+------------------+------+--------+
2 rows in set (0.01 sec)
这类sql走全表扫,平均执行时间2s。
看下这类里的一个sql,单看这个语句,第一眼确认感觉不好创建索引,筛选性强的列都有函数或者表达式。
SELECT XXXXXXX FROM CCC_AA_OD WHERE CITY = 222 AND STATUS IN(0, 4) AND (NVL(ECETIMS, 0) = 0 OR (NVL(ECETIMS, 0) > 0 AND (SYSDATE - EXEDATE)*24*60 > ?)) AND MOD(ROWDDNUMBER, 20) = 9 ORDER BY ORRRTIME;
研发给了一个建议
obclient> CREATE INDEX IDX_CCC_AA_OD_01 ON TCCC.CCC_AA_OD(CITY, ORRRTIME, STATUS, ECETIMS, EXEDATE, ROWDDNUMBER);
再检查下,速度快了不少,检查gv$sql_audit刚执行完的语句基本都是100ms
select sql_id,
svr_ip,
hit_count,
executions,
avg_exe_usec,
slowest_exe_usec,
plan_id,
type,
first_load_time,
last_active_time
from gv$plan_cache_plan_stat
where tenant_id = 1001
and last_active_time > '2023-06-23 14:30:00.000000'
and sql_id='3FA024B7294BB415B3EC2B87D19E9DE3'
order by executions desc,hit_count desc limit 10;
+----------------------------------+---------------+-----------+------------+--------------+------------------+----------+------+----------------------------+----------------------------+
| sql_id | svr_ip | hit_count | executions | avg_exe_usec | slowest_exe_usec | plan_id | type | first_load_time | last_active_time |
+----------------------------------+---------------+-----------+------------+--------------+------------------+----------+------+----------------------------+----------------------------+
| 3FA024B7294BB415B3EC2B87D19E9DE3 | 111.15.10.122 | 7 | 8 | 150569 | 170092 | 31476168 | 1 | 2023-06-23 18:56:46.544428 | 2023-06-23 18:59:40.959179 |
| 3FA024B7294BB415B3EC2B87D19E9DE3 | 111.15.10.121 | 6 | 7 | 166505 | 186997 | 10815017 | 2 | 2023-06-23 18:59:29.975739 | 2023-06-23 19:02:13.314771 |
| 3FA024B7294BB415B3EC2B87D19E9DE3 | 111.15.10.122 | 6 | 7 | 142653 | 162827 | 31503109 | 1 | 2023-06-23 18:59:29.984177 | 2023-06-23 19:02:13.296264 |
+----------------------------------+---------------+-----------+------------+--------------+------------------+----------+------+----------------------------+----------------------------+
6 rows in set (0.28 sec)
MySQL [oceanbase]> select plan_id, operator, name, rows, cost from gv$plan_cache_plan_explain where tenant_id=1001 and ip='111.15.10.122' and port=2882 and plan_id=31476168;
+----------+----------------+-------------------------------------------+------+-------+
| plan_id | operator | name | rows | cost |
+----------+----------------+-------------------------------------------+------+-------+
| 31476168 | PHY_TABLE_SCAN | CS_BAT_ORDERPROD(IDX_CCC_AA_OD_01) | 76 | 93729 |
+----------+----------------+-------------------------------------------+------+-------+
问题分析:
1.ob中执行手工收集统计信息的过程是这样,先收集表的统计信息,再收集列的统计信息,最后把kvcache中的统计信息的缓存刷出去,但是我们这次问题的手工收集统计信息看起来是收集完表的统计信息之后出了问题,没有收集列的统计信息,没有刷缓存,所以手工收集完统计信息后两天没有出问题。
2.问题2的业务因为到了业务的一个活动时间,平时执行量是1000,活动时间业务量增加到80w,所以cache被打满,统计信息被刷出缓存,问题1再执行就会再从内部表取统计信息,这时候统计信息就是手工收集的没有列信息的统计信息,导致问题1sql执行计划变化,两个问题集中一起,把cpu打爆了。
3.问题2的sql优化,为什么要这么建索引,正常CITY,ROWDDNUMBER这两个列就可以,因为当时要求快速处理对数据没仔细分析。解释下为什么这么创建,CITY虽然筛选性不强,但是等值,对于数据量很小的CITY效果很好,ROWDDNUMBER该列筛选性很强但是where条件中是MOD(ROWDDNUMBER, 20)的谓词形式,正常创建单列ROWDDNUMBER索引肯定走不上,创建CITY,ROWDDNUMBER组合索引后,就可以走上该索引,MOD(ROWDDNUMBER, 20)的条件会在索引层面就进行过滤,因为该列过滤性强,回表代价小,效果比全表扫好很多。
结论:
雪崩之下没有一片雪花是无辜的,平稳的系统下可能已经暗流涌动了,所以我们一方面要提升自己能力,能够多做一些事情,另一方面要做好系统的保障和优化,尽量把问题扼杀再摇篮里。
行之所向,莫问远方。