背景
20230511日中午12:30某OceanBase 3.2.3.3的核心业务系统主节点cpu打满
应急处理
1.CPU打满首先排查当前在执行的sql,找到怀疑sql
MySQL [oceanbase]> select user,sql_id,avg(retry_cnt),avg(time),count(*),id from __all_virtual_processlist where tenant='obsc4' group by user,sql_id order by count(*) desc limit 10;
+-----------+----------------------------------+----------------+-------------+----------+------------+
| user | sql_id | avg(retry_cnt) | avg(time) | count(*) | id |
+-----------+----------------------------------+----------------+-------------+----------+------------+
| USER01 | | 0.0103 | 2326.2287 | 26591 | 3222268431 |
| USER012 | | 0.0000 | 13033.9948 | 194 | 3222084708 |
| USER01 | 665C557D49877E315A74446D7772F8D3 | 0.0000 | 62.9873 | 158 | 3222247595 | --只有这条sql在执行,量最大
| USER013 | | 0.0000 | 28230.9273 | 55 | 3222195580 |
| USER014 | | 0.0000 | 29632.5833 | 36 | 3222167150 |
| USER015 | | 0.0000 | 125545.6923 | 26 | 3222042880 |
| SYS | | 0.0000 | 28792.2500 | 12 | 3221714226 |
| USER01 | 04FA368F47CD0EF9D266DD4D4A55FBB3 | 0.0000 | 0.0000 | 10 | 3222142031 |
| USER01 | 2DB583A135F59A813B028EECE8BE513C | 0.0000 | 0.0000 | 8 | 3222051957 |
| USER01 | FC7AE7BE1042830034BD9BB6E1290BA8 | 0.2500 | 0.0000 | 4 | 3222087848 |
+-----------+----------------------------------+----------------+-------------+----------+------------+
2.应急先杀session降cpu,因CPU满了可能杀的较慢,不确定执行节点的情况下,也可以放出svi_ip查到执行节点
select distinct concat('kill ',id,';') from __all_virtual_processlist where sql_id='665C557D49877E315A74446D7772F8D3';
分析过程
1.查看怀疑sql是否有问题,先看下sql
MySQL [oceanbase]> select
-> query_Sql
-> from gv$plan_cache_plan_stat where sql_id IN ('665C557D49877E315A74446D7772F8D3' ) order by sql_id;
SELECT 1
FROM DUAL
WHERE EXISTS (SELECT 1
FROM (SELECT (CASE
WHEN T.SSSOPENSCENCE = '4' THEN
(SELECT 1
FROM CCCATT A
WHERE A.AAATYPE = 'cscscs'
AND A.AAAID = ?
AND A.RRRTYPE = 'MMM22'
AND A.VALUE IN ( SELECT DICTID FROM DDDD_IIII WHERE GROUPID='MMM22Dict' AND STATUS=1 )
AND ROWNUM < 2)
ELSE
1
END) VAL
FROM GGGGGG_MMMM_SSS T
WHERE T.SBSID = ?) B
WHERE B.VAL = 1)
OR NOT EXISTS
(SELECT 1 FROM GGGGGG_MMMM_SSS WHERE SBSID = ?)
2.查看历史执行情况
MySQL [oceanbase]> select
-> sql_id,
-> hit_count,
-> executions,
-> avg_exe_usec,
-> slowest_exe_usec,
-> plan_id,
-> type,
-> last_active_time
-> from gv$plan_cache_plan_stat where sql_id IN ('665C557D49877E315A74446D7772F8D3' ) order by sql_id;
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+
| sql_id | hit_count | executions | avg_exe_usec | slowest_exe_usec | plan_id | type | last_active_time |
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+
| 665C557D49877E315A74446D7772F8D3 | 155 | 165 | 136677078 | 195528081 | 196569413 | 3 | 2023-05-11 12:31:03.559336 | --平均执行时间两分钟了
| 665C557D49877E315A74446D7772F8D3 | 187 | 200 | 137414269 | 191674925 | 6670378 | 3 | 2023-05-11 12:31:05.717038 |
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+
3.可以看到历史该sql执行时间较长,查看执行计划
MySQL [oceanbase]> select
-> plan_id,
-> operator,
-> name,
-> rows,
-> cost
-> from __all_virtual_plan_cache_plan_explain where svr_ip ='134.85.20.39' and svr_port =2882 and tenant_id=1001 and plan_id =6670378;
+---------+-------------------------------+--------------------------------------------------------+------+------+
| plan_id | operator | name | rows | cost |
+---------+-------------------------------+--------------------------------------------------------+------+------+
| 6670378 | PHY_SUBPLAN_FILTER | NULL | 1 | 480 |
| 6670378 | PHY_EXPR_VALUES | NULL | 1 | 0 |
| 6670378 | PHY_LIMIT | NULL | 1 | 434 |
| 6670378 | PHY_SUBPLAN_FILTER | NULL | 1 | 434 |
| 6670378 | PHY_TABLE_SCAN | T(IDX_GRPGRPGRP_SBSID) | 1 | 91 |
| 6670378 | PHY_LIMIT | NULL | 1 | 343 |
| 6670378 | PHY_PX_FIFO_COORD | NULL | 1 | 343 |
| 6670378 | PHY_PX_REDUCE_TRANSMIT | NULL | 1 | 342 |
| 6670378 | PHY_LIMIT | NULL | 1 | 342 |
| 6670378 | PHY_NESTED_LOOP_JOIN | NULL | 1 | 342 |
| 6670378 | PHY_PX_FIFO_RECEIVE | NULL | 1 | 319 |
| 6670378 | PHY_PX_DIST_TRANSMIT | NULL | 1 | 319 |
| 6670378 | PHY_TABLE_SCAN | SA_DB_DICDIC(INX_SA_DB_DICDIC_GROUPID) | 1 | 319 |
| 6670378 | PHY_GRANULE_ITERATOR | NULL | 1 | 32 |
| 6670378 | PHY_TABLE_SCAN | A(IDX_CCCATT_UNION) | 1 | 32 |
| 6670378 | PHY_TABLE_SCAN | GGGGGG_MMMM_SSS(IDX_GRPGRPGRP_SBSID) | 1 | 45 |
+---------+-------------------------------+--------------------------------------------------------+------+------+
4.可以看到执行计划中都走了索引,那么从两个方向排查,1、链接方式有问题2、索引选择有问题,查看表上索引
select index_name, listagg(column_name, ',') within group (order by column_position)
from all_ind_columns
where table_name = upper('SA_DB_DICDIC')
group by index_name;
+----------------------------+-------------------------------------------------------------+
| INDEX_NAME | LISTAGG(COLUMN_NAME,',')WITHINGROUP(ORDERBYCOLUMN_POSITION) |
+----------------------------+-------------------------------------------------------------+
| INX_SA_DB_DICDIC_GROUPID | GROUPID |
| PK_SA_DB_DICDIC | DICTID,GROUPID |
+----------------------------+-------------------------------------------------------------+
select index_name, listagg(column_name, ',') within group (order by column_position)
from all_ind_columns
where table_name = upper('CCCATT')
group by index_name;
+-------------------------------+-------------------------------------------------------------+
| INDEX_NAME | LISTAGG(COLUMN_NAME,',')WITHINGROUP(ORDERBYCOLUMN_POSITION) |
+-------------------------------+-------------------------------------------------------------+
| IDX_CCCATT | AAAID,RRRTYPE,REGION |
| IDX_CCCATT_RRRTYPEVALUE | RRRTYPE,VALUE |
| IDX_CCCATT_UNION | AAATYPE,RRRTYPE,VALUE |
+-------------------------------+-------------------------------------------------------------+
5.查看各表谓词信息过滤性
SELECT count(DICTID) FROM DDDD_IIII WHERE GROUPID='MMM22Dict' AND STATUS=1;
+----------+
| COUNT(DICTID) |
+----------+
| 6 |
+----------+
1 rows in set (0.001 sec)
SELECT COUNT(*),COUNT(DISTINCT AAAID),count(distinct AAATYPE)
FROM CCCATT A
WHERE A.AAATYPE = 'cscscs'
AND A.RRRTYPE = 'MMM22'
AND A.VALUE IN ( SELECT DICTID FROM DDDD_IIII WHERE GROUPID='MMM22Dict' AND STATUS=1 );
+----------+--------------------------+---------------------------+
| COUNT(*) | COUNT(DISTINCTAAAID) | COUNT(DISTINCTAAATYPE) |
+----------+--------------------------+---------------------------+
| 1693324 | 1693317 | 1 |
+----------+--------------------------+---------------------------+
1 row in set (12.419 sec)
SELECT COUNT(*),COUNT(SBSID) FROM GGGGGG_MMMM_SSS;
+----------+---------------+
| COUNT(*) | COUNT(SBSID) |
+----------+---------------+
| 1434578 | 1434578 |
+----------+---------------+
1 row in set (0.571 sec)
6.因为DDDD_IIII表从上面结果看是小表,而且扫描很快基本可以排除链接问题,但是A表用了IDX_CCCATT_UNION索引,从前面查询筛选性上很差不如IDX_CCCATT索引基本可以直接过滤掉99.999的数据;
从上面过滤性分析中可以看出AAAID条件过滤性很强,所以我随即带入一个数值应该就能复现,测试下
obclient [USER01]> SELECT 1
-> FROM DUAL
-> WHERE EXISTS (SELECT 1
-> FROM (SELECT (CASE
-> WHEN T.SSSOPENSCENCE = '4' THEN
-> (SELECT /*+index(A IDX_CCCATT)*/ 1
-> FROM CCCATT A
-> WHERE A.AAATYPE = 'cscscs'
-> AND A.AAAID = 555555555555
-> AND A.RRRTYPE = 'MMM22'
-> AND A.VALUE IN ( SELECT DICTID FROM DDDD_IIII WHERE GROUPID='MMM22Dict' AND STATUS=1 )
-> AND ROWNUM < 2)
-> ELSE
-> 1
-> END) VAL
-> FROM GGGGGG_MMMM_SSS T
-> WHERE T.SBSID = 555555555555) B
-> WHERE B.VAL = 1)
-> OR NOT EXISTS
-> (SELECT 1 FROM GGGGGG_MMMM_SSS WHERE SBSID = 555555555555);
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.014 sec)
可以看到效率很快
7.抓下我这时时的outline data直接绑定上就行
select /*+parallel(32)*/ SQL_ID from gv$sql_audit where tenant_id = 1001 and query_sql like '%A.AAAID = 555555555555%' order by elapsed_time desc limit 10\G
SQL_ID: D4EA6C918CFA742C8C31F7C94E36EAD5
select outline_id,sql_id, OUTLINE_DATA from gv$plan_cache_plan_stat where sql_id ='D4EA6C918CFA742C8C31F7C94E36EAD5'\G
*************************** 1. row ***************************
outline_id: -1
sql_id: D4EA6C918CFA742C8C31F7C94E36EAD5
OUTLINE_DATA: /*+ BEGIN_OUTLINE_DATA INDEX(@"SEL$3" "USER01.T"@"SEL$3" "IDX_GRPGRPGRP_SBSID") LEADING(@"SEL$7" ("USER01.A"@"SEL$7" "USER013.DDDD_IIII"@"SEL$7" )) USE_NL(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" )) PQ_DISTRIBUTE(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" ) BC2HOST NONE) NO_USE_NL_MATERIALIZATION(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" )) INDEX(@"SEL$7" "USER01.A"@"SEL$7" "IDX_CCCATT") INDEX(@"SEL$7" "USER013.SA_DB_DICDIC"@"SEL$7" "PK_SA_DB_DICDIC") INDEX(@"SEL$6" "USER01.GGGGGG_MMMM_SSS"@"SEL$6" "IDX_GRPGRPGRP_SBSID") END_OUTLINE_DATA*/
业务租户的业务账号登陆数据库,绑定outline
obclient [USER01]> create OUTLINE OLT_665C557D49877E315A74446D7772F8D3 ON '665C557D49877E315A74446D7772F8D3' USING HINT
-> /*+ BEGIN_OUTLINE_DATA INDEX(@"SEL$3" "USER01.T"@"SEL$3" "IDX_GRPGRPGRP_SBSID") LEADING(@"SEL$7" ("USER01.A"@"SEL$7" "USER013.DDDD_IIII"@"SEL$7" )) USE_NL(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" )) PQ_DISTRIBUTE(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" ) BC2HOST NONE) NO_USE_NL_MATERIALIZATION(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" )) INDEX(@"SEL$7" "USER01.A"@"SEL$7" "IDX_CCCATT") INDEX(@"SEL$7" "USER013.SA_DB_DICDIC"@"SEL$7" "PK_SA_DB_DICDIC") INDEX(@"SEL$6" "USER01.GGGGGG_MMMM_SSS"@"SEL$6" "IDX_GRPGRPGRP_SBSID") END_OUTLINE_DATA*/;
Query OK, 0 rows affected (0.037 sec)
sys租户刷下plancache
MySQL [oceanbase]> alter system flush plan cache sql_id='665C557D49877E315A74446D7772F8D3' tenant='obsc4' global;
8.检查下当前效率
MySQL [oceanbase]> select
-> sql_id,
-> hit_count,
-> executions,
-> avg_exe_usec,
-> slowest_exe_usec,
-> plan_id,
-> type,
-> last_active_time
-> from gv$plan_cache_plan_stat where sql_id IN ('665C557D49877E315A74446D7772F8D3' ) order by sql_id;
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+
| sql_id | hit_count | executions | avg_exe_usec | slowest_exe_usec | plan_id | type | last_active_time |
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+
| 665C557D49877E315A74446D7772F8D3 | 51 | 52 | 4393| 11333| 196923939| 3 | 2023-05-11 12:56:00.133583 |
| 665C557D49877E315A74446D7772F8D3 | 63 | 64 | 2803| 9929| 6682408| 3 | 2023-05-11 12:56:00.061327 |
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+
总结
当核心业务出现问题时候,一定要快刀斩乱麻,先恢复。在业务无人认领sql的情况下,影响业务的sql杀掉还会再上来,所以需要尽快优化,对于瓶颈点无法细致分析可以进行尽善尽美的优化,那么就需要先把关键问题搞定,主大计者,必执简以御繁。
行之所向,莫问远方。