记录一则oceanbase数据库cpu打满处理及分析过程

2024年 5月 7日 84.9k 0

背景

20230511日中午12:30某OceanBase 3.2.3.3的核心业务系统主节点cpu打满

应急处理

1.CPU打满首先排查当前在执行的sql,找到怀疑sql

MySQL [oceanbase]> select user,sql_id,avg(retry_cnt),avg(time),count(*),id from __all_virtual_processlist where tenant='obsc4' group by user,sql_id order by count(*) desc limit 10;
+-----------+----------------------------------+----------------+-------------+----------+------------+
| user      | sql_id                           | avg(retry_cnt) | avg(time)   | count(*) | id         |
+-----------+----------------------------------+----------------+-------------+----------+------------+
| USER01    |                                  |         0.0103 |   2326.2287 |    26591 | 3222268431 |
| USER012   |                                  |         0.0000 |  13033.9948 |      194 | 3222084708 |
| USER01    | 665C557D49877E315A74446D7772F8D3 |         0.0000 |     62.9873 |      158 | 3222247595 |  --只有这条sql在执行,量最大
| USER013   |                                  |         0.0000 |  28230.9273 |       55 | 3222195580 |
| USER014   |                                  |         0.0000 |  29632.5833 |       36 | 3222167150 |
| USER015   |                                  |         0.0000 | 125545.6923 |       26 | 3222042880 |
| SYS       |                                  |         0.0000 |  28792.2500 |       12 | 3221714226 |
| USER01    | 04FA368F47CD0EF9D266DD4D4A55FBB3 |         0.0000 |      0.0000 |       10 | 3222142031 |
| USER01    | 2DB583A135F59A813B028EECE8BE513C |         0.0000 |      0.0000 |        8 | 3222051957 |
| USER01    | FC7AE7BE1042830034BD9BB6E1290BA8 |         0.2500 |      0.0000 |        4 | 3222087848 |
+-----------+----------------------------------+----------------+-------------+----------+------------+

2.应急先杀session降cpu,因CPU满了可能杀的较慢,不确定执行节点的情况下,也可以放出svi_ip查到执行节点

select distinct concat('kill ',id,';') from  __all_virtual_processlist where   sql_id='665C557D49877E315A74446D7772F8D3';

分析过程

1.查看怀疑sql是否有问题,先看下sql

MySQL [oceanbase]> select        
    ->  query_Sql
    -> from gv$plan_cache_plan_stat where    sql_id  IN  ('665C557D49877E315A74446D7772F8D3'   ) order by sql_id;

SELECT 1
  FROM DUAL
 WHERE EXISTS (SELECT 1
          FROM (SELECT (CASE
                         WHEN T.SSSOPENSCENCE = '4' THEN
                          (SELECT 1
                             FROM CCCATT A
                            WHERE A.AAATYPE = 'cscscs'
                              AND A.AAAID = ?
                              AND A.RRRTYPE = 'MMM22'
                              AND A.VALUE IN ( SELECT DICTID FROM DDDD_IIII WHERE GROUPID='MMM22Dict' AND STATUS=1 )
                              AND ROWNUM < 2)
                         ELSE
                          1
                       END) VAL
                  FROM GGGGGG_MMMM_SSS T
                 WHERE T.SBSID = ?) B
         WHERE B.VAL = 1)
    OR NOT EXISTS
 (SELECT 1 FROM GGGGGG_MMMM_SSS WHERE SBSID = ?) 

2.查看历史执行情况

MySQL [oceanbase]> select        
    ->  sql_id,       
    ->  hit_count,   
    ->  executions,       
    ->  avg_exe_usec,       
    ->  slowest_exe_usec,       
    ->  plan_id,   
    ->  type,       
    ->  last_active_time  
    -> from gv$plan_cache_plan_stat where    sql_id  IN  ('665C557D49877E315A74446D7772F8D3'   ) order by sql_id;
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+
| sql_id                           | hit_count | executions | avg_exe_usec | slowest_exe_usec | plan_id   | type | last_active_time           |
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+
| 665C557D49877E315A74446D7772F8D3 |       155 |        165 |    136677078 |        195528081 | 196569413 |    3 | 2023-05-11 12:31:03.559336 |  --平均执行时间两分钟了
| 665C557D49877E315A74446D7772F8D3 |       187 |        200 |    137414269 |        191674925 |   6670378 |    3 | 2023-05-11 12:31:05.717038 |
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+

3.可以看到历史该sql执行时间较长,查看执行计划

MySQL [oceanbase]> select        
    ->  plan_id,      
    ->  operator,      
    ->  name,      
    ->  rows,      
    ->  cost        
    -> from         __all_virtual_plan_cache_plan_explain        where svr_ip ='134.85.20.39'        and svr_port =2882        and tenant_id=1001        and plan_id =6670378;
+---------+-------------------------------+--------------------------------------------------------+------+------+
| plan_id | operator                      | name                                                   | rows | cost |
+---------+-------------------------------+--------------------------------------------------------+------+------+
| 6670378 | PHY_SUBPLAN_FILTER            | NULL                                                   |    1 |  480 |
| 6670378 |  PHY_EXPR_VALUES              | NULL                                                   |    1 |    0 |
| 6670378 |  PHY_LIMIT                    | NULL                                                   |    1 |  434 |
| 6670378 |   PHY_SUBPLAN_FILTER          | NULL                                                   |    1 |  434 |
| 6670378 |    PHY_TABLE_SCAN             | T(IDX_GRPGRPGRP_SBSID)                                 |    1 |   91 |
| 6670378 |    PHY_LIMIT                  | NULL                                                   |    1 |  343 |
| 6670378 |     PHY_PX_FIFO_COORD         | NULL                                                   |    1 |  343 |
| 6670378 |      PHY_PX_REDUCE_TRANSMIT   | NULL                                                   |    1 |  342 |
| 6670378 |       PHY_LIMIT               | NULL                                                   |    1 |  342 |
| 6670378 |        PHY_NESTED_LOOP_JOIN   | NULL                                                   |    1 |  342 |
| 6670378 |         PHY_PX_FIFO_RECEIVE   | NULL                                                   |    1 |  319 |
| 6670378 |          PHY_PX_DIST_TRANSMIT | NULL                                                   |    1 |  319 |
| 6670378 |           PHY_TABLE_SCAN      | SA_DB_DICDIC(INX_SA_DB_DICDIC_GROUPID)                 |    1 |  319 |
| 6670378 |         PHY_GRANULE_ITERATOR  | NULL                                                   |    1 |   32 |
| 6670378 |          PHY_TABLE_SCAN       | A(IDX_CCCATT_UNION)                                    |    1 |   32 |
| 6670378 |  PHY_TABLE_SCAN               | GGGGGG_MMMM_SSS(IDX_GRPGRPGRP_SBSID)                   |    1 |   45 |
+---------+-------------------------------+--------------------------------------------------------+------+------+

4.可以看到执行计划中都走了索引,那么从两个方向排查,1、链接方式有问题2、索引选择有问题,查看表上索引

select index_name, listagg(column_name, ',') within group (order by column_position)
from all_ind_columns
where table_name = upper('SA_DB_DICDIC')
group by index_name;
+----------------------------+-------------------------------------------------------------+
| INDEX_NAME                 | LISTAGG(COLUMN_NAME,',')WITHINGROUP(ORDERBYCOLUMN_POSITION) |
+----------------------------+-------------------------------------------------------------+
| INX_SA_DB_DICDIC_GROUPID |   GROUPID                                                     |
| PK_SA_DB_DICDIC            | DICTID,GROUPID                                              |
+----------------------------+-------------------------------------------------------------+
select index_name, listagg(column_name, ',') within group (order by column_position)
from all_ind_columns
where table_name = upper('CCCATT')
group by index_name;
+-------------------------------+-------------------------------------------------------------+
| INDEX_NAME                    | LISTAGG(COLUMN_NAME,',')WITHINGROUP(ORDERBYCOLUMN_POSITION) |
+-------------------------------+-------------------------------------------------------------+
| IDX_CCCATT                    | AAAID,RRRTYPE,REGION                                   |
| IDX_CCCATT_RRRTYPEVALUE       | RRRTYPE,VALUE                                              |
| IDX_CCCATT_UNION              | AAATYPE,RRRTYPE,VALUE                                   |
+-------------------------------+-------------------------------------------------------------+

5.查看各表谓词信息过滤性

SELECT count(DICTID) FROM DDDD_IIII WHERE GROUPID='MMM22Dict' AND STATUS=1;  
 +----------+
 | COUNT(DICTID) |
 +----------+
 |  6       |
 +----------+
1 rows in set (0.001 sec)
SELECT COUNT(*),COUNT(DISTINCT AAAID),count(distinct AAATYPE)
                             FROM CCCATT A
                            WHERE A.AAATYPE = 'cscscs'
                              AND A.RRRTYPE = 'MMM22'
                              AND A.VALUE IN ( SELECT DICTID FROM DDDD_IIII WHERE GROUPID='MMM22Dict' AND STATUS=1 );
+----------+--------------------------+---------------------------+
| COUNT(*) | COUNT(DISTINCTAAAID) | COUNT(DISTINCTAAATYPE) |
+----------+--------------------------+---------------------------+
|  1693324 |                  1693317 |                         1 |
+----------+--------------------------+---------------------------+
1 row in set (12.419 sec)		  


SELECT COUNT(*),COUNT(SBSID)   FROM GGGGGG_MMMM_SSS;
+----------+---------------+
| COUNT(*) | COUNT(SBSID) |
+----------+---------------+
|  1434578 |       1434578 |
+----------+---------------+
1 row in set (0.571 sec)

6.因为DDDD_IIII表从上面结果看是小表,而且扫描很快基本可以排除链接问题,但是A表用了IDX_CCCATT_UNION索引,从前面查询筛选性上很差不如IDX_CCCATT索引基本可以直接过滤掉99.999的数据;

从上面过滤性分析中可以看出AAAID条件过滤性很强,所以我随即带入一个数值应该就能复现,测试下

obclient [USER01]> SELECT 1
    ->       FROM DUAL
    ->      WHERE EXISTS (SELECT 1
    ->               FROM (SELECT (CASE
    ->                              WHEN T.SSSOPENSCENCE = '4' THEN
    ->                               (SELECT /*+index(A IDX_CCCATT)*/ 1
    ->                                  FROM CCCATT A
    ->                                 WHERE A.AAATYPE = 'cscscs'
    ->                                   AND A.AAAID = 555555555555
    ->                                   AND A.RRRTYPE = 'MMM22'
    ->                                   AND A.VALUE IN ( SELECT DICTID FROM DDDD_IIII WHERE GROUPID='MMM22Dict' AND STATUS=1 )
    ->                                   AND ROWNUM < 2)
    ->                              ELSE
    ->                               1
    ->                            END) VAL
    ->                       FROM GGGGGG_MMMM_SSS T
    ->                      WHERE T.SBSID = 555555555555) B
    ->              WHERE B.VAL = 1)
    ->         OR NOT EXISTS
    ->      (SELECT 1 FROM GGGGGG_MMMM_SSS WHERE SBSID = 555555555555);
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.014 sec)

可以看到效率很快

7.抓下我这时时的outline data直接绑定上就行

select /*+parallel(32)*/ SQL_ID      from gv$sql_audit      where tenant_id = 1001       and query_sql like '%A.AAAID = 555555555555%'       order by elapsed_time desc limit 10\G

SQL_ID: D4EA6C918CFA742C8C31F7C94E36EAD5
 
select  outline_id,sql_id,  OUTLINE_DATA from gv$plan_cache_plan_stat where sql_id ='D4EA6C918CFA742C8C31F7C94E36EAD5'\G
*************************** 1. row ***************************
  outline_id: -1
      sql_id: D4EA6C918CFA742C8C31F7C94E36EAD5
OUTLINE_DATA: /*+ BEGIN_OUTLINE_DATA INDEX(@"SEL$3" "USER01.T"@"SEL$3" "IDX_GRPGRPGRP_SBSID") LEADING(@"SEL$7" ("USER01.A"@"SEL$7" "USER013.DDDD_IIII"@"SEL$7" )) USE_NL(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" )) PQ_DISTRIBUTE(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" ) BC2HOST NONE) NO_USE_NL_MATERIALIZATION(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" )) INDEX(@"SEL$7" "USER01.A"@"SEL$7" "IDX_CCCATT") INDEX(@"SEL$7" "USER013.SA_DB_DICDIC"@"SEL$7" "PK_SA_DB_DICDIC") INDEX(@"SEL$6" "USER01.GGGGGG_MMMM_SSS"@"SEL$6" "IDX_GRPGRPGRP_SBSID") END_OUTLINE_DATA*/

业务租户的业务账号登陆数据库,绑定outline

obclient [USER01]> create OUTLINE OLT_665C557D49877E315A74446D7772F8D3 ON '665C557D49877E315A74446D7772F8D3' USING HINT 
    ->  /*+ BEGIN_OUTLINE_DATA INDEX(@"SEL$3" "USER01.T"@"SEL$3" "IDX_GRPGRPGRP_SBSID") LEADING(@"SEL$7" ("USER01.A"@"SEL$7" "USER013.DDDD_IIII"@"SEL$7" )) USE_NL(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" )) PQ_DISTRIBUTE(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" ) BC2HOST NONE) NO_USE_NL_MATERIALIZATION(@"SEL$7" ("USER013.DDDD_IIII"@"SEL$7" )) INDEX(@"SEL$7" "USER01.A"@"SEL$7" "IDX_CCCATT") INDEX(@"SEL$7" "USER013.SA_DB_DICDIC"@"SEL$7" "PK_SA_DB_DICDIC") INDEX(@"SEL$6" "USER01.GGGGGG_MMMM_SSS"@"SEL$6" "IDX_GRPGRPGRP_SBSID") END_OUTLINE_DATA*/;
Query OK, 0 rows affected (0.037 sec)

sys租户刷下plancache

MySQL [oceanbase]>  alter system flush plan cache sql_id='665C557D49877E315A74446D7772F8D3' tenant='obsc4' global;

8.检查下当前效率

MySQL [oceanbase]> select        
    ->  sql_id,       
    ->  hit_count,   
    ->  executions,       
    ->  avg_exe_usec,       
    ->  slowest_exe_usec,       
    ->  plan_id,   
    ->  type,       
    ->  last_active_time  
    -> from gv$plan_cache_plan_stat where    sql_id  IN  ('665C557D49877E315A74446D7772F8D3'   ) order by sql_id;
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+
| sql_id                           | hit_count | executions | avg_exe_usec | slowest_exe_usec | plan_id   | type | last_active_time           |
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+
| 665C557D49877E315A74446D7772F8D3 |        51 |        52  |          4393|             11333|  196923939|    3 | 2023-05-11 12:56:00.133583 |
| 665C557D49877E315A74446D7772F8D3 |        63 |        64  |          2803|              9929|    6682408|    3 | 2023-05-11 12:56:00.061327 |
+----------------------------------+-----------+------------+--------------+------------------+-----------+------+----------------------------+

总结

当核心业务出现问题时候,一定要快刀斩乱麻,先恢复。在业务无人认领sql的情况下,影响业务的sql杀掉还会再上来,所以需要尽快优化,对于瓶颈点无法细致分析可以进行尽善尽美的优化,那么就需要先把关键问题搞定,主大计者,必执简以御繁。

行之所向,莫问远方。

相关文章

Oracle如何使用授予和撤销权限的语法和示例
Awesome Project: 探索 MatrixOrigin 云原生分布式数据库
下载丨66页PDF,云和恩墨技术通讯(2024年7月刊)
社区版oceanbase安装
Oracle 导出CSV工具-sqluldr2
ETL数据集成丨快速将MySQL数据迁移至Doris数据库

发布评论