记录一次CPU使用率过高故障分析SQL解决案例

2024年 6月 20日 63.7k 0

记录一次CPU使用率过高故障分析SQL解决案例-1

今天处理一个阿里云ECS上自建Oracle数据库,CPU飙升到100%的问题,把整个过程分享给各位小伙伴,希望能帮助到您。

问题现象

早上接到业务系统运维人员电话,告知系统很卡无法运行。问题数据库是一台在阿里云ECS上自建的Oracle数据库。登录到操作系统上排查,先把操作系统环境看一下。4cpu16G内存100G的磁盘还有剩余。操作系统环境基本上没有问题。

Last login: Thu Jun 20 09:03:44 2024 from 223.166.7.46

Welcome to Alibaba Cloud Elastic Compute Service !

[root@iZctgmtjcrowglZ ~]# su - oracle
[oracle@iZctgmtjcrowglZ ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
Stepping: 1
CPU MHz: 2499.996
BogoMIPS: 4999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 40960K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt nt_good arat
[oracle@iZctgmtjcrowglZ ~]$ free -h
total used free shared buff/cache available
Mem: 15G 614M 5.3G 430M 9.5G 14G
Swap: 8.0G 0B 8.0G
[oracle@iZctgmtjcrowglZ ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 7.7G 0 7.7G 0% /dev
tmpfs 7.8G 431M 7.3G 6% /dev/shm
tmpfs 7.8G 828K 7.8G 1% /run
tmpfs 7.8G 0 7.8G 0% /sys/fs/cgroup
/dev/vda1 99G 72G 23G 76% /
tmpfs 1.6G 0 1.6G 0% /run/user/54321
tmpfs 1.6G 0 1.6G 0% /run/user/0
[oracle@iZctgmtjcrowglZ ~]$

用oratop检查数据库发现CPU使用率100%,大部分都在等待db file sequential read,再用查看前十位使用CPU的进程,发现很多查询SQL的进程耗CPU。都指向了5w3u3qw4sp938这个sql_id。

记录一次CPU使用率过高故障分析SQL解决案例-2

[oracle@iZctgmtjcrowglZ ~]$ ps aux|head -1;ps aux|grep -v PID|sort -rn -k +3|head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
oracle 5431 69.1 1.3 6559252 211708 ? Rs 08:38 39:11 oracleorcl (LOCAL=NO)
oracle 6075 62.6 0.8 6558228 137344 ? Rs 08:48 29:13 oracleorcl (LOCAL=NO)
oracle 5447 60.9 1.2 6560788 195392 ? Rs 08:38 34:21 oracleorcl (LOCAL=NO)
oracle 6687 53.7 0.8 6558232 141320 ? Ss 08:57 19:47 oracleorcl (LOCAL=NO)
oracle 6077 52.4 1.6 6562856 268668 ? Rs 08:48 24:26 oracleorcl (LOCAL=NO)
oracle 7447 46.2 1.0 6559252 177336 ? Rs 09:07 12:28 oracleorcl (LOCAL=NO)
oracle 6689 42.9 1.1 6560804 186488 ? Rs 08:57 15:50 oracleorcl (LOCAL=NO)
oracle 7454 37.2 1.6 6561316 264144 ? Ss 09:07 10:02 oracleorcl (LOCAL=NO)
oracle 7812 4.6 1.3 6554140 214660 ? Ss 09:12 1:01 oracleorcl (LOCAL=NO)
oracle 9232 1.0 0.0 155464 3896 pts/3 R+ 09:34 0:00 ps aux
[oracle@iZctgmtjcrowglZ ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Thu Jun 20 09:35:13 2024

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> select sql_id from v$session where paddr in (select addr from v$process where spid in ('5447','7454','5431','6075','6689'));

SQL_ID
-------------
5w3u3qw4sp938
5w3u3qw4sp938
5w3u3qw4sp938
5w3u3qw4sp938
5w3u3qw4sp938

SQL> SELECT count(*) from v$session where sql_id = '5w3u3qw4sp938';

COUNT(*)
----------
8

SQL>

问题分析

既然已经定位到SQL,那查出具体SQL语句和绑定变量,再用绑定变量看一下执行计划,带入常量发现执行了1个多小时还未出结果,选择取消。

SQL> select sql_text from v$sql where sql_id='5w3u3qw4sp938';

SQL_TEXT
--------------------------------------------------------------------------------
SELECT COUNT(1) FROM SCM_SALBILL_HDR SSH,INPT_SCM_GEN_BILL ISGB WHERE SSH.GENINI
D = ISGB.ID AND SSH.GENINBILLNO = :B2 AND SSH.GENIOTYPE = :B1 AND SSH.STOPFLAG =
'00'

SQL> set lineszie 200
SQL> col name for a20
SQL> col value_string for a40
SQL> select name,datatype_string,value_string from V$SQL_BIND_CAPTURE where SQL_ID='5w3u3qw4sp938';

NAME DATATYPE_STRING VALUE_STRING
-------------------- ------------------------------ ----------------------------------------
:B2 VARCHAR2(128) 240619-118247943692633
:B1 NUMBER 22201

SQL> conn CMSDTP
Enter password:
Connected.
SQL> VARIABLE B1 NUMBER;
SQL> VARIABLE B2 VARCHAR2;
SQL> SELECT COUNT(1) FROM SCM_SALBILL_HDR SSH,INPT_SCM_GEN_BILL ISGB WHERE SSH.GENINID = ISGB.ID AND SSH.GENINBILLNO = :B2 AND SSH.GENIOTYPE = :B1 AND SSH.STOPFLAG = '00';

COUNT(1)
----------
0

Execution Plan
----------------------------------------------------------
Plan hash value: 343233996

--------------------------------------------------------------------------------
----------------------

| Id | Operation | Name | Rows | Bytes | C
ost (%CPU)| Time |

--------------------------------------------------------------------------------
----------------------

| 0 | SELECT STATEMENT | | 1 | 18 |
0 (0)| 00:00:01 |

| 1 | SORT AGGREGATE | | 1 | 18 |
| |

| 2 | NESTED LOOPS | | 1 | 18 |
0 (0)| 00:00:01 |

| 3 | NESTED LOOPS | | 1 | 18 |
0 (0)| 00:00:01 |

| 4 | INDEX FULL SCAN | PK_INPT_SCM_GEN_BILL | 1 | 13 |
0 (0)| 00:00:01 |

|* 5 | INDEX RANGE SCAN | IDX_SCM_SALBILL_GIO | 1 | |
0 (0)| 00:00:01 |

|* 6 | TABLE ACCESS BY INDEX ROWID| SCM_SALBILL_HDR | 1 | 5 |
0 (0)| 00:00:01 |

--------------------------------------------------------------------------------
----------------------

Predicate Information (identified by operation id):
---------------------------------------------------

5 - access("SSH"."GENIOTYPE"=TO_NUMBER(:B1))
6 - filter("SSH"."GENINID" IS NOT NULL AND "SSH"."GENINBILLNO"=:B2 AND
"SSH"."STOPFLAG"='00' AND "SSH"."GENINID"="ISGB"."ID")

Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
339 consistent gets
333 physical reads
0 redo size
525 bytes sent via SQL*Net to client
520 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL> SET AUTOT OFF
SQL> SELECT COUNT(1) FROM SCM_SALBILL_HDR SSH,INPT_SCM_GEN_BILL ISGB WHERE SSH.GENINID = ISGB.ID AND SSH.GENINBILLNO = '240619-118247943692633' AND SSH.GENIOTYPE = 22201 AND SSH.STOPFLAG = '00';
^CSELECT COUNT(1) FROM SCM_SALBILL_HDR SSH,INPT_SCM_GEN_BILL ISGB WHERE SSH.GENINID = ISGB.ID AND SSH.GENINBILLNO = '240619-118247943692633' AND SSH.GENIOTYPE = 22201 AND SSH.STOPFLAG = '00'
*
ERROR at line 1:
ORA-01013: user requested cancel of current operation

SQL>

同时我们用AWRSQRPT产生报告,可以发现大量CPU Time消耗在IO等待上。同时执行计划也和上面一样。

记录一次CPU使用率过高故障分析SQL解决案例-3

继续查看两个表数据量发现并不大,为什么要执行那么长时间耗那么多CPU?按照上面执行计划,继续查发现IDX_SCM_SALBILL_GIO索引居然没有了。我们再来看看这个表上和SQL相关的字段索引情况。发现和SQL相关的SCM_SALBILL_HDR表上字段有太多索引。后来问开发人员,昨天他们又对这些字段添加索引了。至此我们大致可以判断添加过度的索引,没有做分析表,统计信息缺失导致执行计划错误,产生大量CPU和IO。

SQL> select count(1) from INPT_SCM_GEN_BILL;

COUNT(1)
----------
178760

SQL> select count(1) from SCM_SALBILL_HDR;

COUNT(1)
----------
182387

SQL> select owner,index_name,table_name from dba_indexes where index_name='IDX_SCM_SALBILL_GIO';

no rows selected

SQL> col COLUMN_NAME for a40
SQL> select table_owner,table_name,index_name,column_name from dba_ind_columns where table_owner='CMSDTP' and table_name='SCM_SALBILL_HDR' and column_name in ('GENINID','GENINBILLNO','GENIOTYPE');

TABLE_OWNER TABLE_NAME INDEX_NAME COLUMN_NAME
------------------------------ ------------------------------ ---------------------------------------- ----------------------------------------
CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_GENINID GENINID
CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_GIBN GENINBILLNO
CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GGGS GENINID
CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GGGS GENINBILLNO
CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GGGS GENIOTYPE
CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GEN1 GENIOTYPE
CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GEN1 GENINID
CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GENIOTYPE GENIOTYPE
CMSDTP SCM_SALBILL_HDR IDX_SCM_SALBILL_HDR_GENIOTYPE GENINBILLNO

9 rows selected.

SQL> col OBJECT_NAME for a40
SQL> select owner,object_name,status,created from dba_objects where object_name in (select index_name from dba_ind_columns where table_owner='CMSDTP' and table_name='SCM_SALBILL_HDR' and column_name in ('GENINID','GENINBILLNO','GENIOTYPE'));

OWNER OBJECT_NAME STATUS CREATED
------------------------------ ---------------------------------------- ------- ------------
CMSDTP IDX_SCM_SALBILL_GENINID VALID 22-APR-24
CMSDTP IDX_SCM_SALBILL_GIBN VALID 19-JUN-24
CMSDTP IDX_SCM_SALBILL_HDR_GGGS VALID 19-JUN-24
CMSDTP IDX_SCM_SALBILL_HDR_GEN1 VALID 23-SEP-22
CMSDTP IDX_SCM_SALBILL_HDR_GENIOTYPE VALID 23-SEP-22

SQL>

处理过程

找到问题所在后,清理多余的索引,创建和SQL相关字段合理的索引,再对表分析一下更新统计信息。重新查询秒出结果,解决问题。

SQL> drop index CMSDTP.IDX_SCM_SALBILL_GENINID;

Index dropped.

SQL> drop index CMSDTP.IDX_SCM_SALBILL_GIBN;

Index dropped.

SQL> drop index CMSDTP.IDX_SCM_SALBILL_HDR_GENIOTYPE;

Index dropped.

SQL> drop index CMSDTP.IDX_SCM_SALBILL_HDR_GEN1;

Index dropped.

SQL> drop index CMSDTP.IDX_SCM_SALBILL_HDR_GGGS;

Index dropped.

SQL> select table_owner,table_name,index_owner,index_name,column_name from dba_ind_columns where table_owner='CMSDTP' and table_name='SCM_SALBILL_HDR' and column_name in ('GENINID','GENINBILLNO','GENIOTYPE');

no rows selected

SQL> create index CMSDTP.IDX_SCM_SALBILL_HDR_GEN1 on CMSDTP.SCM_SALBILL_HDR(GENINID,GENIOTYPE);

Index created.

SQL> create index CMSDTP.IDX_SCM_SALBILL_HDR_GENIOTYPE on CMSDTP.SCM_SALBILL_HDR(GENINBILLNO,GENIOTYPE);

Index created.

SQL> conn / as sysdba
Connected.
SQL> exec dbms_stats.gather_table_stats('CMSDTP','SCM_SALBILL_HDR',cascade=>true);

PL/SQL procedure successfully completed.

SQL> conn CMSDTP
Enter password:
Connected.
SQL> SET AUTOTRACE ON
SQL> SELECT COUNT(1) FROM SCM_SALBILL_HDR SSH,INPT_SCM_GEN_BILL ISGB WHERE SSH.GENINID = ISGB.ID AND SSH.GENINBILLNO = '240619-118247943692633' AND SSH.GENIOTYPE = 22201 AND SSH.STOPFLAG = '00';

COUNT(1)
----------
1

Execution Plan
----------------------------------------------------------
Plan hash value: 46481803

----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 48 | 1 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 48 | | |
| 2 | NESTED LOOPS | | 1 | 48 | 1 (0)| 00:00:01 |
| 3 | NESTED LOOPS | | 1 | 48 | 1 (0)| 00:00:01 |
| 4 | INDEX FULL SCAN | PK_INPT_SCM_GEN_BILL | 1 | 13 | 0 (0)| 00:00:01 |
|* 5 | INDEX RANGE SCAN | IDX_SCM_SALBILL_HDR_GEN1 | 1 | | 1 (0)| 00:00:01 |
|* 6 | TABLE ACCESS BY INDEX ROWID| SCM_SALBILL_HDR | 1 | 35 | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

5 - access("SSH"."GENINID"="ISGB"."ID" AND "SSH"."GENIOTYPE"=22201)
6 - filter("SSH"."GENINBILLNO"='240619-118247943692633' AND "SSH"."STOPFLAG"='00')

Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
654515 consistent gets
84959 physical reads
0 redo size
526 bytes sent via SQL*Net to client
520 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

SQL>

CPU使用率也恢复正常

记录一次CPU使用率过高故障分析SQL解决案例-4

总结

本案例还是一条SQL引发的高CPU情况,是开发人员没有根据实际情况添加索引导致。这个案例整体的分析过程不算太繁琐,SQL语句也还算是简单,也没有用到10046事件来跟踪分析。但有几点还是要注意,索引应根据对应的SQL语句还有表的数据量来合理添加,添加后SQL涉及表的统计信息需要重新收集。

相关文章

Oracle如何使用授予和撤销权限的语法和示例
Awesome Project: 探索 MatrixOrigin 云原生分布式数据库
下载丨66页PDF,云和恩墨技术通讯(2024年7月刊)
社区版oceanbase安装
Oracle 导出CSV工具-sqluldr2
ETL数据集成丨快速将MySQL数据迁移至Doris数据库

发布评论