点击上方“IT那活儿”公众号--专注于企业全栈运维技术分享,不管IT什么活儿,干就完了!!!
背景描述
某天,某库发出了一条实例连接异常的语音告警。立即登录该节点进行检查,发现数据库的响应非常缓慢,无法正常登录数据库,此时实例已经处于Hang死状态。情况紧急,手动强制重启实例之后,数据库恢复了正常。
版本信息:
-
操作系统:RHEL7.5
- 数据库:Oracel 19.10.0.0.0(6节点RAC)
处理过程
2.1 问析定位
1)登录检查节点 3 运行情况,发现数据库响应缓慢,存在大量 library cache等待事件
2023-07-24 00:27:38.562 0 library cache: mutex X 559
2023-07-24 00:27:54.366 0 library cache lock 11,181
2023-07-24 00:27:54.366 0 library cache: mutex X 34
2023-07-24 00:28:10.173 0 library cache lock 9,353
2023-07-24 00:28:10.173 0 library cache: mutex X 2,299
2023-07-24 00:28:25.686 0 library cache lock 9,497
2023-07-24 00:28:25.686 0 library cache: mutex X 2,680
2023-07-24 00:28:40.697 0 library cache lock 11,523
2023-07-24 00:28:40.697 0 library cache: mutex X 993
2023-07-24 00:28:55.901 0 library cache lock 5,960
2023-07-24 00:28:55.901 0 library cache: mutex X 6,780
2023-07-24 00:29:11.702 0 library cache lock 9,320
2023-07-24 00:29:11.702 0 library cache: mutex X 3,527
2023-07-24 00:29:26.562 0 library cache lock 10,912
2023-07-24 00:29:26.562 0 library cache: mutex X 2,009
2023-07-24 00:29:41.761 0 library cache lock 3,958
2023-07-24 00:29:41.761 0 library cache: mutex X 9,000
2023-07-24 00:29:56.874 0 library cache lock 10,903
2)检查数据库trace,发现连接已经超出最大限制,初步判断是由 library cache等待阻塞导致连接无法及时处理引发
2023-07-24T00:42:51.117067+08:00
ORA-00020: maximum number of processes (20000) exceeded
ORA-20 errors will not be written to the alert log for
the next minute. Please look at trace files to see all
the ORA-20 errors.
2023-07-24T00:42:54.086832+08:00
2023-07-24T00:43:48.127918+08:00
Process W00J submission failed with error = 20
2023-07-24T00:43:51.128106+08:00
Process W01R submission failed with error = 20
2023-07-24T00:43:52.093439+08:00
ORA-00020: maximum number of processes (20000) exceeded
ORA-20 errors will not be written to the alert log for
the next minute. Please look at trace files to see all
the ORA-20 errors.
3)手动查杀相关会话,等待事件有所下降但效果不明显,考虑到尽快恢复业务,01:43 分左右,手动强制关闭并重启实例,最终恢复正常
2023-07-24T01:54:22.324043+08:00
Instance shutdown complete (OS id: 240400)
2023-07-24T01:54:30.637910+08:00
Starting ORACLE instance (normal) (OS id: 247320)
2023-07-24T01:54:30.766177+08:00
****************************************************
Sys-V shared memory will be used for creating SGA
****************************************************
2023-07-24T01:54:30.767043+08:00
**********************************************************************
2023-07-24T01:54:30.767088+08:00
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)
2023-07-24T01:54:30.767163+08:00
Per process system memlock (soft) limit = 500G
2023-07-24T01:54:30.767204+08:00
Expected per process system memlock (soft) limit to lock
instance MAX SHARED GLOBAL AREA (SGA) into memory: 300G
2023-07-24T01:54:30.767282+08:00
**********************************************************************
4)检查相关事件的历史信息,发现和一条INSERT SQL 语句相关,该语句用于向审计记录表插入审计信息
-- b3853arjnybzv
INSERT INTO AUDSYS.AUD$UNIFIED (AUDIT_TYPE, SESSIONID, PROXY_SESSIONID, OS_USER, HOST_NAME, TERMINAL
, INSTANCE_ID, DBID, AUTHENTICATION_TYPE, USERID, PROXY_USERID, EXTERNAL_USERID, GLOBAL_USERID, CLIE
NT_PROGRAM_NAME, DBLINK_INFO, XS_USER_NAME, XS_SESSIONID, ENTRY_ID, STATEMENT_ID, EVENT_TIMESTAMP, A
CTION, RETURN_CODE, OS_PROCESS, TRANSACTION_ID, SCN, EXECUTION_ID, OBJ_OWNER, OBJ_NAME, CLIENT_IDENT
IFIER, NEW_OWNER, NEW_NAME, OBJECT_EDITION, SYSTEM_PRIVILEGE_USED, SYSTEM_PRIVILEGE, AUDIT_OPTION, O
……
5)检查相关 trace,进一步验证了等待和该 SQL 有关
*** 2023-07-24T01:52:33.940747+08:00
HM: Early Warning - Session ID 8005 serial# 53205 OS PID 119329 (FG)
is waiting on 'library cache lock' for 33 seconds, wait id 769
p1: 'handle address'=0x69f80a2c8, p2: 'lock address'=0x589f111a8, p3: '100*mode+namespace'=0x520002
Blocked by Session ID 22260 serial# 57425 on instance 3
which is waiting on 'library cache lock' for 31 seconds
p1: 'handle address'=0x69f80a2c8, p2: 'lock address'=0x51e3eeaa0, p3: '100*mode+namespace'=0x520003
Final Blocker is Session ID 14289 serial# 52127 on instance 3
which is 'not in a wait' for 0 seconds
IO
Total Self- Total Total Outlr Outlr Outlr
Hung Rslvd Rslvd Wait WaitTm Wait WaitTm Wait
Sess Hangs Hangs Count Secs Count Secs Count Wait Event
------ ------ ------ ------ ------ ------ ------ ------ -----------
0 0 0 286951398 96425517 0 0 0 library cache lock
HM: Current SQL: INSERT INTO AUDSYS.AUD$UNIFIED (AUDIT_TYPE, SESSIONID, PROXY_SESSIONID, OS_USER, HOST_NAME, TERMINAL, INSTANCE_ID, DBID, AUTHENTICATION_TYPE, USERID, PROXY_USERID, EXTERNAL_USERID, GLOBAL_USERID, CLIENT_PROGRAM_NAME, DBLINK_INFO, XS_USER_NAME, XS_SESSIONID, ENTRY_ID, STATEMENT_ID, EVENT_TIMESTAMP, ACTION, RETURN_CODE,
6)检查事件相关参数,同时通过 AWR 报告 LIBRARY CACHE ACTIVITY 数据交叉验证,发现 library cache 大量发生在SQL AREA BUILD,这说明游标有大量重新加载失败现象,最常见的原因是和游标相关的基表(AUD$UNIFIED)执行了 DDL 操作
Namespace
Mode COUNT(*)
--------------- ----------
0x7f0003 1
0x7f0002 38
0x520003 156
0x520002 1544911
NAMESPACE_ID NAMESPACE
---------- -----------------
0 SQL AREA
1 TABLE/PROCEDURE
2 BODY
55 TEMPORARY INDEX
64 EDITION
69 DBLINK
72 OBJECT ID
73 SCHEMA
74 DBINSTANCE
75 SQL AREA STATS
79 ACCOUNT_STATUS
82 SQL AREA BUILD