故障处理:Oracle实例Hang死

2024年 1月 25日 62.4k 0

点击上方“IT那活儿”公众号--专注于企业全栈运维技术分享,不管IT什么活儿,干就完了!!!   

背景描述

某天,某库发出了一条实例连接异常的语音告警。立即登录该节点进行检查,发现数据库的响应非常缓慢,无法正常登录数据库,此时实例已经处于Hang死状态。情况紧急,手动强制重启实例之后,数据库恢复了正常。

版本信息:

  • 操作系统:RHEL7.5

  • 数据库:Oracel 19.10.0.0.0(6节点RAC)

处理过程

2.1 问析定位

1)登录检查节点 3 运行情况,发现数据库响应缓慢,存在大量 library cache等待事件

2023-07-24 00:27:38.562 0 library cache: mutex X 559
2023-07-24 00:27:54.366           0 library cache lock          11,181
2023-07-24 00:27:54.366           0 library cache: mutex X 34
2023-07-24 00:28:10.173           0 library cache lock           9,353
2023-07-24 00:28:10.173           0 library cache: mutex X 2,299
2023-07-24 00:28:25.686           0 library cache lock           9,497
2023-07-24 00:28:25.686           0 library cache: mutex X 2,680
2023-07-24 00:28:40.697           0 library cache lock          11,523
2023-07-24 00:28:40.697           0 library cache: mutex X 993
2023-07-24 00:28:55.901           0 library cache lock           5,960
2023-07-24 00:28:55.901           0 library cache: mutex X 6,780
2023-07-24 00:29:11.702           0 library cache lock           9,320
2023-07-24 00:29:11.702           0 library cache: mutex X 3,527
2023-07-24 00:29:26.562           0 library cache lock          10,912
2023-07-24 00:29:26.562           0 library cache: mutex X 2,009
2023-07-24 00:29:41.761           0 library cache lock           3,958
2023-07-24 00:29:41.761           0 library cache: mutex X 9,000
2023-07-24 00:29:56.874           0 library cache lock          10,903

2)检查数据库trace,发现连接已经超出最大限制,初步判断是由 library cache等待阻塞导致连接无法及时处理引发

2023-07-24T00:42:51.117067+08:00
ORA-00020: maximum number of processes (20000) exceeded
 ORA-20 errors will not be written to the alert log for
 the next minute. Please look at trace files to see all
 the ORA-20 errors.
2023-07-24T00:42:54.086832+08:00
2023-07-24T00:43:48.127918+08:00
Process W00J submission failed with error = 20
2023-07-24T00:43:51.128106+08:00
Process W01R submission failed with error = 20
2023-07-24T00:43:52.093439+08:00
ORA-00020: maximum number of processes (20000) exceeded
 ORA-20 errors will not be written to the alert log for
 the next minute. Please look at trace files to see all
 the ORA-20 errors.

3)手动查杀相关会话,等待事件有所下降但效果不明显,考虑到尽快恢复业务,01:43 分左右,手动强制关闭并重启实例,最终恢复正常

2023-07-24T01:54:22.324043+08:00
Instance shutdown complete (OS id: 240400)
2023-07-24T01:54:30.637910+08:00
Starting ORACLE instance (normal) (OS id: 247320)
2023-07-24T01:54:30.766177+08:00
****************************************************
 Sys-V shared memory will be used for creating SGA
 ****************************************************
2023-07-24T01:54:30.767043+08:00
**********************************************************************
2023-07-24T01:54:30.767088+08:00
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)
2023-07-24T01:54:30.767163+08:00
 Per process system memlock (soft) limit = 500G
2023-07-24T01:54:30.767204+08:00
 Expected per process system memlock (soft) limit to lock
 instance MAX SHARED GLOBAL AREA (SGA) into memory: 300G
2023-07-24T01:54:30.767282+08:00
**********************************************************************

4)检查相关事件的历史信息,发现和一条INSERT SQL 语句相关,该语句用于向审计记录表插入审计信息

-- b3853arjnybzv
INSERT INTO AUDSYS.AUD$UNIFIED (AUDIT_TYPE, SESSIONID, PROXY_SESSIONID, OS_USER, HOST_NAME, TERMINAL
          , INSTANCE_ID, DBID, AUTHENTICATION_TYPE, USERID, PROXY_USERID, EXTERNAL_USERID, GLOBAL_USERID, CLIE
            NT_PROGRAM_NAME, DBLINK_INFO, XS_USER_NAME, XS_SESSIONID, ENTRY_ID, STATEMENT_ID, EVENT_TIMESTAMP, A
            CTION, RETURN_CODE, OS_PROCESS, TRANSACTION_ID, SCN, EXECUTION_ID, OBJ_OWNER, OBJ_NAME, CLIENT_IDENT
            IFIER, NEW_OWNER, NEW_NAME, OBJECT_EDITION, SYSTEM_PRIVILEGE_USED, SYSTEM_PRIVILEGE, AUDIT_OPTION, O
……

5)检查相关 trace,进一步验证了等待和该 SQL 有关

*** 2023-07-24T01:52:33.940747+08:00
HM: Early Warning - Session ID 8005 serial# 53205 OS PID 119329 (FG)
     is waiting on 'library cache lock' for 33 seconds, wait id 769
     p1: 'handle address'=0x69f80a2c8, p2: 'lock address'=0x589f111a8, p3: '100*mode+namespace'=0x520002
    Blocked by Session ID 22260 serial# 57425 on instance 3
     which is waiting on 'library cache lock' for 31 seconds
     p1: 'handle address'=0x69f80a2c8, p2: 'lock address'=0x51e3eeaa0, p3: '100*mode+namespace'=0x520003
    Final Blocker is Session ID 14289 serial# 52127 on instance 3
     which is 'not in a wait' for 0 seconds
                                                     IO
 Total Self- Total Total Outlr Outlr Outlr
  Hung Rslvd Rslvd Wait WaitTm Wait WaitTm Wait
  Sess Hangs Hangs Count Secs Count Secs Count Wait Event
------ ------ ------ ------ ------ ------ ------ ------ -----------
     0      0      0 286951398 96425517      0      0      0 library cache lock
 
HM: Current SQL: INSERT INTO AUDSYS.AUD$UNIFIED (AUDIT_TYPE, SESSIONID, PROXY_SESSIONID, OS_USER, HOST_NAME, TERMINAL, INSTANCE_ID, DBID, AUTHENTICATION_TYPE, USERID, PROXY_USERID, EXTERNAL_USERID, GLOBAL_USERID, CLIENT_PROGRAM_NAME, DBLINK_INFO, XS_USER_NAME, XS_SESSIONID, ENTRY_ID, STATEMENT_ID, EVENT_TIMESTAMP, ACTION, RETURN_CODE,

6)检查事件相关参数,同时通过 AWR 报告 LIBRARY CACHE ACTIVITY 数据交叉验证,发现 library cache 大量发生在SQL AREA BUILD,这说明游标有大量重新加载失败现象,最常见的原因是和游标相关的基表(AUD$UNIFIED)执行了 DDL 操作

Namespace
Mode COUNT(*)
--------------- ----------
   0x7f0003                1
   0x7f0002               38
   0x520003              156
   0x520002          1544911

NAMESPACE_ID NAMESPACE
  ---------- -----------------
           0 SQL AREA
           1 TABLE/PROCEDURE
           2 BODY
          55 TEMPORARY INDEX
          64 EDITION
          69 DBLINK
          72 OBJECT ID
          73 SCHEMA
          74 DBINSTANCE
          75 SQL AREA STATS
          79 ACCOUNT_STATUS
          82 SQL AREA BUILD

相关文章

Oracle如何使用授予和撤销权限的语法和示例
Awesome Project: 探索 MatrixOrigin 云原生分布式数据库
下载丨66页PDF,云和恩墨技术通讯(2024年7月刊)
社区版oceanbase安装
Oracle 导出CSV工具-sqluldr2
ETL数据集成丨快速将MySQL数据迁移至Doris数据库

发布评论