今天我帮一个客户分析一个Oracle RAC故障,说是运行在我司zData分布式存储上;当然第一时间我们就排除了zData本身的问题,为何呢?
因为该环境上部署了3套数据库,只有其中一套出了问题,这就很好解释了。
这个用户的数据库版本是Oracle 11.2.0.4,Oracle dba们都知道这是Oracle 11g中堪称最为稳定的数据库版本,没有之一。
我们来看看相关的日志,首先节点1最早开始出现相关报错:
Sat Dec 02 01:21:51 2023
Errors in file u01/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_241648.trc (incident=585920):
ORA-00600: internal error code, arguments: [17182], [0x7F48EF4B4F88], [], [], [], [], [], [], [], [], [], []
Incident details in: u01/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_585920/xxxx1_ora_241648_i585920.trc
.......
Sat Dec 02 01:24:09 2023
Block recovery from logseq 187420, block 393115 to scn 16343981745481
Recovery of Online Redo Log: Thread 1 Group 13 Seq 187420 Reading mem 0
Mem# 0: +FRADG/xxxx/onlinelog/group_13.290.1045325125
Block recovery completed at rba 187420.393158.16, scn 3805.1631184222
Block recovery from logseq 187420, block 390945 to scn 16343981745481
Recovery of Online Redo Log: Thread 1 Group 13 Seq 187420 Reading mem 0
Mem# 0: +FRADG/xxxx/onlinelog/group_13.290.1045325125
Block recovery completed at rba 187420.393158.16, scn 3805.1631184222
Errors in file u01/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_241648.trc (incident=585921):
ORA-00600: internal error code, arguments: [KSMFPG2], [0x7F48EF4B4000], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [17182], [0x7F48EF4B4F88], [], [], [], [], [], [], [], [], [], []
Incident details in: u01/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_585921/xxxx1_ora_241648_i585921.trc
LOGMINER: summary for session# = 2150937857
.......Errors in file u01/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_241648.trc:
ORA-00600: internal error code, arguments: [KSMFPG2], [0x7F48EF4B4000], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [17182], [0x7F48EF4B4F88], [], [], [], [], [], [], [], [], [], []
.......
最后实例crash后,Oracle为了维持数据一致性,其他存活节点需要继续进行事务恢复(SMON进程来完成);均为如下类似错误:
ORACLE Instance xxxx (pid = 41) - Error 600 encountered while recovering transaction (204, 69) on object 4224618.
Errors in file u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_smon_352248.trc:
ORA-00600: internal error code, arguments: [17182], [0x7F3D2988C000], [], [], [], [], [], [], [], [], [], []
......
ORA-00600: internal error code, arguments: [KSMFPG2], [0x7F3D2988B000], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [17182], [0x7F3D2988C000], [], [], [], [], [], [], [], [], [], []
Incident details in: u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_275279/xxxx2_smon_352248_i275279.trc
......
ORA-00600: internal error code, arguments: [KGHALO4], [0x00C108C58], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [KSMFPG2], [0x7F3D2988B000], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [17182], [0x7F3D2988C000], [], [], [], [], [], [], [], [], [], []
Incident details in: u01/app/oracle/diag/rdbms/xxxx/xxxx2/incident/incdir_275280/xxxx2_smon_352248_i275280.trc
Fatal internal error happened while SMON was doing instance transaction recovery.
......
SMON (ospid: 352248): terminating the instance due to error 474
Sat Dec 02 01:46:37 2023
opiodr aborting process unknown ospid (385507) as a result of ORA-1092
Sat Dec 02 01:46:37 2023
System state dump requested by (instance=2, osid=352248 (SMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_diag_352116_20231202014637.trc
Sat Dec 02 01:46:39 2023
ORA-1092 : opitsk aborting process
Instance terminated by SMON, pid = 352248
实际上从上面的报错来看,基本上就可以直接断定均为内存有关。首先分析对应的SMON trace 能看到如下内容:
TRN CTL:: seq: 0xa8ae chd: 0x001d ctl: 0x003e inc: 0x00000000 nfb: 0x0000
mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe)
uba: 0x024177f8.a8ae.33 scn: 0x0edd.613d3de6
Version: 0x01
FREE BLOCK POOL::
uba: 0x00000000.a8ae.32 ext: 0x5 spc: 0x187e
uba: 0x00000000.a8ae.29 ext: 0x5 spc: 0x1f58
uba: 0x00000000.a8ae.1d ext: 0x5 spc: 0x2f40
uba: 0x00000000.a738.50 ext: 0x7 spc: 0x1a76
uba: 0x00000000.8f7b.2e ext: 0x9 spc: 0x299a
TRN TBL::
index state cflags wrap# uel scn dba parent-xid nub stmt_num cmt
------------------------------------------------------------------------------------------------
0x00 9 0x00 0x2a4a3c 0x0020 0x0edd.613f585b 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1701451851
0x01 9 0x00 0x2a4e2b 0x0005 0x0edd.6140c75a 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1701451901
0x02 9 0x00 0x2a486a 0x0030 0x0edd.613d7141 0x024177f6 0x0000.000.00000000 0x00000001 0x00000000 1701451780
.......
0x43 9 0x00 0x2a4c09 0x003c 0x0edd.6140e53e 0x024177f7 0x0000.000.00000000 0x00000001 0x00000000 1701451990
0x44 9 0x00 0x2a50f8 0x001a 0x0edd.613f5c67 0x00000000 0x0000.000.00000000 0x00000000 0x00000000 1701451851
0x45 10 0x80 0x2a40f7 0x0005 0x0edd.61434197 0x024177f9 0x0000.000.00000000 0x00000002 0x00000000 0
EXT TRN CTL::
usn: 204
其中undo header的dump结构中:0x45转换为10进制则为69。Usn:204 表示回滚段编号(state 10表示为活动事务)。接着再往下就能看到读取到的错误数据块:
GLOBAL CACHE ELEMENT DUMP (address: 0x309fb40e98):
id1: 0x8d78f id2: 0x4e pkey: OBJ#4224618 block: (78/579471)
lock: X rls: 0x0 acq: 0x0 latch: 20
flags: 0x20 fair: 0 recovery: 0 fpin: 'kdowh01: kdoiur'
bscn: 0xedd.614a5255 bctx: (nil) write: 0 scan: 0x0
lcp: (nil) lnk: [NULL] lch: [0x1e3ed40b70,0x1e3ed40b70]
seq: 29 hist: 239 143:0 85 16 352 329 144:6 14 7 352 329
LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
flg: 0x00000004 state: XCURRENT tsn: 9 tsh: 0 mode: EXCL
pin: 'kdowh01: kdoiur'
addr: 0x1e3ed40a38 obj: 4224618 cls: DATA bscn: 0x3244.37355f37
Buffer being modified, corruption checks may be unreliable
buffer tsn: 9 rdba: 0x45393946 (276/3750214)
scn: 0x3244.37355f37 seq: 0x5f flg: 0x45 tail: 0x52550601
frmt: 0x00 chkval: 0x4134 type: 0x26=KTU SMU HEADER BLOCK
Hex dump of corrupt header 4 = CORRUPT
进一步分析相关trace文件可以发现其中更为明显的证据:
EXTENT 12 addr=0x7ff6b9f57008
Chunk 7ff6b9f57018 sz= 32712 free " "
Chunk 7ff6b9f5efe0 sz= 16400 freeable "callheap " ds=0xc108b60
Chunk 7ff6b9f62ff0 sz= 16400 freeable "callheap " ds=0xc108b60
EXTENT 13 addr=0x7ff6ba4b7008
Chunk 7ff6ba4b7018 sz= 16288 free " "
Chunk 7ff6ba4bafb8 sz= 16400 freeable "callheap " ds=0xc108b60
Chunk 7ff6ba4befc8 sz= 16400 BAD MAGIC NUMBER IN NEXT CHUNK (303533395F343342)
freeable "callheap " ds=0xc108b60
Dump of memory from 0x00007FF6BA4BEFC8 to 0x00007FF6BA4C3FD8
。。。。。。
EXTENT 14 addr=0x7ff6ba6f0008
Chunk 7ff6ba6f0018 sz= 80 perm "perm " alo=80
Chunk 7ff6ba6f0068 sz= 656 perm "perm " alo=656
Chunk 7ff6ba6f02f8 sz= 4872 free " "
Chunk 7ff6ba6f1600 sz= 1112 freeable "callheap " ds=0xc108b60
Chunk 7ff6ba6f1a58 sz= 1112 freeable "callheap " ds=0xc108b60
Chunk 7ff6ba6f1eb0 sz= 16400 freeable "callheap " ds=0xc108b60
Chunk 7ff6ba6f5ec0 sz= 4144 freeable "callheap " ds=0xc108b60
Chunk 7ff6ba6f6ef0 sz= 32848 freeable "callheap " ds=0xc108b60
Chunk 7ff6ba6fef40 sz= 1032 recreate "callheap " latch=(nil)
ds 00c108c18 sz= 1032 ct= 1
Chunk 7ff6ba6ff348 sz= 1112 freeable "callheap " ds=0xc108b60
Chunk 7ff6ba6ff7a0 sz= 1112 freeable "callheap " ds=0xc108b60
Chunk 7ff6ba6ffbf8 sz= 1032 recreate "callheap " latch=(nil)
ERROR, BAD NEXT EXTENT (0x7ff6ba4c2ff0, 0xedd02002c800141, 0x6)
Dump of memory from 0x00007FF6BA4C1FF0 to 0x00007FF6BA4C3FF0
更为详细的情况可以参考其中两篇Mos文档:
ORA-07445 [kglnpfr] Errors or Several Internal Errors due to Shared Pool Memory Corruptions in 11.2.0.4 ...Instance may Crash |
Bug 22243719 - Several Internal Errors due to Shared Pool Memory Corruptions in 11.2.0.4 and later. Instance may Crash |
这里摘录其中文档中的一小段描述:
CAUSE
The reported errors and the symptoms closely match Unpublished Bug 22243719 - Several Internal Errors due to Shared Pool Memory Corruptions in 11.2.0.4 and later. Instance may Crash.
Shared pool memory corruptions could occur, causing instance crashes. This issue has been introduced in 11.2.0.4 or greater.
A corruption occurs in the shared pool. This can result in any of a variety of crashes and internal errors (a few are listed below).
The key to this particular case of corruption is the bad data. Ideally, two consecutive words should be corrupted that way. For example, in the case of a ORA_600 [17147] internal error, we might see in the heap dump:
"NEXT CHUNK'S PREVIOUS POINTER NOT POINTING TO CURRENT CHUNK"
从文档的Bug描述来看,提到在11.2.0.4.161018中已修复,但从实际情况来看,修复可能并不彻底。当然,Oracle MOS也提供了相应的解决方案,除了打补丁、升级之外还可以通过设置"_pga_auto_snapshot_threshold"=0 来进行规避。
对于使用Oracle 11.2.0.4版本的用户,建议更新到最新版本。