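Background
The application team reported that connections were dropping roughly every 15 minutes. The database is a three-node Oracle RAC cluster running 12c.
Fault description
The alert logs on all three nodes showed instance restarts and ORA-00600 errors recurring every 10 to 30 minutes. The errors look like this: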
2024-01-15T12:00:09.818988+08:00
ORACLE Instance xxxxdb2 (pid = 137) - Error 600 encountered while recovering transaction (1, 24) on object 427040.
2024-01-15T12:00:09.819155+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxxcdb/xxxxcdb2/trace/xxxxcdb2_smon_27712.trc:
ORA-00600: internal error code, arguments: [ktubko_1], [], [], [], [], [], [], [], [], [], [], []
2024-01-15T12:00:10.172785+08:00
Dumping diagnostic data in directory=[cdmp_20240115120010], requested by (instance=2, osid=27712 (SMON)), summary=[incident=2576433].
2024-01-15T12:00:25.626083+08:00
Errors in file /u01/app/oracle/diag/rdbms/xxxxcdb/xxxxcdb2/trace/xxxxcdb2_m000_76207.trc (incident=2579744) (PDBNAME=CDB$ROOT):
ORA-00600: internal error code, arguments: [4506], [2], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/xxxxcdb/xxxxcdb2/incident/incdir_2579744/sxsbcdb2_m000_76207_i2579744.trc
The trace file contains the following:
2024-01-15T12:00:10.172785+08:00
Incident 3176730 created, dump file: /u01/app/oracle/diag/rdbms/xxxxcdb/xxxxcdb2/incident/incdir_3176730/xxxxcdb2_smon_87659_i3176730.trc
2024-01-15T12:00:10.172785+08:00
ORA-00600: internal error code, arguments: [ktubko_1], [], [], [], [], [], [], [], [], [], [], []
ORACLE Instance xxxxcdb2 (pid = 164) - Error 600 encountered while recovering transaction (1, 24) on object 427040.
2024-01-15T12:00:10.172785+08:00(CDB$ROOT(1))
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=3, mask=0x0)
----- Error Stack Dump -----
ORA-00600: internal error code, arguments: [ktubko_1], [], [], [], [], [], [], [], [], [], [], []
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst()+119         call     kgdsdst()            7FFC942011F8 000000002
                                                   7FFC941E2C60 ? 7FFC941E2D78 ?
                                                   000000000 000000082 ?
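Fault analysis
Matching the ORA-00600 errors against My Oracle Support (MOS) notes turned up a description that links this kind of error to inconsistent SGA sizes across RAC nodes. The SGA sizes on the three nodes did differ: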
节点1 SGA_TARGET=329G
节点2 SGA_TARGET=329G
节点3 SGA_TARGET=419G
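After SGA_TARGET was set to 329G on all three nodes, the instances still crashed and restarted, so the SGA mismatch was not the root cause.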
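The per-instance values can be compared with a query along the following lines. This is a sketch rather than the exact command used during the investigation: gv$parameter is the standard cluster-wide parameter view, and the commented-out ALTER SYSTEM statement is only an assumption about how the value might be aligned.

```sql
-- Compare the SGA-related settings on every RAC instance.
select inst_id, name, display_value
  from gv$parameter
 where name in ('sga_target', 'sga_max_size')
 order by name, inst_id;

-- Assumption: one possible way to align the value on all instances.
-- scope = spfile takes effect at the next restart of each instance, and
-- an instance-specific spfile entry takes precedence over sid = '*'.
-- alter system set sga_target = 329G scope = spfile sid = '*';
```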
Every error in the logs points to the same failure: SMON hits an error while recovering a transaction on the object with object_id 427040.
ORACLE Instance xxxxdb2 (pid = 137) - Error 600 encountered while recovering transaction (1, 24) on object 427040.
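As a temporary measure, SMON transaction recovery was disabled with a hidden parameter, so that the instances would stop crashing and the database could stay open for further investigation.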
alter system set "_smu_debug_mode"=1024;
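While transaction recovery is paused, it can help to confirm that the setting is active on every instance and to look at the pending dead transaction. A minimal sketch, assuming the pair (1, 24) in the alert log is the undo segment number and slot. Whether the transaction is actually listed in gv$fast_start_transactions while recovery is disabled is not confirmed by the source.

```sql
-- Confirm the hidden parameter on every instance.
-- Hidden parameters are listed in gv$parameter only once set explicitly.
select inst_id, name, value
  from gv$parameter
 where name = '_smu_debug_mode';

-- Assumption: (1, 24) from the alert log maps to (usn, slt).
-- Shows the state and rollback progress of the dead transaction, if listed.
select inst_id, usn, slt, seq, state,
       undoblocksdone, undoblockstotal
  from gv$fast_start_transactions
 where usn = 1
   and slt = 24;
```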
Use the object_id from the log to look up the object's owner, name, and type in the database:
select owner,object_name,object_type from dba_objects where object_id=427040;
OWNER       OBJECT_NAME                 OBJECT_TYPE
----------- --------------------------- -----------------------------------
SYS         WRH$_LATCH_BL_PK            INDEX
The object is an index on an AWR table. Look up the table it belongs to:
select table_name,index_name from dba_indexes where index_name='WRH$_LATCH_BL_PK';
TABLE_NAME     INDEX_NAME
-------------- --------------------------------------------------------------------------
WRH$_LATCH_BL  WRH$_LATCH_BL_PK
Solution
Rebuild the index:
alter index WRH$_LATCH_BL_PK rebuild online;
Once the rebuild completes, reset the hidden parameter to re-enable SMON transaction recovery:
alter system reset "_smu_debug_mode";
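Before relying on the rebuilt index, its state can be verified. The check below is a sketch that was not part of the original procedure. It assumes the index is non-partitioned, which is consistent with the whole-index rebuild above.

```sql
-- The index should report VALID after the rebuild.
select owner, index_name, status
  from dba_indexes
 where owner = 'SYS'
   and index_name = 'WRH$_LATCH_BL_PK';

-- Optional structural check of the index blocks.
-- Without the ONLINE keyword this blocks DML on the table while it runs.
analyze index sys.WRH$_LATCH_BL_PK validate structure;
```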
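More generally, the appropriate fix depends on the type of the affected object:
- If the affected object_type is an index, drop and recreate the index.
- If a table or table partition is affected, export it as a backup, then either truncate and re-import the table, or move the table or partition online within the tablespace.
- If logical inconsistency is acceptable, fix the problem by recreating the segment.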
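The index and table options above might look roughly like the following. All object names here (app_owner.orders, orders_ix1, order_date, p2024_01) are hypothetical placeholders, not objects from this incident. The statements illustrate the general syntax only and are not a tested procedure.

```sql
-- Index case: drop and recreate (hypothetical names).
-- An index that enforces a primary key or unique constraint cannot be
-- dropped directly; the constraint has to be handled first.
drop index app_owner.orders_ix1;
create index app_owner.orders_ix1 on app_owner.orders (order_date);

-- Table case: relocate the segment online.
-- MOVE ONLINE for a non-partitioned table requires 12.2 or later.
alter table app_owner.orders move online;

-- Partition case: move a single partition online.
alter table app_owner.orders move partition p2024_01 online;

-- Afterwards, check that no index on the table was left UNUSABLE.
select index_name, status
  from dba_indexes
 where table_owner = 'APP_OWNER'
   and table_name = 'ORDERS';
```

Take an export or other backup of the affected table before trying any of these, as the list above recommends.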
Subsequent monitoring of the instance status and alert logs showed no further crashes, and the issue was resolved.