其他3

2023年 9月 2日 106.6k 0

enq: TX - row lock contention

一、故障现象

应用反应堵塞,检查数据库等待事件出现'enq: TX - row lock contention'

二、可能故障原因

  • 业务更新或者删除同一行记录
  • 对创建位图索引的列值更新
  • 对主键或唯一键插入相同记录

三、分析思路

四、应急处理

1、在enq: TX - row lock contention发生的实例上执行查询:

set linesize 180
set pagesize 999
column username format a16
column event format a40
column sql_child_number format 9999 heading 'SQL Child|Number'
column blocking_instance format 9999 heading 'Blocker|Instance'
column blocking_session format 99999 heading 'Blocker|Session'
column FINAL_BLOCKING_INSTANCE format 9999 heading 'F_Blocker|Instance'
column FINAL_BLOCKING_SESSION format 99999 heading 'F_Blocker|Session'
column "Lock" format a6
column "Mode" format a6
Select sid,
       seq#,
       username,
       event,
       sql_id,
       sql_child_number,
       chr(to_char(bitand(p1, -16777216)) / 16777215) ||
       chr(to_char(bitand(p1, 16711680)) / 65535) "Lock",
       to_char(bitand(p1, 65535)) "Mode",
       blocking_instance,
       blocking_session,
       final_blocking_instance,
       final_blocking_session,
       last_call_et
  From v$session
 Where event = 'enq: TX - row lock contention';

2、关注blocking_instance/blocking_session和final_blocking_instance/final_blocking_session

1) 如果final_blocking_instance/final_blocking_session没有变化,说明可能是由于某个进程没有提交事务导致的enq: TX - row lock contention

--到holder所在的实例执行以下查询:

set lines 180 pages 999
column username format a15
column program format a35
column machine format a20
column event format a30
column sql_id format a13
column status format a10
Select s.sid,
       s.username,
       s.event,
       s.program,
       s.machine,
       s.sql_id,
       s.sql_child_number,
       s.status,
       s.last_call_et
  From v$session s
 Where s.sid in (&sid);

--确认不是DB进程,对堵塞者生成kill语句:

set lines 200 pages 999
col username for a15
col machine for a20
col program for a35
col spid for 99999999999
col dbkill for a40
col oskill for a20
select a.username,
       a.program,
       b.addr,
       b.spid,
       'alter system kill session ''' || a.sid || ',' || a.serial# || ''' immediate;' "dbkill",
       'kill -9 ' || b.spid "oskill"
  from v$session a, v$process b
 where a.paddr = b.addr(+)
   and a.sid = &sid
-- and a.serial#=&serial
 order by a.username, a.program;

--执行生成的dbkill命令后继续观察session状态,如果长时间没有消失,在OS级别执行生成的oskill命令(如果是DB 进程,不可随意kill,否则会导致系统crash):

ps –ef|grep &spid
kill -9 &spid

2) 如果blocking_instance/blocking_session和final_blocking_instance/final_blocking_session均在不停变化,说明可能是由于业务SQL执行慢或者是并发高于正常值导致。

--查看event对应会话详情,反馈运营定位应用

set lines 180 pages 999
column username format a15
column program format a35
column machine format a20
column event format a30
column sql_id format a13
column status format a10
Select s.sid,
       s.username,
       s.event,
       s.program,
       s.machine,
       s.sql_id,
       s.sql_child_number,
       s.status,
       s.last_call_et
  From v$session s
 Where event = 'enq: TX - row lock contention'
 Order By event, sql_id;

--查看SQL执行计划,进一步分析是否是SQL性能问题,如下查询生成查看SQL执行计划的语句:

set lines 160 pages 999
set long 99999
col PLAN_TABLE_OUTPUT for a160
select sql_fulltext,
       parsing_schema_name,
       'select * from table(dbms_xplan.display_cursor(''' || sql_id ||
       ''',' || child_number || ',''all iostats last +PEEKED_BINDS''));' xplan
  from v$sql
 where sql_id = '&sql_id';

包括SQL本身性能问题或者并发突发性增长,这类情况下kill阻塞者并不能解决问题,通知运营决策,允许的条件下通过临时kill所有等待的会话释放

select 'alter system kill session ' || '''' || sid || ',' || serial# || ',@' || inst_id || '''' || ' immediate;'
  from gv$session
 where event = 'enq: TX - row lock contention'
   and type = 'USER';

五、后续跟进

  • 如果是固定阻塞者,根据blocker session信息,联系相关应用确认该事务未正常提交的原因,避免下次问题发生。
  • 如果是SQL性能变差导致执行慢进而产生等待,进一步分析处理SQL性能问题。
  • 如果SQL执行效率没有问题,那么有可能是并发异常导致的等待,根据应用模块及执行SQL的信息,联系相关应用及时处理并发过高的异常。

enq: SQ - contention/enq: SV – contention(DFS lock handle)

一、故障现象

l  现象一:应用反应堵塞,检查数据库等待事件出现' DFS lock handle',并且检查出是SV锁。

l  现象二:应用反应堵塞,检查数据库等待事件出现'enq: SQ - contention'。

二、可能故障原因

l  需要根据p1判断类型来确认,这里处理的例子是SV锁

l  SV锁是RAC上赋予了cache+order的Sequence上发生

l  SQ锁是赋予了cache+noorder属性的Sequence上发生

三、分析思路

对于SV锁,在RAC环境尽量避免order,sequence的作用是保证全局唯一,如果需要同时实现全局有序,应该在应用侧实现。联系相关应用将order属性的sequence重建为noorder,建议停业务改。

四、应急处理

在DFS lock handle发生的实例上执行查询:

set linesize 160
set pagesize 999
column username format a16
column event format a40
column p1 format 999999999999999999999
column p2 format 999999999999999
column p3 format 99999999999
column sql_child_number format 9999 heading 'SQL Child|Number'
Select sid,
       username,
       event,
       sql_id,
       sql_child_number,
      p1,
       p2,
       p3 ,
       last_call_et
  From v$session
where event = 'DFS lock handle'
order by event, sql_id, p1;

根据p1查类型

select chr(bitand(&&p1,-16777216)/16777215) ||
chr(bitand(&&p1,16711680)/65535) type,
mod(&p1, 16) md
from dual;

下面的问题以查出来的是SV锁为例

set lines 160 pages 999
col owner for a20
col object_name for a20
col subobject_name for a20
col object_type for a20
col lastddltime for a30
select owner,
       object_name,
       subobject_name,
       object_type,
       status,
       to_char(last_ddl_time, 'yyyy-mm-dd hh24:mi:ss') lastddltime
  from dba_objects
 where object_id=&p2;

--代入上一步查到的sequence_name

select * from dba_sequences where sequence_name = '&sequence_name';

enq: SQ – contention同样可以根据p2找到sequence_name,并查看sequence属性。过程没有太大差别。

根据实际情况严重程度进行如下紧急处理:

a)   增加cache

b)   允许的条件下临时把DFS lock handle或者enq: SQ – contention等待的会话杀掉,观察是否还有等待。

set lines 200 pages 999
col username for a15
col machine for a20
col program for a35
col spid for 99999999999
col dbkill for a40
col oskill for a20
Select a.username,
       a.program,
       a.machine,
       b.addr,
       b.spid,
       'alter system kill session ''' || a.sid || ',' || a.serial# || ''';' "dbkill",
       'kill -9 ' || b.spid "oskill"
  From v$session a, v$process b
 Where a.paddr = b.addr(+)
   And a.event = 'DFS lock handle'
   And a.type = ‘USER’
 Order By a.username, a.program;

enq: US – contention

一、故障现象

应用处理速度下降,查询等待事件发现大量的' enq: US – contention ',同时伴随dc_rollback_segments 的latch: row cache objects等待事件;

二、可能故障原因

l  业务并发量大,online undo segment 的个数不足;

l  业务并发量大,undo 表空间不够大;

三、分析思路

 此类问题主要是业务并发量大,undo表空间不足或online segment不足引起的,解决思路是增加undo表空间,或增加online segment的个数以及降低TUNED_UNDORETENTION。

四、应急处理

在enq: US – contention发生的实例上执行查询:

column UNXPSTEALCNT heading "# Unexpired|Stolen"
column EXPSTEALCNT heading "# Expired|Reused"
column SSOLDERRCNT heading "ORA-1555|Error"
column NOSPACEERRCNT heading "Out-Of-space|Error"
column MAXQUERYLEN heading "Max Query|Length"
select inst_id, to_char(begin_time,'MM/DD/YYYY HH24:MI') begin_time,
UNXPSTEALCNT, EXPSTEALCNT , SSOLDERRCNT, NOSPACEERRCNT, MAXQUERYLEN, TUNED_UNDORETENTION
from v$undostat
order by inst_id, begin_time;

--关注online segment的数量

select tablespace_name , status , count(*) number_of_rollback_segments from dba_rollback_segs group by tablespace_name , status;

--检查undo tablespace 的使用情况

SELECT  /* + RULE */
       df.tablespace_name "Tablespace",
        df.bytes / (1024 * 1024) "Size (MB)",
        SUM (fs.bytes) / (1024 * 1024) "Free (MB)",
        NVL (ROUND (SUM (fs.bytes) * 100 / df.bytes), 1) "% Free",
        ROUND ( (df.bytes - SUM (fs.bytes)) * 100 / df.bytes) "% Used"
   FROM dba_free_space fs,
        (  SELECT tablespace_name, SUM (bytes) bytes
             FROM dba_data_files
         GROUP BY tablespace_name) df
  WHERE fs.tablespace_name(+) = df.tablespace_name
GROUP BY df.tablespace_name, df.bytes
UNION ALL
 SELECT  /* + RULE */
       df.tablespace_name tspace,
        fs.bytes / (1024 * 1024),
        SUM (df.bytes_free) / (1024 * 1024),
        NVL (ROUND ( (SUM (fs.bytes) - df.bytes_used) * 100 / fs.bytes), 1),
        ROUND ( (SUM (fs.bytes) - df.bytes_free) * 100 / fs.bytes)
   FROM dba_temp_files fs,
        (  SELECT tablespace_name, bytes_free, bytes_used
             FROM v$temp_space_header
         GROUP BY tablespace_name, bytes_free, bytes_used) df
  WHERE fs.tablespace_name(+) = df.tablespace_name
GROUP BY df.tablespace_name,
        fs.bytes,
        df.bytes_free,
        df.bytes_used
ORDER BY 4 DESC;

1)   如果TUNED_UNDORETENTION的值异常大,online 状态的undo segment 不是很多,且会慢慢增加,同时undo 表空间使用率不高,根据实际情况严重程度进行如下紧急处理:

a)   设置UNDO TABLESPACE里的所有DATAFILE 为AUTOEXTEND ON,同时MAX SIZE设置为当前UNDO TABLESPACE 大小

语法如下:

ALTER DATABASE DATAFILE ‘’ AUTOEXTEND ON MAXSIZE ;

b)   根据实例情况设置 _rollback_segment_count 为比较大的值,语法如下:

alter system set "_rollback_segment_count"=2000;

2)   如果TUNED_UNDORETENTION的值异常大,online 状态的undo segment基本保持不变,且undo 表空间使用率达到100%,根据实际情况严重程度进行如下紧急处理:

增加undo 表空间的大小:

 alter tablespace UNDOTBS1XXX ADD DATAFILE 'XXXXX' size XXXXM;

library cache lock/pin

一、故障现象

某些session执行操作被堵塞,检查event发现’library cache lock/pin’等待;

二、可能故障原因

l  library cache lock/pin发生在多个session对相同library cache对象进行争用发生,一般来说在存储过程编译过程中发生并堵塞编译。

三、分析思路

此类问题主要是由于并发执行对象编译导致的,解决思路就是将编译动作串行执行,减少并发争用。同时,后续需要查询此类操作为什么发起,在业务高峰期应当避免。 

四、应急处理

在RAC环境中,Library cache lock/pin与enqueue不同,他不是全局等待。首先我们在其中一个发生library cache lock/pin等待的节点上查看等待事件:

set linesize 160
set pagesize 999
column username format a16
column event format a40
column p1 format 999999999999999999999
column p2 format 999999999999999
column p3 format 99999999999
column sql_child_number format 9999 heading 'SQL Child|Number'
Select sid,
       seq#,
       username,
       event,
       sql_id,
       sql_child_number,
       p1raw,
       p2,
       last_call_et
  From v$session
 Where event Not In ('SQL*Net message from client',
                     'rdbms ipc message',
                     'rdbms ipc message',
                     'SQL*Net message to client',
                     'pmon timer',
                     'smon timer',
                     'wakeup time manager',
                    'gcs remote message')
   And event Not Like 'Streams AQ:%'
 Order By event, sql_id, p1;

根据p1raw查找library cache lock/pin的阻塞对象:

--library cache lock
set linesize 200 pagesize 999
column username format a10
column program format a30
column machine format a15
column sql_id format a14
column event format a30
column owner format a12
column object_name format a25
select s.sid, s.username, s.program, s.machine, s.sql_id, s.event,
       s.status, l.kgllktype as type, l.kgllkmod as lmode,
       o.kglnaown as owner, o.kglnaobj as object_name
 from dba_kgllock l, x$kglob o, v$session s
where l.kgllkhdl = '&p1raw'
  and l.kgllkreq = 0
  and l.kgllkmod > 1
  and l.kgllkhdl = o.kglhdadr
  and s.SADDR = l.kgllkuse;
 

--library cache pin
set linesize 160
set linesize 200
column username format a10
column sql_id format a14
column event format a30
col ospid format 999999
column owner format a12
column object_name format a25
Select /*+rule*/
 s.sid,
 s.username,
 s.sql_id,
 s.event,
 s.status,
 p.kglpnmod As lmode,
 p.kglpnreq As req,
 o.spid     As ospid,
 k.kglnaown As owner,
 k.kglnaobj As object_name
  From v$session_wait w, x$kglpn p, v$session s, v$process o, x$kglob k
 Where w.event Like '%library cache pin%'
   And w.p1raw = '&p1raw'
   And w.p1raw = p.kglpnhdl
   And p.kglpnreq = 0
   And p.kglpnmod > 0
   And p.kglpnuse = s.saddr
   And s.paddr = o.addr
   And p.kglpnhdl = k.kglhdadr;

等待较多的情况下,通过多次执行上面对应的两个脚本梳理出阻塞关系,最终查询holder在做什么,然后kill处理。RAC中大多数情况会查到最后发现没有阻塞对象,这时候需要根据梳理到最后的p1raw查看保护的对象是什么。

set lines 160 pages 999
col kglnaown for a12
col kglnaobj for a25
select kglnaown, kglnaobj from x$kglob where kglhdadr = '&p1raw';

根据这个对象再在本节点复查一遍锁情况

set lines 160 pages 999
col owner for a12
col name for a25
select session_id sid, owner, name, type,mode_held held, mode_requested request
  from dba_ddl_locks
 where name like '%PIN_TEST%';

上面的name替换成实际值,如果查到的会话动作都是之前已经梳理过的阻塞关系,那么说明阻塞者可能再另外的节点上。

检查另一个节点

set lines 160 pages 999
col owner for a12
col name for a25
select session_id sid, owner, name, type,mode_held held, mode_requested request
  from dba_ddl_locks
 where name like '%PIN_TEST%';

上面的name替换成实际值,记录下锁情况,再查看这个节点上的等待:

set linesize 160
set pagesize 999
column username format a16
column event format a40
column p1 format 999999999999999999999
column p2 format 999999999999999
column p3 format 99999999999
column sql_child_number format 9999 heading 'SQL Child|Number'
Select sid,
       seq#,
       username,
       event,
       sql_id,
       sql_child_number,
       p1raw,
       p2,
       last_call_et
  From v$session
 Where event Not In ('SQL*Net message from client',
                     'rdbms ipc message',
                     'rdbms ipc message',
                     'SQL*Net message to client',
                     'pmon timer',
                     'smon timer',
                     'wakeup time manager',
                     'gcs remote message')
   And event Not Like 'Streams AQ:%'
 Order By event, sql_id, p1;

复杂的环境下多个节点都会有library cache lock/pin等待,根据前面查找pin/lock阻塞者的方法继续梳理,直到找到其他等待的holder。

另外一种查法,适合阻塞不多,通过对象查找的kglhdadr不多从而很容易确认session的情况

set lines 160 pages 999
col kglnaown for a12
col kglnaobj for a25
select kglhdadr,kglnaown, kglnaobj from x$kglob where kglnaobj = 'PIN_TEST';

 

--代入查到的kglhdadr
set linesize 160
set pagesize 999
column username format a16
column event format a40
column p1 format 999999999999999999999
column p2 format 999999999999999
column p3 format 99999999999
column sql_child_number format 9999 heading 'SQL Child|Number'
Select sid,
       seq#,
       username,
       event,
       sql_id,
       sql_child_number,
       p1raw,
       p2,
       last_call_et
 From v$session s
 Where s.saddr In
       (Select kgllkuse From dba_kgllock Where kgllkhdl = '00000000A8411CF8');

根据实际情况严重程度进行如下紧急处理:

a)   对所有导致library cache lock/pin的session进行kill,解决堵塞情况。

注意在os级别kill之前,先用ps命令查看一下该进程,如果是DB进程,不可随意kill,否则会导致系统crash

library cache lock/pin

一、故障现象

某些session执行操作被堵塞,检查event发现’library cache lock/pin’等待;

二、可能故障原因

l  library cache lock/pin发生在多个session对相同library cache对象进行争用发生,一般来说在存储过程编译过程中发生并堵塞编译。

三、分析思路

此类问题主要是由于并发执行对象编译导致的,解决思路就是将编译动作串行执行,减少并发争用。同时,后续需要查询此类操作为什么发起,在业务高峰期应当避免。 

四、应急处理

在RAC环境中,Library cache lock/pin与enqueue不同,他不是全局等待。首先我们在其中一个发生library cache lock/pin等待的节点上查看等待事件:

set linesize 160
set pagesize 999
column username format a16
column event format a40
column p1 format 999999999999999999999
column p2 format 999999999999999
column p3 format 99999999999
column sql_child_number format 9999 heading 'SQL Child|Number'
Select sid,
       seq#,
       username,
       event,
       sql_id,
       sql_child_number,
       p1raw,
       p2,
       last_call_et
  From v$session
 Where event Not In ('SQL*Net message from client',
                     'rdbms ipc message',
                     'rdbms ipc message',
                     'SQL*Net message to client',
                     'pmon timer',
                     'smon timer',
                     'wakeup time manager',
                    'gcs remote message')
   And event Not Like 'Streams AQ:%'
 Order By event, sql_id, p1;

根据p1raw查找library cache lock/pin的阻塞对象:

--library cache lock
set linesize 200 pagesize 999
column username format a10
column program format a30
column machine format a15
column sql_id format a14
column event format a30
column owner format a12
column object_name format a25
select s.sid, s.username, s.program, s.machine, s.sql_id, s.event,
       s.status, l.kgllktype as type, l.kgllkmod as lmode,
       o.kglnaown as owner, o.kglnaobj as object_name
 from dba_kgllock l, x$kglob o, v$session s
where l.kgllkhdl = '&p1raw'
  and l.kgllkreq = 0
  and l.kgllkmod > 1
  and l.kgllkhdl = o.kglhdadr
  and s.SADDR = l.kgllkuse;
 

--library cache pin
set linesize 160
set linesize 200
column username format a10
column sql_id format a14
column event format a30
col ospid format 999999
column owner format a12
column object_name format a25
Select /*+rule*/
 s.sid,
 s.username,
 s.sql_id,
 s.event,
 s.status,
 p.kglpnmod As lmode,
 p.kglpnreq As req,
 o.spid     As ospid,
 k.kglnaown As owner,
 k.kglnaobj As object_name
  From v$session_wait w, x$kglpn p, v$session s, v$process o, x$kglob k
 Where w.event Like '%library cache pin%'
   And w.p1raw = '&p1raw'
   And w.p1raw = p.kglpnhdl
   And p.kglpnreq = 0
   And p.kglpnmod > 0
   And p.kglpnuse = s.saddr
   And s.paddr = o.addr
   And p.kglpnhdl = k.kglhdadr;

等待较多的情况下,通过多次执行上面对应的两个脚本梳理出阻塞关系,最终查询holder在做什么,然后kill处理。RAC中大多数情况会查到最后发现没有阻塞对象,这时候需要根据梳理到最后的p1raw查看保护的对象是什么。

set lines 160 pages 999
col kglnaown for a12
col kglnaobj for a25
select kglnaown, kglnaobj from x$kglob where kglhdadr = '&p1raw';

根据这个对象再在本节点复查一遍锁情况

set lines 160 pages 999
col owner for a12
col name for a25
select session_id sid, owner, name, type,mode_held held, mode_requested request
  from dba_ddl_locks
 where name like '%PIN_TEST%';

上面的name替换成实际值,如果查到的会话动作都是之前已经梳理过的阻塞关系,那么说明阻塞者可能再另外的节点上。

检查另一个节点

set lines 160 pages 999
col owner for a12
col name for a25
select session_id sid, owner, name, type,mode_held held, mode_requested request
  from dba_ddl_locks
 where name like '%PIN_TEST%';

上面的name替换成实际值,记录下锁情况,再查看这个节点上的等待:

set linesize 160
set pagesize 999
column username format a16
column event format a40
column p1 format 999999999999999999999
column p2 format 999999999999999
column p3 format 99999999999
column sql_child_number format 9999 heading 'SQL Child|Number'
Select sid,
       seq#,
       username,
       event,
       sql_id,
       sql_child_number,
       p1raw,
       p2,
       last_call_et
  From v$session
 Where event Not In ('SQL*Net message from client',
                     'rdbms ipc message',
                     'rdbms ipc message',
                     'SQL*Net message to client',
                     'pmon timer',
                     'smon timer',
                     'wakeup time manager',
                     'gcs remote message')
   And event Not Like 'Streams AQ:%'
 Order By event, sql_id, p1;

复杂的环境下多个节点都会有library cache lock/pin等待,根据前面查找pin/lock阻塞者的方法继续梳理,直到找到其他等待的holder。

另外一种查法,适合阻塞不多,通过对象查找的kglhdadr不多从而很容易确认session的情况

set lines 160 pages 999
col kglnaown for a12
col kglnaobj for a25
select kglhdadr,kglnaown, kglnaobj from x$kglob where kglnaobj = 'PIN_TEST';

 

--代入查到的kglhdadr
set linesize 160
set pagesize 999
column username format a16
column event format a40
column p1 format 999999999999999999999
column p2 format 999999999999999
column p3 format 99999999999
column sql_child_number format 9999 heading 'SQL Child|Number'
Select sid,
       seq#,
       username,
       event,
       sql_id,
       sql_child_number,
       p1raw,
       p2,
       last_call_et
 From v$session s
 Where s.saddr In
       (Select kgllkuse From dba_kgllock Where kgllkhdl = '00000000A8411CF8');

根据实际情况严重程度进行如下紧急处理:

a)   对所有导致library cache lock/pin的session进行kill,解决堵塞情况。

注意在os级别kill之前,先用ps命令查看一下该进程,如果是DB进程,不可随意kill,否则会导致系统crash

read by other session

一、故障现象

某些SQL语句执行缓慢,检查event发现’read by other session’等待;

二、可能故障原因

read by other session说明多个session正在相同的block上进行读的争用,说明当前存在读的hot block,一般同db file sequential read或db file scattered read 同时出现。

三、分析思路

四、应急处理

在read by other session发生的实例上执行查询:

set linesize 180
set pagesize 999
column username format a16
column event format a40
column p1 format 999999999999999999999
column p2 format 999999999999999
column p3 format 99999999999
column sql_child_number format 9999 heading 'SQL Child|Number'
Select sid,
       username,
       event,
       sql_id,
       sql_child_number,
       p1,
       p2,
       p3,
       row_wait_obj#,
       last_call_et
  From v$session
 Where event event In ('db file sequential read',
                 'read by other session',
                'db file scattered read')
 Order By event, sql_id, p1;

找到在hot block上读争用的SQL,对SQL语句的执行计划进行检查。

根据实际情况严重程度进行如下紧急处理:

a)   如果SQL语句比较分散,对集中的P1和P2的对象(可以根据ROW_WAIT_OBJ#去查对象)进行检查,是否存在统计信息过期和索引失效导致热点问题,以及对象级别上是否可以采用分区等方式分散IO。

--根据row_wait_obj#查对象,速度较快,但有时候结果不一定准确

set lines 160 pages 999
col owner for a20
col object_name for a20
col subobject_name for a20
col object_type for a20
col lastddltime for a30
select owner,
       object_name,
       subobject_name,
       object_type,
       status,
       to_char(last_ddl_time, 'yyyy-mm-dd hh24:mi:ss') lastddltime
  from dba_objects
 where object_id=&obj;

--根据p1和p2查对象,速度慢,结果会更准确

set lines 160
set pages 999
col owner format a30
col segment_name format a40
col segment_type format a30
col partition_name format a40
Select owner, segment_name, segment_type, partition_name
  From dba_extents
 Where file_id = &file_id
   And &block_id Between block_id And block_id + blocks - 1;

b)   如果SQL语句比较集中,则直接检查SQL,是否存在执行计划错误。处理详见5.1

c)   如果问题是由于大量相同SQL批量执行导致,联系相关应用及时处理并发过高的异常。

d)   如果是新上的业务SQL,并且性能本身就很差,联系相关应用及时处理。

latch: cache buffers chains

一、故障现象

二、可能故障原因

latch: cache buffers chains 表明当前SQL执行过程中产生热块;

三、分析思路

此类问题主要是由于业务并发高导致或SQL本身执行计划异常或者SQL本身性能问题造成。

四、应急处理

在latch: cache buffers chains发生的实例上执行查询:

set linesize 180
set pagesize 999
column username format a16
column event format a40
column p1 format 999999999999999999999
column p2 format 999999999999999
column p3 format 99999999999
column sql_child_number format 9999 heading 'SQL Child|Number'
Select sid,
       seq#,
       username,
       event,
       sql_id,
       sql_child_number,
       p1raw,
       p2,
       row_wait_obj#,
       last_call_et
  From v$session
 Where event In ('latch: cache buffers chains')
 Order By event, sql_id, p1;

找到读争用的SQL,对SQL语句的执行计划进行检查。

根据实际情况严重程度进行如下紧急处理:

a)   如果SQL语句比较分散,对集中的P1RAW的对象(可以根据ROW_WAIT_OBJ#去查对象)进行检查,是否存在统计信息过期和索引失效导致热点问题,以及对象级别上是否可以采用分区等方式分散IO。如存在执行计划错误,对相关SQL执行计划进行purge并重新解析;

--根据row_wait_obj#查对象,速度较快,但有时候结果不一定准确

set lines 160 pages 999
col owner for a20
col object_name for a20
col subobject_name for a20
col object_type for a20
col lastddltime for a30
select owner,
       object_name,

       subobject_name,
       object_type,
       status,
       to_char(last_ddl_time, 'yyyy-mm-dd hh24:mi:ss') lastddltime
  from dba_objects
 where object_id=&obj;

--根据p1raw查对象,速度慢,结果会更准确

set lines 160 pages 999
col object_name for a30
Select a.hladdr,
       a.file#,
       a.dbablk,
      a.tch,
       a.obj,
       b.owner || '.' || b.object_name object_name
  From x$bh a, dba_objects b
 Where (a.obj = b.object_id Or a.obj = b.data_object_id)
   And a.hladdr In ('&&p1raw')
Union
Select hladdr, file#, dbablk, tch, obj, Null
  From x$bh
 Where obj In (Select obj
                 From x$bh
                Where hladdr In ('&&p1raw')
               Minus
               Select object_id
                 From dba_objects
               Minus
               Select data_object_id
                 From dba_objects)
   And hladdr In ('&p1raw')
 Order By 4 desc;

b)   如果SQL语句比较集中,则直接检查SQL,是否存在执行计划错误。处理详见SQL性能问题应急处理

c)   如果问题是由于大量相同SQL批量执行导致,联系相关应用及时处理并发过高的异常。

d)   如果是新上的业务SQL,并且性能本身就很差,联系相关应用及时处理,

enq: TX - index contention

一、故障现象

某些DML语句执行缓慢,查询等待事件出现“enq: TX - index contention”等待。

二、可能故障原因

l  应用并发高,插入时索引块分裂

三、分析思路

四、应急处理

在enq: TX - index contention发生的实例上在查找是哪些SQL语句导致大量的争用:

set linesize 180
set pagesize 999
column username format a16
column event format a40
column sql_child_number format 9999 heading 'SQL Child|Number'
column blocking_instance format 9999 heading 'Blocker|Instance'
column blocking_session format 99999 heading 'Blocker|Session'
column FINAL_BLOCKING_INSTANCE format 9999 heading 'F_Blocker|Instance'
column FINAL_BLOCKING_SESSION format 99999 heading 'F_Blocker|Session'
column "Lock" format a6
column "Mode" format a6
Select sid,
       seq#,
       username,
       event,
       sql_id,
       sql_child_number,
       chr(to_char(bitand(p1, -16777216)) / 16777215) ||
       chr(to_char(bitand(p1, 16711680)) / 65535) "Lock",
       to_char(bitand(p1, 65535)) "Mode",
       blocking_instance,
       blocking_session,
       final_blocking_instance,
       final_blocking_session,
       last_call_et
  From v$session
 Where event = ' enq: TX - index contention’;

查看执行计划,检查这些SQL执行中涉及的对象,对象上涉及的索引对象又是哪些,生成查看执行计划的语句如下:

set lines 160 pages 999
set long 99999
col PLAN_TABLE_OUTPUT for a160
select sql_fulltext,
       parsing_schema_name,
       'select * from table(dbms_xplan.display_cursor(''' || sql_id ||
       ''',' || child_number || ',''all iostats last +PEEKED_BINDS''));' xplan
  from v$sql
 where sql_id = '&sql_id';

根据语句explain plan for生成到plan_table中,并获取涉及的索引:
explain plan for statement
set lines 200 pages 999
col tab format a40
col ind format a50
col column_name format a25
col column_position format 9
break on tab
with t1 as
 (select /*+ materialize */ distinct object_owner from plan_table where object_type = 'TABLE'),
t2 as
 (select /*+ materialize */ distinct object_name from plan_table where object_type = 'TABLE')
select table_owner || '.' || table_name tab,
       index_owner || '.' || index_name ind,
       column_name,
       column_position
  from dba_ind_columns
 where table_owner in (select * from t1)
   and table_name in (select * from t2)
 order by 1, 2, 3, 4;

或者采取更简便的方法,收集问题时段的AWR报告,检查Segments by Row Lock Waits部分指标:

手工产生一个AWR快照的方法如下:

exec DBMS_WORKLOAD_REPOSITORY.create_snapshot();

根据实际情况严重程度进行如下紧急处理:

a)   降低应用并发度,降低index争用;

b)   使用反向索引重建,或使用hash分区重建索引;

c)   如索引字段使用sequence,可考虑增加sequence cache来增加键值跳度避免争用。

相关文章

pt-kill工具的使用
pt-ioprofile工具包的使用
数据库管理-第216期 Oracle的高可用-01(20240703)
DBMS_REPAIR EXAMPLE SCRIPT WITH PARTITION
数据库事务的四大特性: ACID 
使用BBED修复损坏的SYSTEM文件头

发布评论