点击上方“IT那活儿”公众号--专注于企业全栈运维技术分享,不管IT什么活儿,干就完了!!!
背 景Oracle中,在备份端与主库同步的过程中由于网络原因或磁盘问题导致一个或多个归档日志丢失,进而dataguard同步无法继续。很多人都会选择重新全库恢复,并重新搭建备库。那么除了重建备库的方式还有一种高效的方式通过对主库进行基于SCN的增量备份前滚物理备库。
以下就是模拟该场景的实验记录。
模拟环境数据库版本:
- Oracle 11.2.0.4
主:
- 192.168.80.202 source
- 192.168.80.203 source-vip
- 10.10.10.11 source-priv
- 192.168.80.204 source-scan
备:
- 192.168.80.212 sourcedg
- 192.168.80.213 sourcedg-vip
- 10.10.10.31 sourcedg-priv
- 192.168.80.214 sourcedg-scan
主库磁盘组信息:
- CRSDG
- DATA
备库磁盘组信息:
- CRSDG
-
DATADG
实验步骤3.1 模拟故障备库操作,备库取消归档应用,让备库处于只读模式,然后主库归档丢失。--关闭物理备库的ADG实时日志应用:
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
Database altered.
SQL> select dbid,name,current_scn,protection_mode,protection_level,database_role,force_logging,open_mode,switchover_status from v$database;
DBID NAME CURRENT_SCN PROTECTION_MODE PROTECTION_LEVEL DATABASE_ROLE FOR OPEN_MODE SWITCHOVER_STATUS
---------- --------- ----------- -------------------- -------------------- ---------------- --- -------------------- --------------------
3257785855 SOURCE 1756123 MAXIMUM PERFORMANCE MAXIMUM PERFORMANCE PHYSICAL STANDBY YES READ ONLY NOT ALLOWED
主库配置log_archive_dest_state_2的状态为defer,这里为了模拟实验,不将主库的归档传输过去。但实际情况下可能是由于网络故障,备库挂掉等等情况导致。3.2 主库模拟业务
SQL> ALTER system SET log_archive_dest_state_2 = 'defer';
System altered.
SQL> alter system switch logfile;
System altered.
SQL> /
System altered.
SQL> /
System altered.
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination +DATA
Oldest online log sequence 133
Next log sequence to archive 136
Current log sequence 136
SQL>
SQL> /
System altered.
SQL> SQL>
SQL> /
System altered.
SQL>
SQL> /
System altered.
SQL>
SQL> select count(1) from wangjh.bb;
COUNT(1)
----------
9996
SQL>
SQL>
SQL> alter system switch logfile;
System altered.
SQL> /
System altered.
SQL>
SQL> delete from wangjh.bb where rownum commit;
Commit complete.
SQL> select count(1) from wangjh.bb;
COUNT(1)
----------
4997
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination +DATA
Oldest online log sequence 140
Next log sequence to archive 143
Current log sequence 143
1)备库产生GAP2)查看主库归档情况
SQL> col name for a80
SQL> set linesize 9999 pagesize 9999
SQL> SELECT dest_id,
2 THREAD#,
3 NAME,
4 sequence#,
5 archived,
6 applied,
7 a.NEXT_CHANGE#
8 FROM v$archived_log a
9 WHERE a.sequence# >= 130
10 AND resetlogs_change# = (SELECT d.RESETLOGS_CHANGE# FROM v$database d)
11 and a.dest_id=1
12 ORDER BY a.THREAD#,a.sequence#;
DEST_ID THREAD# NAME SEQUENCE# ARC APPLIED NEXT_CHANGE#
---------- ---------- -------------------------------------------------------------------------------- ---------- --- --------- ------------
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_130.401.1158487567 130 YES NO 1751649
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_131.402.1158488767 131 YES NO 1753863
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_132.403.1158489969 132 YES NO 1755667
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_133.404.1158490481 133 YES NO 1756199
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_134.405.1158490481 134 YES NO 1756203
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_135.406.1158490483 135 YES NO 1756207
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_136.407.1158490493 136 YES NO 1756221
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_137.408.1158490495 137 YES NO 1756225
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_138.409.1158490495 138 YES NO 1756228
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_139.410.1158490651 139 YES NO 1756514
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_140.411.1158490653 140 YES NO 1756520
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_141.412.1158490683 141 YES NO 1756637
1 1 +DATA/source/archivelog/2024_01_17/thread_1_seq_142.413.1158490687 142 YES NO 1756643
13 rows selected
3)查看备库归档情况
SQL> col name for a80
SQL> set linesize 9999 pagesize 9999
SQL> SELECT dest_id,
2 THREAD#,
3 NAME,
4 sequence#,
5 archived,
6 applied,
7 a.NEXT_CHANGE#
8 FROM v$archived_log a
9 WHERE a.sequence# >= 130
10 AND resetlogs_change# = (SELECT d.RESETLOGS_CHANGE# FROM v$database d)
11 and a.dest_id=1
12 ORDER BY a.THREAD#,a.sequence#;
DEST_ID THREAD# NAME SEQUENCE# ARC APPLIED NEXT_CHANGE#
---------- ---------- -------------------------------------------------------------------------------- ---------- --- --------- ------------
1 1 +DATADG/dgsource/archivelog/2024_01_17/thread_1_seq_130.420.1158487567 130 YES YES 1751649
1 1 +DATADG/dgsource/archivelog/2024_01_17/thread_1_seq_131.422.1158488769 131 YES YES 1753863
1 1 +DATADG/dgsource/archivelog/2024_01_17/thread_1_seq_132.423.1158489969 132 YES YES 1755667
1 1 +DATADG/dgsource/archivelog/2024_01_17/thread_1_seq_133.424.1158490481 133 YES NO 1756199
可以看到,备库已经断档了,134 到142 都没有接收,接下来我们删除主库的归档日志,实验环境我们只模拟删除138 139 两支 归档日志。4)通过grid用户进入到asmcmd中,手动删除138 和 139的归档日志5)主库开启备库的归档
ALTER system SET log_archive_dest_state_2 = 'enable';
6)备库开启ADG实时日志应用
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING
CURRENT LOGFILE DISCONNECT FROM SESSION;
7)同时查看物理备库alert日志日志产生gap 缺少序列号为138和139的归档日志。因为备库是要按顺序去应用这些归档文件。所以中间有断层的情况下。是不会去应用。3.3 恢复步骤1)检查备库GAP情况2)发现有gap后,停止备库的mrp
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
3)确认数据文件头最小的SCN,便于后期进行该SCN的增量备份
SQL> select min(checkpoint_change#) from v$datafile_header
2 where file# not in (select file# from v$datafile where enabled = 'READ ONLY');
MIN(CHECKPOINT_CHANGE#)
-----------------------
1756225
4)主库进行基于SCN 1756225做增量备份
RMAN> BACKUP INCREMENTAL FROM SCN 1756225 DATABASE FORMAT
'/home/oracle/oracle_bk/ORADG11G/ForStandby_%U'tag'FORSTANDBY';
5)将备份集传输至备库
scp * oracle@192.168.80.212:/home/oracle/oracle_bk/ORADG11G/
6)在备库将拷贝过来的文件注册到控制文件信息中7)在备库恢复增量备份8)在主库备份控制文件以standby方式
RMAN> BACKUP CURRENT CONTROLFILE FOR STANDBY FORMAT
'/tmp/ForStandbyCTRL.bck';
9)将备份文件传至备库
scp ForStandbyCTRL.bck oracle@192.168.80.212:/home/oracle/oracle_bk/ORADG11G/
10)在备库查出datafile的信息是为了我们要恢复主库的控制文件,那么备库的数据文件可能不同于主库,所以记录一下备库的datafile信息。11)在备库上,restore备用控制文件12)将备库mount
RMAN> alter database mount;
13)如果数据文件的目录或名字不同于主库,需要catalog到备库中注:如果主备库具有相同的目录和名字,此步骤可以省略。我这里的测试环境datafile的路径主库是+DATA,备库是+DATADG。所以需要做这一步。14)确认在最小的SCN后是否有新的数据文件在主库增加通过以下SQL确认。注:如果返回0行,就可以通过switch 更新备库控制文件中数据文件的正确路径。
SQL> SELECT FILE#,NAME FROM V$DATAFILE WHERE CREATION_CHANGE# > 1756225;
no rows selected
注:如果在这个GAP期间,也就是在最小的SCN后主库增加了数据文件,那么在备库是不会自动创建的。15)在备库上,清理所有redo log日志组
SQL> select GROUP# from v$logfile where TYPE='STANDBY' group by GROUP#;
GROUP#
----------
5
6
7
8
9
SQL> ALTER DATABASE CLEAR LOGFILE GROUP 5;
SQL> ALTER DATABASE CLEAR LOGFILE GROUP 6;
SQL> ALTER DATABASE CLEAR LOGFILE GROUP 7;
…
16)在备库上,启动MRP进程
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;
Database altered.
17)查看主备的gap情况如果延迟追平后。那么可以起库。
SQL>select name,value from v$dataguard_stats;
NAME VALUE
-------------------- ----------------------------------------------------------------
transport lag
apply lag +00 00:00:00
apply finish time
estimated startup ti 24 me
18)查看数据已经一致,备库已恢复与主库一致
SQL> select count(1) from wangjh.bb;
COUNT(1)
----------
4997
总 结:
- 1)生产环境要避免这种情况发生。需要做好主备库的监控,如果有GAP产生了,需要及时定位原因(网络,IO,或者是硬件等问题)。
- 2)主备库根据归档的磁盘大小和产生的归档量制定好定时清理归档的脚本。
END