Oracle ASM磁盘组被误删除!他跑,她追,他插翅难飞!

经常删库的小伙伴们,经常听说一句话:"删库跑路,三年起步!"

说的就是变更不规范,亲人两行泪的故事。

那么究竟是什么导致故事变成了事故呢,这一切的背后究竟是人性的扭曲、良心的泯灭还是道德的沦丧呢?让我们回顾一下案发现场:

注意:

本故事纯属虚构,如有雷同,纯属巧合,请勿模仿。

又快到下班时间了,突然,DBA小C接到了客户的电话,反馈数据库服务器A系统磁盘空间快不足了,之前给ASM加的两块盘sdh,sdi还没开始用,先踢出ASM磁盘组,临时扩容下系统盘。

接到任务,粗心大意的小C只想快速完成工作,避免加班,所以他匆忙检查了sdh,sdi盘已经被加到新建的+CJC磁盘组里,既然客户说没在用,那就不检查了,直接删除+CJC磁盘组。

说干就干,直接执行drop diskgroup:

发现active diskgroup磁盘删除不了:

    [grid@cjc-db-02 ~]$ sqlplus as sysasm
    SQL> drop diskgroup CJC including contents;
    ORA-15039: diskgroup not dropped
    ORA-15027: active use of diskgroup "CJC" precludes its dismount

    一不做二不休,小C想到,别耽误我下班,直接停库操作:

      [oracle@cjc-db-02 ~]$ srvctl status database -d cjc
      Database is running.
      [oracle@cjc-db-02 ~]$ srvctl stop database -d cjc
      [oracle@cjc-db-02 ~]$ srvctl status database -d cjc
      Database is not running.

      再次执行,成功删除掉了+CJC磁盘组:

        [grid@cjc-db-02 ~]$ sqlplus as sysasm
        SQL> drop diskgroup CJC including contents;
        Diskgroup dropped.

        启动数据库,有报错:

          [oracle@cjc-db-02 ~]$ srvctl start database -d cjc
          PRCR-1079 : Failed to start resource ora.cjc.db
          CRS-2640: Required resource 'ora.CJC.dg' is missing.
          SQL> startup
          ORA-39511: Start of CRS resource for instance '222' failed with error:[CRS-2640: Required resource 'ora.CJC.dg' is missing.
          CRS-0222: Resource 'ora.cjc.db' has dependency error.
          clsr_start_resource:260 status:222
          clsrapi_start_db:start_asmdbs status:222

          尝试重新启动has,发现问题仍然没有解决:

            [root@cjc-db-02 bin]# ./crsctl stop has
            [root@cjc-db-02 bin]# ./crsctl start has

            原来是因为踢盘后,CRS信息还没有更新,需要通过 stvctl modify ...命令更新后才能启动数据库。

            正当小C准备更新CRS信息时,客户电话又打来了,先别踢盘了,业务同事反馈,sdh,sdi昨天晚上已经开始写业务数据了,不能删除!

            啊,这,,,新上线的库还没有来得及备份啊!

            小C抽了根烟冷静了一下,默默打开了BOSS直聘,过了一会又打开了12306,最后又抱着试一试的心态在浏览器上搜索了ASM磁盘组误操作后如何恢复?

            终于看到了希望,原来drop diskgroup只是逻辑删除,数据被覆盖之前还可以挽救回来,最终小C通过kfed工具成功的挽回了数据...

            本次案例小C深有感触,明白了对生产环境要时刻抱有敬畏之心,并更新了自己的微信签名,从原来的"乾坤未定,你我皆是黑马!"改成了"乾坤已定,你我皆是牛马!"

            下面,让我们看下如何使用kfed工具恢复误删除的磁盘组:

            环境说明:

              DB:Oracle 19.22单机+ASM
              OS:Oracle Linux 7.6

              启动has:

                [root@cjc-db-02 bin]# mount -o remount,size=5G dev/shm
                [root@cjc-db-02 bin]# ./crsctl start has

                查看资源状态:

                  [root@cjc-db-02 bin]# ./crsctl stat res -t
                  --------------------------------------------------------------------------------
                  Name Target State Server State details
                  --------------------------------------------------------------------------------
                  Local Resources
                  --------------------------------------------------------------------------------
                  ora.DATA.dg
                  ONLINE ONLINE cjc-db-02 STABLE
                  ora.LISTENER.lsnr
                  ONLINE ONLINE cjc-db-02 STABLE
                  ora.asm
                  ONLINE ONLINE cjc-db-02 Started,STABLE
                  ora.ons
                  OFFLINE OFFLINE cjc-db-02 STABLE
                  --------------------------------------------------------------------------------
                  Cluster Resources
                  --------------------------------------------------------------------------------
                  ora.cjc.db
                  1 OFFLINE OFFLINE Instance Shutdown,ST
                  ABLE
                  ora.cssd
                  1 ONLINE ONLINE cjc-db-02 STABLE
                  ora.diskmon
                  1 OFFLINE OFFLINE STABLE
                  ora.evmd
                  1 ONLINE ONLINE cjc-db-02 STABLE
                  --------------------------------------------------------------------------------

                  查看磁盘组信息:

                    [grid@cjc-db-02 ~]$ sqlplus as sysasm
                    set line 300
                    col name for a15
                    select GROUP_NUMBER,NAME,STATE,TYPE,TOTAL_MB,FREE_MB from v$asm_diskgroup;
                    GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB
                    ------------ --------------- ----------- ------ ---------- ----------
                    1 DATA MOUNTED EXTERN 12288 9740

                    查看磁盘信息:

                      col path for a30
                      select GROUP_NUMBER,DISK_NUMBER,STATE,REDUNDANCY,OS_MB,TOTAL_MB,FREE_MB,NAME,PATH from v$asm_disk order by 1,2;
                      GROUP_NUMBER DISK_NUMBER STATE REDUNDA OS_MB TOTAL_MB FREE_MB NAME PATH
                      ------------ ----------- -------- ------- ---------- ---------- ---------- --------------- ------------------------------
                      0 0 NORMAL UNKNOWN2048 0 0 dev/sdk
                      0 1 NORMAL UNKNOWN2048 0 0 dev/sdh
                      0 2 NORMAL UNKNOWN2048 0 0 dev/sdi
                      0 3 NORMAL UNKNOWN2048 0 0 dev/sdj
                      1 0 NORMAL UNKNOWN2048 2048 1624 DATA_0000 dev/sdb
                      1 1 NORMAL UNKNOWN2048 2048 1628 DATA_0001 dev/sdc
                      1 2 NORMAL UNKNOWN2048 2048 1620 DATA_0002 dev/sdd
                      1 3 NORMAL UNKNOWN2048 2048 1624 DATA_0003 dev/sde
                      1 4 NORMAL UNKNOWN2048 2048 1632 DATA_0004 dev/sdf
                      1 5 NORMAL UNKNOWN2048 2048 1612 DATA_0005 dev/sdg


                      10 rows selected.

                      新建磁盘组 CJC:

                        [grid@cjc-db-02 ~]$ asmca

                        查看磁盘组信息:

                          set line 300
                          col name for a15
                          select GROUP_NUMBER,NAME,STATE,TYPE,TOTAL_MB,FREE_MB from v$asm_diskgroup;
                          GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB
                          ------------ --------------- ----------- ------ ---------- ----------
                          1 DATA MOUNTED EXTERN 12288 9740
                          2 CJC MOUNTED EXTERN 4096 3988

                            col path for a30
                            select GROUP_NUMBER,DISK_NUMBER,STATE,REDUNDANCY,OS_MB,TOTAL_MB,FREE_MB,NAME,PATH from v$asm_disk order by 1,2;


                            GROUP_NUMBER DISK_NUMBER STATE REDUNDA OS_MB TOTAL_MB FREE_MB NAME PATH
                            ------------ ----------- -------- ------- ---------- ---------- ---------- ------------------------------ ------------------------------
                            0 0 NORMAL UNKNOWN2048 0 0 /dev/sdk
                            0 1 NORMAL UNKNOWN2048 0 0 /dev/sdj
                            1 0 NORMAL UNKNOWN2048 2048 1620 DATA_0000 /dev/sdb
                            1 1 NORMAL UNKNOWN2048 2048 1624 DATA_0001 /dev/sdc
                            1 2 NORMAL UNKNOWN2048 2048 1612 DATA_0002 /dev/sdd
                            1 3 NORMAL UNKNOWN2048 2048 1624 DATA_0003 /dev/sde
                            1 4 NORMAL UNKNOWN2048 2048 1624 DATA_0004 /dev/sdf
                            1 5 NORMAL UNKNOWN2048 2048 1608 DATA_0005 /dev/sdg
                            2 0 NORMAL UNKNOWN2048 2048 1980 CJC_0000 /dev/sdh
                            2 1 NORMAL UNKNOWN2048 2048 1996 CJC_0001 /dev/sdi


                            10 rows selected.

                            新增测试数据:

                              [oracle@cjc-db-02 ~]$ sqlplus / as sysdba
                              select name from v$dbfile;
                              NAME
                              --------------------------------------------------------------------------------
                              +DATA/CJC/DATAFILE/users.260.1165422711
                              +DATA/CJC/DATAFILE/undotbs1.259.1165422693
                              +DATA/CJC/DATAFILE/system.257.1165422411
                              +DATA/CJC/DATAFILE/sysaux.258.1165422581

                                create tablespace CJC datafile '+CJC' size 10M;
                                create user CJC identified by "a" default tablespace CJC;
                                grant dba to CJC;
                                conn CJC/a
                                create table t1 as select level as id from dual connect by level/tmp/sdh01

                                块号1备份

                                  kfed read /dev/sdh blkn=1 >/tmp/sdh02

                                  aun1备份

                                    kfed read /dev/sdh aun=1 >/tmp/sdh03

                                    磁盘头信息备份

                                      kfed read /dev/sdi >/tmp/sdi01

                                      块号1备份

                                        kfed read /dev/sdi blkn=1 >/tmp/sdi02

                                        aun1备份

                                          kfed read /dev/sdi aun=1 >/tmp/sdi03

                                          删除磁盘组,模拟误操作:

                                            [grid@cjc-db-02 ~]$ sqlplus / as sysasm
                                            SQL> drop diskgroup CJC including contents;
                                            ORA-15039: diskgroup not dropped
                                            ORA-15027: active use of diskgroup "CJC" precludes its dismount

                                            停库后继续删除:

                                              [oracle@cjc-db-02 ~]$ srvctl status database -d cjc
                                              Database is running.
                                              [oracle@cjc-db-02 ~]$ srvctl stop database -d cjc
                                              [oracle@cjc-db-02 ~]$ srvctl status database -d cjc
                                              Database is not running.

                                              成功删除掉CJC磁盘组:

                                                [grid@cjc-db-02 ~]$ sqlplus / as sysasm
                                                SQL> drop diskgroup CJC including contents;
                                                Diskgroup dropped.

                                                启动数据库报错:

                                                  [oracle@cjc-db-02 ~]$ srvctl start database -d cjc
                                                  PRCR-1079 : Failed to start resource ora.cjc.db
                                                  CRS-2640: Required resource 'ora.CJC.dg' is missing.


                                                  SQL> startup
                                                  ORA-39511: Start of CRS resource for instance '222' failed with error:[CRS-2640: Required resource 'ora.CJC.dg' is missing.
                                                  CRS-0222: Resource 'ora.cjc.db' has dependency error.
                                                  clsr_start_resource:260 status:222
                                                  clsrapi_start_db:start_asmdbs status:222

                                                  恢复误删除的磁盘组:

                                                  查看磁盘组信息:

                                                    set line 300
                                                    col name for a15
                                                    select GROUP_NUMBER,NAME,STATE,TYPE,TOTAL_MB,FREE_MB from v$asm_diskgroup;


                                                    GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB
                                                    ------------ --------------- ----------- ------ ---------- ----------
                                                    1 DATA MOUNTED EXTERN 12288 9704

                                                      col path for a30
                                                      select GROUP_NUMBER,DISK_NUMBER,STATE,REDUNDANCY,OS_MB,TOTAL_MB,FREE_MB,NAME,PATH from v$asm_disk order by 1,2;SQL>


                                                      GROUP_NUMBER DISK_NUMBER STATE REDUNDA OS_MB TOTAL_MB FREE_MB NAME PATH
                                                      ------------ ----------- -------- ------- ---------- ---------- ---------- --------------- ------------------------------
                                                      0 0 NORMAL UNKNOWN2048 0 0 /dev/sdk
                                                      0 1 NORMAL UNKNOWN2048 0 0 /dev/sdh
                                                      0 2 NORMAL UNKNOWN2048 0 0 /dev/sdi
                                                      0 3 NORMAL UNKNOWN2048 0 0 /dev/sdj
                                                      1 0 NORMAL UNKNOWN2048 2048 1620 DATA_0000 /dev/sdb
                                                      1 1 NORMAL UNKNOWN2048 2048 1624 DATA_0001 /dev/sdc
                                                      1 2 NORMAL UNKNOWN2048 2048 1612 DATA_0002 /dev/sdd
                                                      1 3 NORMAL UNKNOWN2048 2048 1620 DATA_0003 /dev/sde
                                                      1 4 NORMAL UNKNOWN2048 2048 1620 DATA_0004 /dev/sdf
                                                      1 5 NORMAL UNKNOWN2048 2048 1608 DATA_0005 /dev/sdg


                                                      10 rows selected.

                                                      恢复

                                                      --备份磁盘头信息

                                                        kfed read /dev/sdh >/tmp/sdh01xxx

                                                        块号1备份

                                                          kfed read /dev/sdh blkn=1 >/tmp/sdh02xxx

                                                          aun1备份

                                                            kfed read /dev/sdh aun=1 >/tmp/sdh03xxx

                                                            磁盘头信息备份

                                                              kfed read /dev/sdi >/tmp/sdi01xxx

                                                              块号1备份

                                                                kfed read /dev/sdi blkn=1 >/tmp/sdi02xxx

                                                                aun1备份

                                                                  kfed read /dev/sdi aun=1 >/tmp/sdi03xxx

                                                                  通过diff命令进行对比:

                                                                    [grid@cjc-db-02 ~]$ diff -C 1 /tmp/sdh01 /tmp/sdh01xxx|grep "kfdhdb.hdrsts"
                                                                    ! kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER
                                                                    ! kfdhdb.hdrsts: 4 ; 0x027: KFDHDR_FORMER

                                                                      [grid@cjc-db-02 ~]$ diff -C 1 /tmp/sdi01 /tmp/sdi01xxx|grep "kfdhdb.hdrsts"
                                                                      ! kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER
                                                                      ! kfdhdb.hdrsts: 4 ; 0x027: KFDHDR_FORMER

                                                                      通过对比删除前和删除后磁盘头数据的不同部分,可以看到kfdhdb.hdrsts行有差异。

                                                                      尝试改回删除前磁盘头的信息:

                                                                      --修改sdh01xxx文件

                                                                      kfdhdb.hdrsts:状态值从4改为3,状态名称从KFDHDR_FORMER改为KFDHDR_MEMBER

                                                                      --修改sdi01xxx文件

                                                                      kfdhdb.hdrsts:状态值从4改为3,状态名称从KFDHDR_FORMER改为KFDHDR_MEMBER

                                                                        [grid@cjc-db-02 ~]$ vi /tmp/sdh01xxx

                                                                          kfdhdb.hdrsts: 4 ; 0x027: KFDHDR_FORMER

                                                                          改成

                                                                            kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER

                                                                              [grid@cjc-db-02 ~]$ vi /tmp/sdi01xxx

                                                                                kfdhdb.hdrsts: 4 ; 0x027: KFDHDR_FORMER

                                                                                改成

                                                                                  kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER

                                                                                  合并修改:

                                                                                    [grid@cjc-db-02 ~]$ kfed merge /dev/sdh text=/tmp/sdh01xxx
                                                                                    [grid@cjc-db-02 ~]$ kfed merge /dev/sdi text=/tmp/sdi01xxx

                                                                                    检查磁盘组:可以看到CJC磁盘组了

                                                                                      [grid@cjc-db-02 ~]$ sqlplus / as sysasm
                                                                                      set line 300
                                                                                      col name for a15
                                                                                      select GROUP_NUMBER,NAME,STATE,TYPE,TOTAL_MB,FREE_MB from v$asm_diskgroup;
                                                                                      GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB
                                                                                      ------------ --------------- ----------- ------ ---------- ----------
                                                                                      0 CJC DISMOUNTED 0 0
                                                                                      1 DATA MOUNTED EXTERN 12288 9704

                                                                                      挂载磁盘组

                                                                                        SQL> alter diskgroup CJC mount;

                                                                                          [grid@cjc-db-02 trace]$ tail -100f alert_+ASM.log
                                                                                          .....
                                                                                          2024-04-06T14:43:30.834474+08:00
                                                                                          SUCCESS: alter diskgroup CJC mount
                                                                                          WARNING: unknown state for diskgroup resource ora.CJC.dg, Return Value: 3

                                                                                          查看

                                                                                            set line 300
                                                                                            col name for a15
                                                                                            select GROUP_NUMBER,NAME,STATE,TYPE,TOTAL_MB,FREE_MB from v$asm_diskgroup;SQL> SQL>


                                                                                            GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB
                                                                                            ------------ --------------- ----------- ------ ---------- ----------
                                                                                            1 DATA MOUNTED EXTERN 12288 9704
                                                                                            2 CJC MOUNTED EXTERN 4096 3976

                                                                                            可以正常启动数据库:

                                                                                              [oracle@cjc-db-02 ~]$ sqlplus / as sysdba
                                                                                              SQL*Plus: Release 19.0.0.0.0 - Production on Sat Apr 6 14:44:53 2024
                                                                                              Version 19.22.0.0.0
                                                                                              Copyright (c) 1982, 2023, Oracle.  All rights reserved.
                                                                                              Connected to an idle instance.


                                                                                              SQL> startup
                                                                                              ORACLE instance started.


                                                                                              Total System Global Area 1174402440 bytes
                                                                                              Fixed Size 8938888 bytes
                                                                                              Variable Size 369098752 bytes
                                                                                              Database Buffers 788529152 bytes
                                                                                              Redo Buffers 7835648 bytes
                                                                                              Database mounted.
                                                                                              Database opened.

                                                                                              可以看到CJC磁盘组的数据

                                                                                                SQL> select * from cjc.t1;


                                                                                                ID
                                                                                                ----------
                                                                                                1
                                                                                                2
                                                                                                3
                                                                                                4
                                                                                                5
                                                                                                6
                                                                                                7
                                                                                                8
                                                                                                9
                                                                                                10


                                                                                                10 rows selected.

                                                                                                参考:微信公众号"数据库运维之道"文章:【应知应会】使用kfed运维兵器修复ASM磁盘和磁盘组

                                                                                                【应知应会】使用kfed运维兵器修复ASM磁盘和磁盘组

                                                                                                数据库运维之道,公众号:数据库运维之道【应知应会】使用kfed运维兵器修复ASM磁盘和磁盘组

                                                                                                ###chenjuchao 20240406###