RAC cluster fails to start because the ora.cluster_interconnect.haip resource cannot start

July 20, 2024

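A production Oracle 11.2.0.4 RAC database in a Data Guard (DG) configuration had been running normally until its interconnect (heartbeat) network had to be changed. After the network change, the cluster services no longer started properly. The following errors were reported during startup: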

2024-07-16 13:21:34.094:
[/u01/app/11.2.0/grid/bin/orarootagent.bin(13408)]CRS-5818:Aborted command 'start' for resource 'ora.cluster_interconnect.haip'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/11.2.0/grid/log/sbhis1/agent/ohasd/orarootagent_root//orarootagent_root.log.
2024-07-16 13:21:38.097:
[ohasd(13172)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.cluster_interconnect.haip'. Details at (:CRSPE00111:) {0:0:2} in /u01/app/11.2.0/grid/log/sbhis1/ohasd/ohasd.log.
2024-07-16 13:22:38.117:
[/u01/app/11.2.0/grid/bin/orarootagent.bin(13408)]CRS-5818:Aborted command 'start' for resource 'ora.cluster_interconnect.haip'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/11.2.0/grid/log/sbhis1/agent/ohasd/orarootagent_root//orarootagent_root.log.
2024-07-16 13:22:42.120:
[ohasd(13172)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.cluster_interconnect.haip'. Details at (:CRSPE00111:) {0:0:2} in /u01/app/11.2.0/grid/log/sbhis1/ohasd/ohasd.log.
2024-07-16 13:23:42.133:
[/u01/app/11.2.0/grid/bin/orarootagent.bin(13408)]CRS-5818:Aborted command 'start' for resource 'ora.cluster_interconnect.haip'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/11.2.0/grid/log/sbhis1/agent/ohasd/orarootagent_root//orarootagent_root.log.
2024-07-16 13:23:46.136:
[ohasd(13172)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.cluster_interconnect.haip'. Details at (:CRSPE00111:) {0:0:2} in /u01/app/11.2.0/grid/log/sbhis1/ohasd/ohasd.log.
2024-07-16 13:24:46.149:
[/u01/app/11.2.0/grid/bin/orarootagent.bin(13408)]CRS-5818:Aborted command 'start' for resource 'ora.cluster_interconnect.haip'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/11.2.0/grid/log/sbhis1/agent/ohasd/orarootagent_root//orarootagent_root.log.
2024-07-16 13:24:50.152:
[ohasd(13172)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.cluster_interconnect.haip'. Details at (:CRSPE00111:) {0:0:2} in /u01/app/11.2.0/grid/log/sbhis1/ohasd/ohasd.log.
2024-07-16 13:24:50.158:
[ohasd(13172)]CRS-2807:Resource 'ora.asm' failed to start automatically.
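
When the alert log is long, it can help to tally how often each CRS message was raised against each resource. The following is a minimal sketch, not an Oracle tool: the function name `tally_crs_events` and the regular expression are assumptions based only on the message shapes in the excerpt above, and the sample lines are abbreviated copies of that excerpt.

```python
import re
from collections import Counter

# A CRS message code followed, later on the same line, by a quoted resource name.
# The pattern is an assumption derived from the three message shapes shown above.
CRS_EVENT = re.compile(r"(CRS-\d+):.*?resource '([^']+)'", re.IGNORECASE)


def tally_crs_events(lines):
    """Count (message code, resource name) pairs found in alert-log lines."""
    counts = Counter()
    for line in lines:
        match = CRS_EVENT.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Abbreviated sample lines (the trailing "Details at ..." part is cut off).
    sample = [
        "[ohasd(13172)]CRS-2757:Command 'Start' timed out waiting for response "
        "from the resource 'ora.cluster_interconnect.haip'.",
        "[ohasd(13172)]CRS-2807:Resource 'ora.asm' failed to start automatically.",
    ]
    for (code, resource), count in tally_crs_events(sample).most_common():
        print(code, resource, count)
```

Timestamp-only lines are ignored because they contain no CRS code.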

The errors above show that the ora.cluster_interconnect.haip resource failed to start, which in turn prevented the cluster from starting. To find out why HAIP failed, check /u01/app/11.2.0/grid/log/sbhis1/agent/ohasd/orarootagent_root//orarootagent_root.log. That log contains the following errors:

2024-07-16 13:20:35.030: [ USRTHRD][1570182912]{0:0:2} PROBE: got conflicting source ip 169.254.98.99, addr fa-16-3e-8e-ec-6e
2024-07-16 13:20:35.030: [ USRTHRD][1570182912]{0:0:2} PROBE: conflict detected src { 169.254.98.99, fa-16-3e-8e-ec-6e }, target { 0.0.0.0, 08-3a-88-cd-66-34 }
2024-07-16 13:20:35.092: [ora.ctssd][1585514240]{0:0:2} [start] with returnbuf
[ clsdmc][1585514240]CLSDMC.C returnbuflen=8, extraDataBuf=40, returnbuf=440A21A0
2024-07-16 13:20:35.094: [ora.ctssd][1585514240]{0:0:2} [start] clsdmc_respget return: status=0, ecode=0, returnbuf=[0x7fa5440a21a0], buflen=8
2024-07-16 13:20:35.094: [ora.ctssd][1585514240]{0:0:2} [start] Utils::getOracleHomeAttrib getEnvVar oracle_home:/u01/app/11.2.0/grid
2024-07-16 13:20:35.094: [ora.ctssd][1585514240]{0:0:2} [start] Utils::getOracleHomeAttrib oracle_home:/u01/app/11.2.0/grid
2024-07-16 13:20:35.095: [ora.ctssd][1585514240]{0:0:2} [start] PID 13890 from /u01/app/11.2.0/grid/ctss/init/sbhis1.pid
2024-07-16 13:20:35.095: [ora.ctssd][1585514240]{0:0:2} [start] }DaemonAgent::start
2024-07-16 13:20:35.095: [ora.ctssd][1585514240]{0:0:2} [start] translateReturnCodes, return = 0, state detail = Checkcb data [0x7fa5440a21a0]: mode[0x40] offset[0 ms].
2024-07-16 13:20:35.530: [ USRTHRD][1570182912]{0:0:2} Failed to check 169.254.111.70 on bond2
2024-07-16 13:20:35.530: [ USRTHRD][1570182912]{0:0:2} (null) category: 0, operation: , loc: , OS error: 0, other:
2024-07-16 13:20:35.530: [ USRTHRD][1570182912]{0:0:2} Starting Probe for ip 169.254.111.70
2024-07-16 13:20:35.530: [ USRTHRD][1570182912]{0:0:2} Transitioning to Probe State
2024-07-16 13:20:35.530: [ USRTHRD][1570182912]{0:0:2} Arp::sProbe {
2024-07-16 13:20:35.530: [ USRTHRD][1570182912]{0:0:2} Arp::sSend: sending type 1
2024-07-16 13:20:35.530: [ USRTHRD][1570182912]{0:0:2} Arp::sProbe }
2024-07-16 13:20:35.531: [ USRTHRD][1570182912]{0:0:2} PROBE: got conflicting source ip 169.254.111.70, addr fa-16-3e-39-76-01
2024-07-16 13:20:35.531: [ USRTHRD][1570182912]{0:0:2} PROBE: conflict detected src { 169.254.111.70, fa-16-3e-39-76-01 }, target { 0.0.0.0, 08-3a-88-cd-66-34 }
2024-07-16 13:20:36.031: [ USRTHRD][1570182912]{0:0:2} Failed to check 169.254.205.245 on bond2
2024-07-16 13:20:36.031: [ USRTHRD][1570182912]{0:0:2} (null) category: 0, operation: , loc: , OS error: 0, other:
2024-07-16 13:20:36.031: [ USRTHRD][1570182912]{0:0:2} Starting Probe for ip 169.254.205.245
2024-07-16 13:20:36.031: [ USRTHRD][1570182912]{0:0:2} Transitioning to Probe State
[ clsdmc][1585514240]CLSDMC.C returnbuflen=8, extraDataBuf=CC, returnbuf=440A2520
2024-07-16 13:20:36.096: [ora.ctssd][1585514240]{0:0:2} [start] clsdmc_respget return: status=0, ecode=0, returnbuf=[0x7fa5440a2520], buflen=8
2024-07-16 13:20:36.096: [ora.ctssd][1585514240]{0:0:2} [start] Start: Extended check return buffer: "Ì" with length of 8

Searching My Oracle Support (MOS) for the key error text "PROBE: got conflicting source" from the log above turns up a matching solution:

HAIP fails to start if default gateway is configured for VLAN for private network on network switch
Issue: HAIP fails to start if default gateway is configured for VLAN for private network on network switch

orarootagent_root.log shows: PROBE: conflict detected src { 169.254.12.247, }, target { 0.0.0.0, }

The solution is to remove default gateway setting on network switch for private network (VLAN), refer to Note 1366211.1 for more details.

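MOS Note 1366211.1 explains the cause: a gateway is configured on the VLAN that carries the interconnect network, so HAIP cannot obtain an address in the 169.254.x.x range, and ora.cluster_interconnect.haip therefore fails to start. This matches the log above, where each 169.254.x.x address that HAIP probes is reported as a conflict.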
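
To see which candidate addresses were rejected and which MAC address answered each probe, the conflict lines can be pulled out of orarootagent_root.log. This is a rough sketch under stated assumptions: the helper name `probe_conflicts` is hypothetical, and the pattern is modeled only on the "PROBE: conflict detected" lines quoted in this article.

```python
import re

# Modeled on lines such as:
#   PROBE: conflict detected src { 169.254.98.99, fa-16-3e-8e-ec-6e }, target { ... }
PROBE_CONFLICT = re.compile(
    r"PROBE: conflict detected src \{ ([\d.]+), ([0-9a-fA-F-]+) \}"
)


def probe_conflicts(lines):
    """Return unique (ip, mac) pairs from conflict lines, in first-seen order."""
    seen = []
    for line in lines:
        match = PROBE_CONFLICT.search(line)
        if match:
            pair = (match.group(1), match.group(2))
            if pair not in seen:
                seen.append(pair)
    return seen


if __name__ == "__main__":
    sample = [
        "2024-07-16 13:20:35.030: [ USRTHRD][1570182912]{0:0:2} PROBE: conflict "
        "detected src { 169.254.98.99, fa-16-3e-8e-ec-6e }, "
        "target { 0.0.0.0, 08-3a-88-cd-66-34 }",
    ]
    for ip, mac in probe_conflicts(sample):
        print(ip, mac)
```

The resulting list of MAC addresses is the kind of detail the network team is likely to ask for when tracing what is answering on the interconnect VLAN.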

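1. Solution 1: Adjust the network

Once the cause is known, the fix is straightforward: have the network team move this RAC's interconnect onto its own VLAN and remove the gateway setting from that VLAN. After this change, the cluster started normally.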

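2. Solution 2: Disable HAIP

Is there another way to solve the problem? Yes: disable the HAIP feature. HAIP, available since 11.2.0.2, is Oracle's mechanism for high availability and load balancing on the private network, and it is configured across multiple network interfaces. If NIC bonding is already configured at the operating system level, for example Linux bonding, HAIP can be disabled. The steps are as follows.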

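(1) Disable the HAIP resource

-- Run the following commands as root.

-- Stop CRS on all nodes

# /u01/app/11.2.0/grid/bin/crsctl stop crs

-- Run the following commands on each node in turn (finish node 1 before starting on node 2)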

# /u01/app/11.2.0/grid/bin/crsctl start crs -excl -nocrs
# /u01/app/11.2.0/grid/bin/crsctl stop res ora.asm -init
# /u01/app/11.2.0/grid/bin/crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init

(2) Modify the ASM resource dependencies

-- View the current dependencies of the ASM resource:
# /u01/app/11.2.0/grid/bin/crsctl stat res ora.asm -p -init
NAME=ora.asm
TYPE=ora.asm.type
ACL=owner:grid:rw-,pgrp:oinstall:rw-,other::r--,user:grid:rwx
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/oraagent%CRS_EXE_SUFFIX%
AUTO_START=restore
CARDINALITY=1
CHECK_ARGS=
CHECK_COMMAND=
CHECK_INTERVAL=1
CHECK_TIMEOUT=30
CLEAN_ARGS=
CLEAN_COMMAND=
DAEMON_LOGGING_LEVELS=
DAEMON_TRACING_LEVELS=
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION="ASM instance"
DETACHED=true
ENABLED=1
FAILOVER_DELAY=0
FAILURE_INTERVAL=3
FAILURE_THRESHOLD=5
GEN_USR_ORA_INST_NAME=+ASM2
HOSTING_MEMBERS=
LOAD=1
LOGGING_LEVEL=1
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
ORA_VERSION=11.2.0.4.0
PID_FILE=
PLACEMENT=balanced
PROCESS_TO_MONITOR=
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=600
SERVER_POOLS=
SPFILE=
START_ARGS=
START_COMMAND=
START_DEPENDENCIES=hard(ora.cssd,ora.cluster_interconnect.haip,ora.ctssd)pullup(ora.cssd,ora.cluster_interconnect.haip,ora.ctssd)weak(ora.drivers.acfs)
START_TIMEOUT=600
STATE_CHANGE_TEMPLATE=
STOP_ARGS=
STOP_COMMAND=
STOP_DEPENDENCIES=hard(intermediate:ora.cssd,shutdown:ora.cluster_interconnect.haip)
STOP_TIMEOUT=600
UNRESPONSIVE_TIMEOUT=180
UPTIME_THRESHOLD=1h
USR_ORA_ENV=
USR_ORA_INST_NAME=
USR_ORA_OPEN_MODE=mount
USR_ORA_OPI=false
USR_ORA_STOP_MODE=immediate
VERSION=11.2.0.3.0
## The output shows that the ASM resource depends on the HAIP resource (see START_DEPENDENCIES and STOP_DEPENDENCIES).

-- Modify the ASM dependencies (run on all nodes)
# /u01/app/11.2.0/grid/bin/crsctl modify resource ora.asm -attr "START_DEPENDENCIES='hard(ora.cssd,ora.ctssd)pullup(ora.cssd,ora.ctssd)weak(ora.drivers.acfs)'" -f -init
# /u01/app/11.2.0/grid/bin/crsctl modify resource ora.asm -attr "STOP_DEPENDENCIES=hard(intermediate:ora.cssd)" -f -init
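
The new dependency strings are simply the old ones with ora.cluster_interconnect.haip removed from every group. To derive them mechanically from the `crsctl stat res ... -p` output rather than editing by hand, something like the following sketch could be used. The helper `drop_resource` is hypothetical, and it assumes the flat `kind(member,member)` syntax seen in the output above, with no nested parentheses.

```python
import re

# One dependency group, e.g. "hard(ora.cssd,ora.ctssd)".
GROUP = re.compile(r"(\w+)\(([^)]*)\)")


def drop_resource(dependencies, resource):
    """Remove `resource` from every dependency group; drop groups left empty."""
    kept = []
    for kind, members in GROUP.findall(dependencies):
        names = [
            member
            for member in members.split(",")
            # A member may carry a modifier prefix such as "shutdown:".
            if member.split(":")[-1] != resource
        ]
        if names:
            kept.append(f"{kind}({','.join(names)})")
    return "".join(kept)


if __name__ == "__main__":
    start = (
        "hard(ora.cssd,ora.cluster_interconnect.haip,ora.ctssd)"
        "pullup(ora.cssd,ora.cluster_interconnect.haip,ora.ctssd)"
        "weak(ora.drivers.acfs)"
    )
    print(drop_resource(start, "ora.cluster_interconnect.haip"))
```

Comparing the generated string with the value passed to `crsctl modify` is a cheap way to catch a typo before changing a resource with `-f`.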

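(3) Modify the cluster_interconnects parameter

-- Set cluster_interconnects for the ASM instances
-- Run as the grid user; replace the values with the actual private network addresses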
SQL> alter system set cluster_interconnects='10.10.10.11' scope=spfile sid='+ASM1';
SQL> alter system set cluster_interconnects='10.10.10.12' scope=spfile sid='+ASM2';

-- Set cluster_interconnects for the DB instances
-- Run as the oracle user; replace the values with the actual private network addresses
SQL> alter system set cluster_interconnects='10.10.10.11' scope=spfile sid='sbhis1';
SQL> alter system set cluster_interconnects='10.10.10.12' scope=spfile sid='sbhis2';
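
With more than two nodes, typing these statements by hand invites mistakes such as assigning one node's address to another instance. A small generator along these lines may help. It is only a sketch: `interconnect_statements` is a hypothetical helper, the SID-to-address mapping is whatever applies to your cluster, and the statement text simply mirrors the commands above.

```python
import ipaddress


def interconnect_statements(addresses):
    """Build one ALTER SYSTEM statement per (sid, ip) entry in `addresses`."""
    statements = []
    for sid, ip in addresses.items():
        parsed = ipaddress.ip_address(ip)  # raises ValueError on malformed input
        if parsed.is_link_local:
            # 169.254.x.x is the HAIP range, not a static private address.
            raise ValueError(f"{ip} is a link-local address")
        statements.append(
            f"alter system set cluster_interconnects='{ip}' "
            f"scope=spfile sid='{sid}';"
        )
    return statements


if __name__ == "__main__":
    # Addresses taken from the example in this article.
    for statement in interconnect_statements(
        {"+ASM1": "10.10.10.11", "+ASM2": "10.10.10.12"}
    ):
        print(statement)
```

Rejecting 169.254.x.x values is a deliberate guard: copying an old HAIP address into cluster_interconnects would defeat the purpose of disabling HAIP.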

(4) Restart the cluster

# /u01/app/11.2.0/grid/bin/crsctl stop crs -f
# /u01/app/11.2.0/grid/bin/crsctl start crs

-- After the restart, check whether HAIP is disabled
$ crsctl stat res -t -init

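If ora.cluster_interconnect.haip is shown as OFFLINE, it is disabled. Then run the following command:

ifconfig -a or ip a |grep 169.254

Check whether any addresses starting with 169.254 remain. If there are none, HAIP has been disabled successfully.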
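
The same check can be scripted across nodes by scanning saved `ip a` or `ifconfig -a` output for link-local IPv4 addresses. The following is a minimal sketch: `link_local_addresses` is a hypothetical helper, and the sample text is made up for illustration rather than captured from the system in this article.

```python
import ipaddress
import re

# Matches "inet 10.0.0.1/24" (ip a) and "inet addr:10.0.0.1" (older ifconfig).
# "inet6" lines are skipped because "inet" must be followed by whitespace.
INET = re.compile(r"\binet\s+(?:addr:)?(\d{1,3}(?:\.\d{1,3}){3})")


def link_local_addresses(output):
    """Return the 169.254.0.0/16 addresses assigned in interface listing text."""
    found = []
    for match in INET.finditer(output):
        if ipaddress.ip_address(match.group(1)).is_link_local:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    # Illustrative sample only.
    sample = (
        "    inet 10.10.10.11/24 brd 10.10.10.255 scope global bond2\n"
        "    inet6 fe80::1/64 scope link\n"
    )
    leftovers = link_local_addresses(sample)
    print("HAIP addresses still present:" if leftovers else "No 169.254.x.x addresses", leftovers)
```

An empty result corresponds to the "no 169.254 addresses" condition described above.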
