[My Story with openGauss] Editing pg_hba.conf left an openGauss node unable to start, plus a CM primary/standby switchover
1. Normal status
[omm@Euler1 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------------------
1 Euler1 172.16.220.45 1 /database/opengauss/cm/cm_server Primary
2 Euler2 172.16.220.201 2 /database/opengauss/cm/cm_server Standby
3 Euler3 172.16.220.221 3 /database/opengauss/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
----------------------------------------------------------------------------------
1 Euler1 172.16.220.45 6001 /database/opengauss/data P Primary Normal
2 Euler2 172.16.220.201 6002 /database/opengauss/data S Standby Normal
3 Euler3 172.16.220.221 6003 /database/opengauss/data S Standby Normal
2. Modifying pg_hba.conf
[omm@Euler1 data]$ vi pg_hba.conf
host all all 172.16.221.6 sha256
host all all 172.16.221.118 sha256
3. Shutdown and restart
[omm@Euler1 data]$ gs_om -t stop
Stopping cluster.
=========================================
Successfully stopped cluster.
=========================================
End stop cluster.
[omm@Euler1 data]$ gs_om -t start
Starting cluster.
======================================================================
^CTraceback (most recent call last):
File "/database/opengauss/tool/script/gs_om", line 837, in <module>
main()
File "/database/opengauss/tool/script/gs_om", line 806, in main
impl.doStart()
File "/database/opengauss/tool/script/impl/om/OmImpl.py", line 88, in doStart
self.doStartCluster()
File "/database/opengauss/tool/script/impl/om/OLAP/OmImplOLAP.py", line 183, in doStartCluster
self.doStartClusterByCm()
File "/database/opengauss/tool/script/impl/om/OLAP/OmImplOLAP.py", line 169, in doStartClusterByCm
self.dataDir)
File "/database/opengauss/tool/script/gspylib/component/CM/CM_OLAP/CM_OLAP.py", line 279, in startCluster
result_set = CmdUtil.retryGetstatusoutput(cmd, retry_time=retry_times)
File "/database/opengauss/tool/script/base_utils/os/cmd_util.py", line 566, in retryGetstatusoutput
(status, output) = subprocess.getstatusoutput(cmd)
File "/usr/lib64/python3.7/subprocess.py", line 611, in getstatusoutput
data = check_output(cmd, shell=True, text=True, stderr=STDOUT)
File "/usr/lib64/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/usr/lib64/python3.7/subprocess.py", line 490, in run
stdout, stderr = process.communicate(input, timeout=timeout)
File "/usr/lib64/python3.7/subprocess.py", line 951, in communicate
stdout = self.stdout.read()
KeyboardInterrupt
The restart hangs and has to be interrupted with Ctrl-C.
[omm@Euler1 data]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------------------
1 Euler1 172.16.220.45 1 /database/opengauss/cm/cm_server Primary
2 Euler2 172.16.220.201 2 /database/opengauss/cm/cm_server Standby
3 Euler3 172.16.220.221 3 /database/opengauss/cm/cm_server Standby
[ Cluster State ]
cluster_state : Degraded
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
----------------------------------------------------------------------------------
1 Euler1 172.16.220.45 6001 /database/opengauss/data P Pending Starting
2 Euler2 172.16.220.201 6002 /database/opengauss/data S Primary Normal
3 Euler3 172.16.220.221 6003 /database/opengauss/data S Standby Normal
The other nodes have started, but the local node stays in Pending Starting.
[omm@Euler1 data]$ gs_om -t stop -h Euler1
Stopping node.
=========================================
Successfully stopped node.
=========================================
End stop node.
[omm@Euler1 data]$ gs_om -t start -h Euler1
Starting node.
======================================================================
Successfully started node.
======================================================================
End start node.
Successfully started node.
[omm@Euler1 data]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------------------
1 Euler1 172.16.220.45 1 /database/opengauss/cm/cm_server Standby
2 Euler2 172.16.220.201 2 /database/opengauss/cm/cm_server Primary
3 Euler3 172.16.220.221 3 /database/opengauss/cm/cm_server Standby
[ Cluster State ]
cluster_state : Degraded
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
----------------------------------------------------------------------------------
1 Euler1 172.16.220.45 6001 /database/opengauss/data P Pending Starting
2 Euler2 172.16.220.201 6002 /database/opengauss/data S Primary Normal
3 Euler3 172.16.220.221 6003 /database/opengauss/data S Standby Normal
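Stopping and restarting the local node makes no difference: it stays in Pending Starting, and because of the failure the primary role has failed over to another node.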
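Spotting a stuck node by eye in the status listing is easy to get wrong on a larger cluster. The following is a minimal sketch, assuming the `gs_om -t status --detail` output has the same whitespace-separated layout as the listings in this post; the function name `abnormal_datanodes` is hypothetical and not part of any openGauss tool.

```python
def abnormal_datanodes(status_text):
    """Return (node_name, state) pairs for datanodes whose last column is not Normal.

    Assumes the layout shown in this post: a "[ Datanode State ]" header,
    then rows that start with the numeric node id.
    """
    in_section = False
    bad = []
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Section headers look like "[ Datanode State ]".
            in_section = "Datanode State" in stripped
            continue
        fields = stripped.split()
        if in_section and fields and fields[0].isdigit():
            if fields[-1] != "Normal":
                # Columns: id, name, ip, instance, data dir, P/S flag, then role and state.
                bad.append((fields[1], " ".join(fields[6:])))
    return bad
```

Fed the second status listing above, it reports only Euler1 with the state `Pending Starting`. The text can be captured with `subprocess.run(["gs_om", "-t", "status", "--detail"], capture_output=True, text=True).stdout`.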
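4. Fix
A closer look showed that the IP addresses added to pg_hba.conf were malformed: no mask was written after the address. The entries were corrected and the node restarted.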
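A check like this could be run before restarting anything. Below is a minimal sketch, assuming the only mistake to catch is the one described here: a `host` line whose address is a bare IP with neither a `/prefix` suffix nor a separate netmask column. The function name `find_unmasked_hosts` is hypothetical, and the openGauss parser may accept or reject forms this sketch does not know about (host names, for instance). The post does not say which mask was finally used; for a single client address, `/32` is the usual choice, but that is an assumption.

```python
import ipaddress

HOST_TYPES = {"host", "hostssl", "hostnossl"}


def _is_ip(token):
    """True if token is a plain IPv4/IPv6 address (no /prefix)."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def find_unmasked_hosts(text):
    """Return (line_number, line) for host entries whose address has no mask."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) < 5 or fields[0] not in HOST_TYPES:
            continue
        address, following = fields[3], fields[4]
        # "a.b.c.d/32" fails _is_ip, so it is treated as already masked.
        # "a.b.c.d 255.255.255.255" has a netmask in the following column.
        if _is_ip(address) and not _is_ip(following):
            problems.append((lineno, line))
    return problems
```

Run against the two lines added in section 2, it flags both of them; rewritten as `172.16.221.6/32` and `172.16.221.118/32` they pass.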
[omm@Euler1 data]$ gs_om -t stop -h Euler1
Stopping node.
=========================================
Successfully stopped node.
=========================================
End stop node.
[omm@Euler1 data]$ gs_om -t start -h Euler1
Starting node.
======================================================================
Successfully started node.
======================================================================
End start node.
Successfully started node.
[omm@Euler1 data]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------------------
1 Euler1 172.16.220.45 1 /database/opengauss/cm/cm_server Standby
2 Euler2 172.16.220.201 2 /database/opengauss/cm/cm_server Primary
3 Euler3 172.16.220.221 3 /database/opengauss/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
----------------------------------------------------------------------------------
1 Euler1 172.16.220.45 6001 /database/opengauss/data P Standby Normal
2 Euler2 172.16.220.201 6002 /database/opengauss/data S Primary Normal
3 Euler3 172.16.220.221 6003 /database/opengauss/data S Standby Normal
After the fix the node starts normally. Switch the primary back to Euler1:
[omm@Euler1 data]$ gs_ctl switchover -D /database/opengauss/data
[2023-07-13 14:05:22.942][2167163][][gs_ctl]: gs_ctl switchover ,datadir is /database/opengauss/data
[2023-07-13 14:05:22.942][2167163][][gs_ctl]: switchover term (1)
[2023-07-13 14:05:22.947][2167163][][gs_ctl]: waiting for server to switchover........
[2023-07-13 14:05:27.975][2167163][][gs_ctl]: done
[2023-07-13 14:05:27.975][2167163][][gs_ctl]: switchover completed (/database/opengauss/data)
[omm@Euler1 data]$ gs_om -t status -detail
[GAUSS-50000] : Unrecognized parameter: -d.
[omm@Euler1 data]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------------------
1 Euler1 172.16.220.45 1 /database/opengauss/cm/cm_server Standby
2 Euler2 172.16.220.201 2 /database/opengauss/cm/cm_server Primary
3 Euler3 172.16.220.221 3 /database/opengauss/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
----------------------------------------------------------------------------------
1 Euler1 172.16.220.45 6001 /database/opengauss/data P Primary Normal
2 Euler2 172.16.220.201 6002 /database/opengauss/data S Standby Normal
3 Euler3 172.16.220.221 6003 /database/opengauss/data S Standby Normal
[omm@Euler1 data]$ gs_om -t refreshconf
Generating dynamic configuration file for all nodes.
Successfully generated dynamic configuration file.
One issue remains: the Datanode switchover succeeded, but the CMServer Primary is still on node 2.
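The mismatch between the two Primary roles can also be detected from the status text. This is a minimal sketch under the same assumption as before, namely that the output of `gs_om -t status --detail` matches the layout pasted in this post; `primary_nodes` is a hypothetical helper name.

```python
def primary_nodes(status_text):
    """Map "cm_server" and "datanode" to the node name currently holding Primary."""
    section = None
    primaries = {}
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # "[ CMServer State ]" -> "CMServer State"
            section = stripped.strip("[] ")
            continue
        fields = stripped.split()
        if not fields or not fields[0].isdigit():
            continue  # header or separator row
        if section == "CMServer State" and fields[-1] == "Primary":
            primaries["cm_server"] = fields[1]
        elif section == "Datanode State" and "Primary" in fields[6:]:
            # fields[5] is the P/S flag; role and state follow it.
            primaries["datanode"] = fields[1]
    return primaries
```

For the listing taken right after the switchover, the two values differ (Euler2 for cm_server, Euler1 for the datanode), which is exactly the situation described here.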
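5. CM switchover
So far I have not found a command that manually switches the CM primary; cm_ctl still only switches the database primary/standby. CM is just the failover component, though, so this has no functional impact. For the perfectionists, the options are to move all primary roles onto the same node, or to reproduce the error above on node 2 and let CM switch over automatically.
Since a switchover can be achieved by triggering a failover, it can of course also be triggered by killing a process directly, in this case cm_server on node 2.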
[omm@Euler2 ~]$ ps ux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
omm 1627 0.0 0.0 20056 9968 ? Ss Jun14 3:42 /usr/lib/systemd/systemd --user
omm 1629 0.0 0.0 24124 2812 ? S Jun14 0:00 (sd-pam)
omm 2422 0.4 0.0 18564 13608 ? S Jun14 172:21 /database/opengauss/app/bin/om_monitor -L /database/opengauss/log/omm/cm/om_monitor
omm 2396593 0.0 0.0 214116 3676 pts/0 S+ 11:53 0:00 -bash
omm 2611562 2.0 0.0 799340 24364 ? Sl 14:18 4:04 /database/opengauss/app/bin/cm_agent
omm 2611575 4.7 0.3 6505600 400692 ? Sl 14:18 9:23 /database/opengauss/app/bin/cm_server
omm 2611586 3.0 2.8 47986344 3774748 ? Sl 14:18 6:02 /database/opengauss/app/bin/gaussdb -D /database/opengauss/data -M pending
omm 2611593 0.0 0.0 1401088 75952 ? Sl 14:18 0:00 gaussdb fenced UDF master process
omm 2908296 0.0 0.0 15020 4772 ? S 17:37 0:00 sshd: omm@pts/1
omm 2908297 0.0 0.0 214088 3752 pts/1 Ss 17:37 0:00 -bash
omm 2908380 0.0 0.0 215868 3228 pts/1 R+ 17:37 0:00 ps ux
[omm@Euler2 ~]$ kill -9 2611575
On node 2, find the cm_server process and kill it; after the kill, om brings it back up automatically.
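Picking the PID out of the `ps ux` listing by hand is error-prone when cm_agent, cm_server and gaussdb sit next to each other. The following is a minimal sketch of extracting it programmatically, assuming the eleven-column `ps ux` layout shown above; `find_pids` is a hypothetical helper, not an openGauss utility.

```python
def find_pids(ps_output, binary_name):
    """Return PIDs whose command's executable basename equals binary_name."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        executable = fields[10].split()[0]
        if executable.rsplit("/", 1)[-1] == binary_name:
            pids.append(int(fields[1]))
    return pids
```

With the listing above, `find_pids(text, "cm_server")` yields the single PID that was passed to `kill -9`. Matching on the basename rather than a substring keeps cm_agent and om_monitor out of the result.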
[omm@Euler2 ~]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
--------------------------------------------------------------------------------
1 Euler1 10.236.160.45 1 /database/opengauss/cm/cm_server Primary
2 Euler2 10.236.160.201 2 /database/opengauss/cm/cm_server Standby
3 Euler3 10.236.160.221 3 /database/opengauss/cm/cm_server Standby
[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
----------------------------------------------------------------------------------
1 Euler1 10.236.160.45 6001 /database/opengauss/data P Primary Normal
2 Euler2 10.236.160.201 6002 /database/opengauss/data S Standby Normal
3 Euler3 10.236.160.221 6003 /database/opengauss/data S Standby Normal
At this point the CM primary and standby have swapped, and the CMServer Primary is back on node 1.