CRS Is Not Starting: "has a disk HB, but no network HB, DHB has rcfg"


Oracle CRS is not starting: "has a disk HB, but no network HB, DHB has rcfg..." in ocssd.log


Node2 was terminated and its cluster services did not come back up automatically. When we tried to start the cluster services on Node2, we found the error below in the ocssd.log file:

2018-04-01 00:03:27.519: [    CSSD][1025612096]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
2018-04-01 00:03:28.017: [    CSSD][1020881216]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2018-04-01 00:03:28.050: [    CSSD][1016133952]clssnmvDHBValidateNCopy: node 1, act-racnode01, has a disk HB, but no network HB, DHB has rcfg     299789247, wrtcnt, 353554863, LATS 492944, lastSeqNo 353554860, uniqueness 1520021542, timestamp 1522521207/2570305164
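For reference, in 11.2 the ocssd.log normally lives under the Grid Infrastructure home, so a quick way to watch it while troubleshooting (the grid home and host directory below are taken from this setup as examples; adjust them for your environment) is:
[root@<node2> <node2>]# tail -f /u01/app/11.2.0/grid/log/<node2>/cssd/ocssd.log    # usual 11.2 location: $GRID_HOME/log/<hostname>/cssd/ocssd.log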

If you run a CRS check, you will see the following:
[root@<node2> <node2>]# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
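For a more detailed view of which clusterware daemons are actually running, you can also query the init resources (a standard 11.2 command; the exact output depends on your environment):
[root@<node2> <node2>]# /u01/app/11.2.0/grid/bin/crsctl stat res -t -init
In this kind of failure ora.cssd is usually OFFLINE or stuck in STARTING while the lower-level daemons (ora.gpnpd, ora.gipcd) are online, which again points at CSSD not being able to reach the other node over the private network.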
From this we can tell that the failure is caused by a loss of interconnect communication between Node2 and Node1.
First, check ping and SSH between the nodes over the interconnect interface.
If they are not working, then there is a problem with the network connection between the cluster nodes. Fix that problem and CRS will start correctly.
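As a rough sketch of those checks (the interface name and private hostname below are only examples; take the real ones from oifcfg and your /etc/hosts):
[root@<node2> <node2>]# /u01/app/11.2.0/grid/bin/oifcfg getif          # shows which interface is registered as cluster_interconnect
[root@<node2> <node2>]# ping -c 3 <node1-priv>                         # private hostname/IP of node1
[root@<node2> <node2>]# ssh <node1-priv> hostname                      # basic connectivity over the private network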

But if ping and SSH do work between the nodes over the interconnect interface and ocssd.log still complains about the interconnect heartbeat (no network HB), then the interconnect interface is jammed. You can try restarting it to fix this. Note that it is usually the interconnect interface on the working node that needs to be restarted, as the ocssd.log message suggests (it is complaining about node1). For example, if CRS is not starting on node2, restart the interconnect interface on node1.
In my case, SSH and ping over the private IP were working fine, so we concluded that the interconnect was jammed and went ahead with a restart of the private interconnect.
The commands below need to be run on the node that is running successfully. In my case that is Node1, which is running without issues, so I restarted the eth1 interface on Node1:
[root@<node1> <node1>]# ifdown eth1
[root@<node1> <node1>]# ifup eth1
And check that eth1 is looking ok:
[root@<node1> <node1>]# ifconfig
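"Looking ok" here mainly means the interface is UP and RUNNING with the correct private IP, and the link is detected. A quick sketch of that check (eth1 is the interconnect interface in this example; ethtool may not be installed on every system):
[root@<node1> <node1>]# ifconfig eth1 | grep -E "UP|inet "             # expect UP ... RUNNING and the private inet addr
[root@<node1> <node1>]# ethtool eth1 | grep "Link detected"            # expect: Link detected: yes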
After the interface restart, the clusterware on node2 starts again.
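With the private interface back, CSSD on node2 can see the network heartbeat again and the stack rejoins the cluster. If it does not come up on its own, it can be started and verified manually (same grid home as above):
[root@<node2> <node2>]# /u01/app/11.2.0/grid/bin/crsctl start crs
[root@<node2> <node2>]# /u01/app/11.2.0/grid/bin/crsctl check crs
On a healthy node the check should now report Oracle High Availability Services, Cluster Ready Services, Cluster Synchronization Services and Event Manager as online.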
