What is a “Split Brain”?
“Split brain” is often used to describe the scenario in which two or more nodes in a cluster lose connectivity with one another but continue to operate independently of each other, including acquiring logical or physical resources, under the incorrect assumption that the other node(s) are no longer operational or using those resources. In simple terms, “split brain” means that there are two or more distinct sets of nodes, or “cohorts”, with no communication between the cohorts.
For example:
Suppose there are 3 nodes in the following situation.
1. Nodes 1 and 2 can talk to each other.
2. But neither node 1 nor node 2 can talk to node 3, and vice versa.
Then there are two cohorts: {1, 2} and {3}.
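To make the cohort idea concrete, the short Python sketch below groups nodes into cohorts by treating the working network links as a graph and finding its connected components. The node numbers and link map are simply the three-node example above; this is an illustration only, not Oracle Clusterware code.

# Toy sketch: derive cohorts from a map of working network links.
def find_cohorts(nodes, links):
    """Group nodes into cohorts (connected components of the link graph)."""
    cohorts = []
    unvisited = set(nodes)
    while unvisited:
        start = unvisited.pop()
        cohort = {start}
        frontier = [start]
        while frontier:
            node = frontier.pop()
            for peer in links.get(node, ()):
                if peer in unvisited:
                    unvisited.remove(peer)
                    cohort.add(peer)
                    frontier.append(peer)
        cohorts.append(cohort)
    return cohorts

nodes = [1, 2, 3]
links = {1: [2], 2: [1], 3: []}    # node 3 has lost its interconnect links
print(find_cohorts(nodes, links))  # [{1, 2}, {3}]  (cohort order may vary)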
Why is this a problem?
The biggest risk following a Split-Brain event is the potential for corrupting system state. There are three typical causes of corruption:
1. The processes that were cooperating prior to the Split-Brain event independently modify the same logically shared state, leading to conflicting views of system state. This is often called the “multi-master problem”.
2. New requests are accepted after the Split-Brain event and then performed on potentially corrupted system state (thus potentially corrupting system state even further).
3. When the processes of the distributed system “rejoin” one another, they may have conflicting views of system state or resource ownership. During the process of resolving those conflicts, information may be lost or corrupted.
In simpler terms, in a split-brain situation, there are in a sense two (or more) separate clusters working on the same shared storage. This has the potential for data corruption.
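A minimal illustration of the “multi-master problem” from point 1: after the split, each cohort keeps accepting writes against its own copy of the same logical record, so the copies diverge and there is no single correct value to keep at rejoin time. The record name and values below are invented purely for the example.

shared_before_split = {"balance": 100}

cohort_a = dict(shared_before_split)   # copy modified by cohort {1, 2}
cohort_b = dict(shared_before_split)   # copy modified by cohort {3}

cohort_a["balance"] -= 30              # update accepted by cohort A after the split
cohort_b["balance"] -= 80              # conflicting update accepted by cohort B

print(cohort_a, cohort_b)              # {'balance': 70} {'balance': 20}
# On rejoin the two views conflict; whichever copy is discarded, a committed
# change is lost, which is exactly the corruption risk described above.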
How does clusterware resolve a “split brain” situation?
In a split brain situation, the voting disk is used to determine which node(s) survive and which node(s) are evicted. The common voting result is as follows (a simplified sketch of this decision appears after the list):
a. The cohort with more cluster nodes survives.
b. If both cohorts have the same number of nodes, the cohort containing the node with the lowest node number survives.
c. Improvements have been made to ensure that the node(s) with lower load survive when the eviction is caused by high system load.
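The sketch below encodes rules (a) and (b) above in a few lines of Python. It is only a simplified model of the documented behaviour; the real CSS reconfiguration algorithm is internal to Oracle Clusterware and, as noted in (c), newer versions also factor in node weight/load.

def surviving_cohort(cohorts):
    """Largest cohort wins; on a tie, the cohort containing the
    lowest node number wins."""
    return max(cohorts, key=lambda c: (len(c), -min(c)))

print(surviving_cohort([{1, 2}, {3}]))   # {1, 2} -> node 3 is evicted
print(surviving_cohort([{1}, {2}]))      # {1}    -> tie broken by lowest node number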
Commonly, one will see messages similar to the following in ocssd.log when a split brain happens:
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2015-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[ CSSD]2015-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
[ CSSD]2015-01-12 23:23:08.090 [1262557536] >ERROR: ###################################
[ CSSD]2015-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
###################################
The above messages indicate that communication from node 2 to node 1 is not working: node 2 sees only one node (itself, Size(1)), while node 1 is working fine and sees two nodes in the cluster (Size(2)). Because its cohort is smaller, node 2 aborted itself to avoid a split brain.
To ensure data consistency, each instance of a RAC database maintains a heartbeat with the other instances. The heartbeat is maintained by background processes such as LMON, LMD, LMS and LCK. If any of these processes experiences an IPC send timeout, a communications reconfiguration and instance eviction follow to avoid a split brain. The control file is used at the database layer in much the same way the voting disk is used at the clusterware layer, to determine which instance(s) survive and which are evicted. The voting result is similar to the clusterware voting result, and as a result one or more instances will be evicted.
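As a rough illustration of the instance-level idea, the sketch below marks peer instances as unreachable once their last heartbeat is older than a timeout, which is the kind of condition that leads to reconfiguration and eviction. The timeout value, data structures and function name are hypothetical; they do not reflect the actual LMON/LMD/LMS/LCK implementation.

import time

HEARTBEAT_TIMEOUT = 300   # seconds; illustrative value only, not an Oracle parameter

def unreachable_instances(last_heartbeat, now=None):
    """Return instance numbers whose last heartbeat is older than the timeout."""
    if now is None:
        now = time.time()
    return [inst for inst, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT]

now = time.time()
last_heartbeat = {1: now - 5, 2: now - 600}          # instance 2 stopped responding
print(unreachable_instances(last_heartbeat, now))    # [2] -> eviction candidate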
Common messages in the instance alert logs are similar to:
alert log of instance 1:
---------
Mon Dec 07 19:43:05 2011
IPC Send timeout detected.Sender: ospid 26318 Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
Waiting for instances to leave: 2
...
alert log of instance 2:
---------
Mon Dec 07 19:42:18 2011
IPC Send timeout detected. Receiver ospid 29940
Mon Dec 07 19:42:18 2011
Errors in file /u01/app/oracle/diag/rdbms/bd/BD2/trace/BD2_lmd0_29940.trc:
Trace dumping is performing id=[cdmp_20091207194307]
Mon Dec 07 19:42:20 2011
Waiting for clusterware split-brain resolution
Mon Dec 07 19:44:45 2011
ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
Mon Dec 07 19:44:51 2011
ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
Mon Dec 07 19:45:38 2011
ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
Mon Dec 07 19:52:27 2011
Errors in file /u01/app/oracle/diag/rdbms/bd/BD2/trace/PVBD2_lmon_29938.trc (incident=90153):
ORA-29740: evicted by member 0, group incarnation 10
Incident details in: /u01/app/oracle/diag/rdbms/bd/BD2/incident/incdir_90153/BD2_lmon_29938_i90153.trc
In the above example, instance 2's LMD0 (ospid 29940) is the receiver in the IPC send timeout.