What is Split-Brain?
The term “Split-Brain” is often used to describe the scenario when two or more co-operating processes in a distributed system, typically a high availability cluster, lose connectivity with one another but then continue to operate independently of each other, including acquiring logical or physical resources, under the incorrect assumption that the other process(es) are no longer operational or using the said resources.
What does “co-operating” mean?
Co-operating processes are those that use shared or otherwise related resources, including accessing or modifying shared system state, during the process of performing some coordinated action, typically at the request of a client.
What’s at risk?
The biggest risk following a Split-Brain event is the potential for corrupting system state. There are three typical causes of corruption:
- After the Split-Brain event, processes that were previously co-operating independently modify the same logically shared state, leading to conflicting views of the system state. This is often called the “multi-master problem”.
- New requests are accepted after the Split-Brain event and then performed on a potentially corrupted system state (thus potentially corrupting system state even further).
- When the processes of the distributed system “rejoin” together it is possible that they have conflicting views of system state or resource ownerships. During the process of resolving conflicts, information may be lost or become corrupted.
Examples of potential corruption include creating multiple copies of the same information, updating the same information multiple times, deleting information, creating multiple events for a single operation, processing an event multiple times, starting duplicate services, or suspending existing services.
What if the processes aren’t co-operating?
If the processes in a distributed system aren’t co-operating in any way (ie: they don’t use any shared resources), Split-Brain, or its effects, may not occur.
I thought Split-Brain was all about physical network infrastructure and connectivity failures. Can it really occur at the process level, on the same physical server?
While network infrastructure failure is one of the more common causes of Split-Brain, the loss of communication or connectivity between two or more processes on a single physical server, even running on a single processor, may also cause a Split-Brain event.
For example, if one of two co-operating processes on a server is swapped out for a long period of time, longer than the configured network or connectivity time-out between the processes, a Split-Brain may occur if each process continues to operate independently, especially when the swapped-out process returns to normal operation without taking into account that it has been unavailable for a long period of time.
Similarly, if a process is interrupted for a long period of time, say due to an unusually long Garbage Collection or because a physical processor is unavailable to the process due to heavy contention on virtualized infrastructure, the process may not respond to communication requests from another process, and thus a Split-Brain may occur.
Split-Brain does not require physical network infrastructure to occur.
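One way a process can at least notice that it has been swapped out or paused is to compare timestamps between regular "ticks". The following is a minimal sketch only; the `PauseDetector` class, its `tick` method and the five-second time-out are hypothetical names and values chosen for illustration, and a simulated clock is used so that the example is deterministic.

```python
import time


class PauseDetector:
    """Notices when this process was stalled for longer than the peer time-out."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_tick = clock()

    def tick(self):
        """Call regularly. Returns True if the gap since the previous call
        exceeded the time-out, meaning peers may already consider us lost."""
        now = self.clock()
        gap = now - self.last_tick
        self.last_tick = now
        return gap > self.timeout_s


# Simulated clock (an assumption for the example, not a real scheduler).
fake_now = [0.0]
detector = PauseDetector(timeout_s=5.0, clock=lambda: fake_now[0])

fake_now[0] = 1.0
normal = detector.tick()    # 1 second gap: nothing unusual

fake_now[0] = 31.0
stalled = detector.tick()   # 30 second gap: swapped out or stuck in a long GC
```

A process that sees `tick()` return `True` should not assume it still owns the resources it held before the pause. What it does next (re-join, isolate itself, shut down) is an application decision.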
Garbage Collection can’t cause a Split-Brain. Right?
Unfortunately not. As explained above, excessively long Garbage Collection or regular back-to-back Garbage Collections may make a process seem unavailable to other processes in a distributed system, because it is not in a position to respond to communication requests.
Split-Brain will only ever occur when a system has two processes. With three or more processes it can easily be detected and resolved. Right?
Unfortunately not. Split-Brain scenarios may just as easily occur when there are n processes (where n > 2) in a distributed system. For example, it’s possible that all n processes in a distributed system, especially on heavily loaded hardware, could be swapped out at the same point in time, thus effectively losing connectivity with each other, thus potentially creating an n-way Split-Brain.
As Split-Brains occur as the result of incorrectly performing some action due to an observed communication failure, the number of “pieces” of “brains” may be greater than two.
Splits always occur “down-the-middle”. Right?
Unfortunately not. Following on from above, as most distributed systems consist of large numbers of processes (large values of n), splits rarely occur “down-the-middle”, especially when n is an odd number! In fact, even when n is an even number, there is absolutely no guarantee that a split will contain two equally sized collections of processes. For example, a system consisting of five processes may be split such that one side of the system (ie: brain) may have three processes and the other side may have two processes. Alternatively, it may be split such that one side has four processes and the other has just one process. In a system with six processes, a split may occur with four processes on one side and just two on another.
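To make the uneven splits above concrete, the sketch below groups processes into "brains" given the pairs of processes that can still communicate. It assumes symmetric links and a global view of connectivity, which no real process in a distributed system has, so treat it purely as an illustration. The `fragments` function and the process names are hypothetical.

```python
def fragments(processes, links):
    """Group processes into fragments (connected components), given the
    pairs of processes that can still communicate with each other."""
    neighbours = {p: set() for p in processes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, result = set(), []
    for start in processes:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            p = stack.pop()
            if p in group:
                continue
            group.add(p)
            stack.extend(neighbours[p] - group)
        seen |= group
        result.append(sorted(group))
    return result


procs = ["p1", "p2", "p3", "p4", "p5"]

# p1, p2 and p3 can still talk; p4 and p5 can still talk; nothing crosses.
three_two = fragments(procs, [("p1", "p2"), ("p2", "p3"), ("p4", "p5")])

# p5 is cut off from everyone else.
four_one = fragments(procs, [("p1", "p2"), ("p2", "p3"), ("p3", "p4")])

# No links survive at all: a five-way split.
five_way = fragments(procs, [])
```

The same five processes produce a three-to-two, a four-to-one or a five-way split depending only on which links happen to fail.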
Stateless architectures don’t suffer from Split-Brain. Right?
Correct. Systems that don’t have a shared state or use shared resources typically don’t suffer from Split-Brain.
Client+Server architectures don’t suffer from Split-Brain. Right?
Correct, if and only if the Server component of the architecture operates as a single process. However, if the Server component of the architecture operates as a collection of processes, it’s possible that such an architecture will suffer from Split-Brain.
All Stateful architectures suffer from Split-Brain. Right?
The “statefulness” of an architecture does not imply that it will suffer from Split-Brain. It is entirely possible to define an architecture that is stateful and yet avoids the possibility of Split-Brain (as defined above) by ensuring no shared resources are accessed across processes.
The solution is simple – completely avoid Split-Brains by waiting longer for communication to recover, doing more checks, and avoiding assumptions. Right?
Unfortunately in most Split-Brain scenarios, all that can be observed is an inability to send and/or receive information. It is from these observations that systems must make assumptions about a failure. When these assumptions are incorrect, Split-Brain may occur.
While waiting longer for a communication response may seem like a reasonable solution, the challenge is not in waiting. The waiting part is easy. The challenge is determining “how long to wait” or “how many times to retry”.
Unfortunately, to determine “how long to wait” or “how many times to retry” we need to make some assumptions about process connectivity and ultimately process availability. The challenge here is that those assumptions may quickly become invalid, especially in a dynamically or arbitrarily loaded distributed system. For example, if there is a sudden spike in the number of requests, processes may pause more frequently (especially in the case of Garbage Collection or in virtualized environments) and thus increase the potential for a Split-Brain scenario to form.
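The trade-off can be seen with a very small sketch. The `suspected` function and all of the numbers below are hypothetical; they simply show that any fixed time-out is an assumption that is either too short for some pause or too long for some real failure.

```python
def suspected(last_heard_s, now_s, timeout_s):
    """True if a peer has been silent for longer than the time-out."""
    return (now_s - last_heard_s) > timeout_s


GC_PAUSE_S = 8.0  # hypothetical pause of a peer that is otherwise healthy

# A short time-out wrongly suspects the paused (but healthy) peer.
false_alarm_short = suspected(0.0, GC_PAUSE_S, timeout_s=5.0)

# A long time-out tolerates the pause...
false_alarm_long = suspected(0.0, GC_PAUSE_S, timeout_s=30.0)

# ...but a peer that really crashed at t=0 is still trusted at t=10
# and only suspected once the full 30 seconds have elapsed.
crash_seen_at_10s = suspected(0.0, 10.0, timeout_s=30.0)
crash_seen_at_31s = suspected(0.0, 31.0, timeout_s=30.0)
```

No single value of `timeout_s` avoids both problems, which is why waiting longer on its own is not a solution.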
Split-Brain only occurs in systems that use unreliable network protocols (ie: protocols other than TCP/IP). Right?
The network protocol used by a distributed system, whether a reliable protocol like TCP/IP or unreliable protocol like UDP (unicast or multicast) does not preclude a Split-Brain from occurring. As discussed above, a physical network is not required for a Split-Brain scenario to develop.
TCP/IP decreases the chances of a Split-Brain occurring. Right?
Unfortunately not. Following on from above, the operational challenge with the use of TCP/IP within a distributed system is that the protocol does not directly expose internal communication failures to the system, for example the ability to send from a process but not receive (often called “deafness”) or the ability to receive but not send (often called “muteness”). Instead, TCP/IP protocol failures are only reported after a fixed, typically operating system-designated time-out (usually configurable but typically set to seconds or even minutes by default). By contrast, the use of unreliable protocols such as UDP may highlight communication problems sooner, including the ability to detect “deafness” or “muteness”, and thus allow a system to take corrective action sooner.
Having all processes in a distributed system connected via a single physical switch will help prevent Split-Brain. Right?
While deploying a distributed system such that all processes are interconnected via a single physical switch may seem to reduce the chances of a Split-Brain occurring, the likelihood of a switch failing atomically, all at once, is extremely low. Typically when a switch fails, it does so in an unreliable and degrading manner. That is, some components of a switch will remain operational whereas others may become intermittent. Thus switches as a whole become intermittent before they fail completely (or are shut down completely for maintenance).
While a switch remains intermittent, the chances of a Split-Brain event occurring will be increased.
Can Split-Brain be solved?
Unfortunately, there is no generally applicable solution to Split-Brains. Essentially Split-Brains are an unsolvable artifact of all co-operating processes in distributed systems.
What are the best practices for dealing with Split-Brain?
While the problem of Split-Brains can’t be solved using a generalized approach, “prevention” and “cure” are possible.
There are essentially six approaches that may be used to prevent a Split-Brain from occurring:
- Use high quality and reliable network infrastructure.
- Provide multiple paths of communication between processes, so that a single observed communication failure does not trigger a Split-Brain.
- Avoid overloading physical resources so that processes are not swapped out for long periods of time.
- Avoid unexpectedly long Garbage Collection pauses.
- Ensure communication time-outs are long enough to prevent a Split-Brain occurring “too early” due to overloaded physical resources or long Garbage Collection pauses (the previous two approaches).
- Architect an application so that it uses few shared resources (this is rarely possible).
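As a rough illustration of the “multiple paths of communication” approach in the list above, a peer can be treated as unreachable only when every available path to it has failed. This is a sketch under assumptions: the `peer_unreachable` function and the path names are hypothetical, and how each path is actually probed is left out.

```python
def peer_unreachable(probe_results):
    """probe_results maps a path name to True if the probe over that path
    succeeded. The peer is only considered unreachable if every path failed."""
    if not probe_results:
        raise ValueError("at least one communication path is required")
    return not any(probe_results.values())


# One path is down, but the other still works: no failure is declared.
one_path_down = peer_unreachable({"primary_nic": False, "secondary_nic": True})

# Every path is down: only now is the peer treated as unreachable.
all_paths_down = peer_unreachable({"primary_nic": False, "secondary_nic": False})
```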
However, even after implementing all of these approaches, Split-Brain may still occur, in which case we need to focus on “cure”. There are essentially four approaches to “curing” a Split-Brain:
- Fail-Fast: As soon as a Split-Brain scenario is detected, the entire system or suitable processes in the system are immediately shut down to avoid the possibility of corruption.
- Isolation: A weaker form of Fail-Fast. Instead of shutting down processes of a system, they are simply isolated. When the Split-Brain is recovered, the isolated processes are re-introduced into the system as if they were new (ie: they drop any previously held information/assumptions to avoid corruption). After the Split-Brain event has occurred, the isolated processes may continue to perform work, but on re-introduction to the system their current information/processing will be lost.
- Fencing: A stronger form of Isolation, but still a weaker form of Fail-Fast. Fencing requires an additional constraint over Isolation in that fenced processes must stop execution immediately (and release resources), rather than continue to operate after a Split-Brain has occurred.
- Resolve Conflicts (assumes the approaches above have not been used)
When the communication channels have been recovered, it’s highly possible that there are conflicts at the resource level, including conflicts in the system state. By providing a “Conflict Resolving” interface, developers may provide an application-specific mechanism for resolving said conflicts, thus allowing a system to continue to operate. Of course, this is completely application dependent and development intensive, but provides the best way to recover.
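A “Conflict Resolving” interface might look something like the following sketch. It is not taken from any particular product: `merge_views`, `union_resolver` and the order data are hypothetical, and the resolver shown (a set union) is only sensible for data where keeping everything from both sides is acceptable.

```python
_MISSING = object()  # marks "this side has no entry for the key"


def merge_views(side_a, side_b, resolver):
    """Merge the state held by two sides of a recovered Split-Brain.
    The application-supplied resolver is called only for real conflicts."""
    merged = {}
    for key in sorted(side_a.keys() | side_b.keys()):
        a = side_a.get(key, _MISSING)
        b = side_b.get(key, _MISSING)
        if a is _MISSING:
            merged[key] = b            # only side B knows this key
        elif b is _MISSING or a == b:
            merged[key] = a            # only side A knows it, or both agree
        else:
            merged[key] = resolver(key, a, b)
    return merged


def union_resolver(key, a, b):
    """Application-specific rule: keep the orders recorded by either side."""
    return a | b


side_a = {"customer-1": {"order-10", "order-11"}, "customer-2": {"order-20"}}
side_b = {"customer-1": {"order-10", "order-12"}, "customer-3": {"order-30"}}

merged = merge_views(side_a, side_b, union_resolver)
```

Note that this sketch cannot tell a deletion on one side from an entry that was never created there, which is exactly the kind of information loss described earlier. Handling that correctly is part of what makes conflict resolution development intensive.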
How can a Split-Brain event be detected? How are they defined?
Unfortunately, there is no simple way to define or detect when a Split-Brain has occurred. While it’s fairly obvious and perhaps trivial in a system composed of only two processes, such systems are increasingly rare. Most distributed systems have tens if not thousands of processes. For example, if four processes in a five process system collectively lose contact with a single process, does that mean a four-to-one Split-Brain has occurred, or simply that a single process has failed?
The common (and unfortunately often naive) solution to this problem is to define what is called a Quorum, the idea being that those processes not belonging to the quorum should Fail-Fast (ie: be terminated or terminate themselves) or be Isolated. The typical way to define a quorum is to specify the minimum number of co-operating processes that must be collectively available to continue operating. Hence for the above example, with a quorum of “three processes”, the system would treat a failure of a single process not as a Split-Brain, but simply as a lost process. However, if the quorum were incorrectly defined to be “a single process”, a Split-Brain would occur, with a four-to-one split.
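The five-process example can be written down as a short sketch. The `has_quorum` function and the process names are hypothetical; the quorum sizes are the ones used in the example above.

```python
def has_quorum(visible_members, quorum_size):
    """A side may keep operating only if it can see at least
    quorum_size co-operating processes (including itself)."""
    return len(visible_members) >= quorum_size


four_side = {"p1", "p2", "p3", "p4"}   # four processes that still see each other
lone_side = {"p5"}                     # the process they lost contact with

# Quorum of three: the four carry on, the lone process should Fail-Fast
# or be Isolated. The event is treated as a lost process.
four_continues = has_quorum(four_side, quorum_size=3)
lone_continues = has_quorum(lone_side, quorum_size=3)

# Quorum of one: both sides believe they may continue, so a four-to-one
# Split-Brain occurs.
both_continue = (has_quorum(four_side, quorum_size=1)
                 and has_quorum(lone_side, quorum_size=1))
```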
Obviously, depending on size-based quorums may be problematic, as it’s yet another assumption we need to make: “how big should a quorum be?”. Often, however, the definition of a quorum is less about the number of processes that are collectively available and more about the roles or locality of those processes.
For example, if three processes in a five process system collectively lose contact with two other processes, but those two processes remain in contact, we essentially have a three-to-two Split-Brain. With a quorum that is defined using a “size-based” approach, the two processes may be failed-fast/isolated. However let’s consider that the three processes have also lost connectivity with a valuable resource (say a database or network attached storage device), but the other two processes have not. In this situation it’s often preferable that the two surviving processes should remain available and the other three processes should be failed-fast/isolated.
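The difference between the two rules can be sketched as follows. This is an illustration under assumptions: `choose_survivor`, `reaches_database` and the process names are hypothetical, and in practice each side has to make this decision locally, without a trustworthy view of the other side.

```python
def choose_survivor(fragments, reaches_resource):
    """Prefer the side that can still reach the critical resource;
    use size only to choose between sides that are otherwise equal."""
    return max(fragments, key=lambda f: (reaches_resource(f), len(f)))


three = frozenset({"p1", "p2", "p3"})
two = frozenset({"p4", "p5"})

# Hypothetical observation: only p4 and p5 can still reach the database.
db_reachable = {"p4", "p5"}


def reaches_database(fragment):
    return all(p in db_reachable for p in fragment)


size_based = max([three, two], key=len)                          # picks the three
resource_aware = choose_survivor([three, two], reaches_database)  # picks the two
```

A purely size-based rule keeps the three processes that can no longer reach the database, whereas the resource-aware rule keeps the two that can.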
Further, and as previously mentioned, it’s important to remember that failures in communication between processes are rarely observed to occur at the same time. Rather they are observed over a period of time, perhaps seconds, minutes or even hours. Thus when discussing “when” a Split-Brain occurs, we usually need to consider the entire period of time, during which there may be multiple failures and recovery events, to fully conclude that a Split-Brain has occurred.
How does Oracle Coherence deal with Split-Brain?
Internally Oracle Coherence uses a variety of both proprietary (UDP uni-cast and multi-cast based) and standard (TCP/IP) network technologies for inter-process communication and maintaining system health. These technologies are combined in multiple ways to enable highly scalable and high-performance one-to-one and one-to-many communication channels to be established and reliably observed, across hundreds (even thousands if you really like) of processes, with little CPU or network overhead.
For example, through the combined use of these technologies, Coherence can detect and appropriately deal with remote Garbage Collection pauses across a system in under a second (using commodity 1Gb switch infrastructure). The guiding philosophy, and main advantage, of this approach is to “prevent” a Split-Brain from occurring wherever possible. Should a Split-Brain occur, say due to a switch failure, Coherence does the following:
- Uses Isolation to ensure reliable system communication between the individual fragments of the Split-Brain. Without reliable communication, data integrity within the fragments can not be guaranteed (or recovered)
- Uses Fencing around processes that are deemed unresponsive (ie: deaf), but continue to communicate with other processes in the system. This is a form of dynamic blacklisting to ensure system-wide communication reliability and prevent Fenced processes corrupting state.
- Raises real-time programmatic events concerning process health. This enables developers to provide custom Fail-Fast algorithms based on system health changes. In Coherence these events are handled using MemberListeners.
- Raises real-time programmatic events concerning the locality and availability of partitions (ie: information). This enables developers to provide custom Fail-Fast algorithms based on partitions being moved or becoming unavailable due to catastrophic system failure. In Coherence these events may be handled using Backing Map Listeners and/or Partition Listeners.
- Uses a simple “largest side wins” size-based quorum rule to resolve resource (and state) ownership when a Split-Brain is recovered.