What is fencing?
The Red Hat High Availability Add-on uses fencing to ensure data integrity in the cluster. Fencing is often accomplished by powering the node off since a dead node is clearly not able to do anything. In other cases, a combination of operations will be used to cut the node off from the network (to stop new work from arriving) or from storage (to stop the node from writing to shared storage). Fencing is a necessary step in service and resource recovery in a cluster. The Red Hat High Availability Add-on will not start resource and service recovery for a nonresponsive node until that node has been fenced.
Cluster operation without fencing
Without fencing, data integrity on shared storage resources cannot be guaranteed. Consider a three-node cluster consisting of nodes A, B, and C, with no fencing devices configured. Node A has an ext4 file system mounted from shared storage and is running a web server serving pages from that file system. If node A stops responding on the network, the following chain of events is triggered:
- Node B mounts the file system from shared storage after performing a quick file system check.
- Node B starts the web service.
- Node A wakes up again and continues writing to the same ext4 file system that is now mounted on node B as well.
- File system corruption ensues.
Cluster operation with fencing
To prevent node A from corrupting the file system after node B has taken over the resource, the cluster must ensure that node A can no longer access the file system before any other node attempts to mount it. This procedure is called fencing.
With fencing configured, the chain of events would be slightly different:
- Node B and node C cut off node A from storage.
- Node B mounts the file system from shared storage after performing a quick file system check.
- Node B starts the web service.
- Node A wakes up again and attempts to write to the mounted file system. This fails because node A can no longer access the shared storage resource. Alternatively, node A is rebooted, comes up cleanly, and rejoins the cluster.
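The two chains of events above can be compared with a small simulation. This is only an illustrative sketch, not how the cluster software is implemented: the `SharedStorage` class, its methods, and the `failover` function are hypothetical names, and corruption is modeled simply as "a write happened while more than one node had the non-clustered file system mounted".

```python
class SharedStorage:
    """Toy model of a shared, non-cluster-aware file system (hypothetical)."""

    def __init__(self, nodes):
        self.allowed = set(nodes)  # nodes that can currently reach the storage
        self.mounted_by = set()    # nodes that have the file system mounted
        self.corrupted = False

    def cut_off(self, node):
        # Fencing: revoke access, which also invalidates the node's stale mount.
        self.allowed.discard(node)
        self.mounted_by.discard(node)

    def mount(self, node):
        if node not in self.allowed:
            raise PermissionError(f"node {node} is fenced from storage")
        self.mounted_by.add(node)

    def write(self, node):
        if node not in self.allowed:
            raise PermissionError(f"node {node} is fenced from storage")
        # Assumption of this model: a write while two nodes have the
        # file system mounted corrupts it.
        if len(self.mounted_by) > 1:
            self.corrupted = True


def failover(fencing_enabled):
    """Replay the scenario and return (result of A's late write, corrupted?)."""
    storage = SharedStorage(["A", "B", "C"])
    storage.mount("A")        # node A serves web pages from the file system
    # ... node A stops responding on the network ...
    if fencing_enabled:
        storage.cut_off("A")  # nodes B and C fence node A first
    storage.mount("B")        # node B takes over the resource
    try:
        storage.write("A")    # node A wakes up and keeps writing
        late_write = "succeeded"
    except PermissionError:
        late_write = "rejected"
    return late_write, storage.corrupted


print(failover(fencing_enabled=False))
print(failover(fencing_enabled=True))
```

In this model the only difference between the two runs is the `cut_off` call made before node B mounts the file system, which mirrors the ordering requirement described above.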
Fencing mechanism overview
There are two main methods of fencing: power fencing, also known as Shoot The Other Node In The Head (STONITH), and fabric fencing. Both fencing methods require a fence device, such as a power switch or the virtual fencing daemon, and fencing agent software to enable communication between the cluster and the fencing device. The fencing agent tells the fence device when a particular node should be fenced.
Power fencing
Power fencing entails cutting off power to a server. This fencing method is called STONITH, short for Shoot The Other Node In The Head. Two different kinds of power fencing devices exist:
- External fencing hardware that cuts off the power, such as a network-controlled power strip.
- Internal fencing hardware, such as iLO, DRAC, IPMI, or virtual machine fencing, that powers off the hardware of the node.
Power fencing can be configured to turn the target machine off and keep it off, or to turn it off and then on again. Turning a machine back on has the added benefit that the machine should come back up cleanly and rejoin the cluster if the cluster services have been enabled.
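The difference between the two power fencing behaviors can be sketched as a small state model. The `Node` class, its fields, and the `power_fence` function are hypothetical and exist only for illustration; the action names `"off"` and `"reboot"` are labels chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cluster_services_enabled: bool = True  # do cluster services start at boot?
    powered_on: bool = True
    in_cluster: bool = True


def boot(node):
    """Power the node on; it rejoins only if cluster services are enabled."""
    node.powered_on = True
    node.in_cluster = node.cluster_services_enabled


def power_fence(node, action):
    """Fence a node by cutting power, then optionally power it back on."""
    if action not in ("off", "reboot"):
        raise ValueError(f"unknown fence action: {action}")
    # Both actions begin by cutting power, which removes the node from the cluster.
    node.powered_on = False
    node.in_cluster = False
    if action == "reboot":
        boot(node)


stays_off = Node("A")
power_fence(stays_off, "off")
print(stays_off)

rejoins = Node("A")
power_fence(rejoins, "reboot")
print(rejoins)
```

A node rebooted with cluster services disabled ends up powered on but outside the cluster, which is why the benefit of turning a machine back on depends on those services being enabled.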
The following graphic shows an example of power fencing using a network-controlled power controller and two power supplies in a server.
Fabric fencing
Fabric fencing (SCSI fencing) entails disconnecting a machine from storage at the storage level. This can be done by closing ports on a Fibre Channel switch, or by using SCSI reservations. If a machine is fenced only using fabric fencing and not in combination with power fencing, it is the administrator’s responsibility to make sure that the machine will join the cluster again. This is usually done by rebooting or power-cycling the failed node.
The following graphic shows an example of fabric fencing using multipathed Fibre Channel storage.
Combining fencing methods
Fencing methods can be combined. When a node needs to be fenced, one fence device can cut off Fibre Channel by blocking ports on a Fibre Channel switch, and an iLO card can then power cycle the offending machine. Multiple fencing methods can also act as a backup for each other. For example, the cluster first attempts to fence a node with power fencing and, if that fails, falls back to fabric fencing.
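The backup behavior can be sketched as an ordered list of fencing levels that are tried in turn. This is a minimal illustration under stated assumptions, not the cluster's actual implementation: `fence_with_fallback`, `power_switch`, and `fc_switch` are hypothetical names, each device is modeled as a function returning `True` on success, and a level is assumed to succeed only when every device in it succeeds.

```python
def fence_with_fallback(node, levels):
    """Try each fencing level in order; return the 1-based level that worked.

    `levels` is an ordered list of levels, each a list of device callables.
    """
    for index, devices in enumerate(levels, start=1):
        # A level counts as successful only if all of its devices succeed.
        if all(device(node) for device in devices):
            return index
    raise RuntimeError(f"all fencing levels failed for node {node}")


def power_switch(node):
    # Simulate a power switch that cannot be reached.
    return False


def fc_switch(node):
    # Simulate a Fibre Channel switch that blocks the node's ports.
    return True


# Level 1: power fencing. Level 2: fabric fencing as a backup.
level_used = fence_with_fallback("A", [[power_switch], [fc_switch]])
print(f"node A fenced at level {level_used}")
```

Placing several devices in the same level models the combined case instead, for example a server with two power supplies where both outlets must be switched off for the fence to count.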
Setting Up Fencing Devices in a Pacemaker Cluster
Configuring Cluster Fencing Agents in a Pacemaker Cluster