A key aspect of Red Hat cluster design is that a system must be configured with at least one fencing device to ensure that the services that the cluster provides remain available when a node in the cluster encounters a problem. Fencing is the mechanism that the cluster uses to resolve issues and failures that occur. When you design your cluster services to take advantage of fencing, you can ensure that a problematic cluster node will be cut off quickly and the remaining nodes in the cluster can take over those services, making for a more resilient and stable cluster.
What is Fencing?
If communication with a single node in the cluster fails, then the other nodes in the cluster must be able to restrict or release access to resources that the failed cluster node may have access to. This cannot be accomplished by contacting the cluster node itself, as the node may be unresponsive. Instead, you must provide an external method, called fencing, which is carried out by a fence agent. A fence device is an external device that the cluster can use to restrict access to shared resources by an errant node, or to issue a hard reboot on the cluster node.
Without a fence device configured, you do not have a way to know that the resources previously used by the disconnected cluster node have been released, and this could prevent the services from running on any of the other cluster nodes. Conversely, the system may assume erroneously that the cluster node has released its resources and this can lead to data corruption and data loss. Without a fence device configured, data integrity cannot be guaranteed and the cluster configuration will be unsupported.
When fencing is in progress, no other cluster operation is allowed to run. This includes failing over services or granting new locks for GFS or GFS2 file systems. Normal operation of the cluster cannot resume until fencing has completed or until the cluster node rejoins the cluster after it has been rebooted.
Failure Scenarios and Fencing
There are many failure scenarios in which properly configured fencing can help ensure that your cluster service remains operational when the nodes in a cluster lose communication with each other, whether the loss of communication is due to a node going down or to a lost network connection.
GFS2 file system in a cluster
If a node in a cluster running a GFS2 file system loses the ability to communicate with the other nodes in the cluster, that node can continue to access data without being able to communicate its locks to the other cluster nodes. If the cluster does not successfully fence that node, the nodes cannot respect the boundaries of what the other nodes are accessing, a situation that can lead to data corruption.
single-node file system in a cluster
If a node in a cluster that is accessing a single-node file system loses cluster communication, that node may still be holding on to resources that are needed by a cluster service. Fencing that node allows you to recover those resources quickly and keep your service running.
shared IP address in a cluster
If you are hosting a shared IP address through which your applications access a high availability resource, the IP address is active on one node. If that node loses the ability to communicate with the other nodes in the cluster, another node will attempt to recover the resources that were running on that node and will start up the IP address, creating an IP conflict. An IP conflict can result in undefined behavior, interrupting the cluster service and causing an application to hang. Successfully fencing the node prevents this conflict of two nodes using the same IP address on the network.
If a node needs to be removed from the cluster for any reason, then in order for another node to provide a service that the node holds, it must reclaim the resources from that node. There is no guarantee that the service will be able to run on the new node unless the original node is fenced, freeing the resources for another node. Fencing provides a way to resolve the resource conflicts that can result when a node becomes problematic.
When a node becomes unresponsive and an entire application fails over to another node, if the first node has not been fenced, it may not have relinquished the application, and both nodes will try to run the application at the same time. This could, for example, cause two nodes to try to access a database at the same time.
Even in a situation where a database resource is external to your node or in another cluster but the application is running in your cluster, if the nodes in your cluster are not communicating with each other and you do not successfully fence a node, the application can be running on two nodes of your cluster. Each node may then submit requests to the application independently. This may not result in hard corruption, but it can yield inconsistent data.
degraded cluster service
In some situations, a node may become unresponsive because of resource starvation on that node. The node may be servicing a high load when an application causes the node to run out of memory. The node may still be running the application but in a degraded fashion. In this case, it is important that you fence the node to keep the service running at full capacity. If fencing fails, you may not be alerted to a problem and the node will continue to try to run the application.
If your storage fails or is unresponsive for some reason, operations occurring on that node are going to block and be unresponsive. I/O operations on that device may not return for many minutes, or perhaps not at all. Attempts to unmount a shared file system may hang. Even though in this case the node is still running the application, it is important to fence that node in order to enable other nodes to provide that service. If you configure your system with storage fencing, you can cut off a node's access to the storage, which should address the blocking, freeing up the resources and taking them out of any blocked state.
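As one illustration, storage fencing can be implemented with SCSI reservations using the fence_scsi agent. The sketch below is a hedged example; the node names and device path are placeholders, and available parameters vary by release (check pcs stonith describe fence_scsi on your system):

```shell
# Create a SCSI reservation fence device; node names and the
# device path are placeholders for illustration.
pcs stonith create scsi-shooter fence_scsi \
    pcmk_host_list="node1.example.com node2.example.com" \
    devices="/dev/disk/by-id/wwn-0x5000c50015ea71ac" \
    meta provides=unfencing

# The provides=unfencing meta attribute tells the cluster to
# unfence a node (re-register its reservation key) before starting
# resources on it after a reboot.
```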
General Importance of Fencing in Maintaining Cluster Services
In sum, a Red Hat High Availability Add-On cluster relies on fencing to ensure that the cluster services remain available when a node in the cluster becomes problematic. Fencing is one of the diagnostic tools that the cluster uses to evaluate the services being provided and to ensure that those services can be moved to another node immediately when an issue is detected. When a node is fenced, it indicates that there is an underlying issue that needs to be addressed. Issues that require fencing can arise at any time during a system’s implementation, even many months after initial configuration.
Fence Device Types
A fence device is an external device, controlled by a fence agent, that the cluster can use to restrict an errant node's access to shared resources or to issue a hard reboot of the cluster node. The two most common types of fencing are:
- Power fence agents: The cluster software logs in via telnet, ssh, or SNMP to the device, such as an APC switch, Dell DRAC, HP iLO, IBM RSA, or similar device, and turns off (and optionally back on) the power for the cluster node. This method executes a hard “off” action. Some fence agents require that ACPI soft-off be disabled on the node.
- I/O fence agents: The cluster software logs in to a Fibre Channel switch via telnet or ssh and disables the port(s) for that node, thereby cutting off its access to shared storage. This method requires that an administrator manually reboot or shut down the errant node to recover it and log in to the switch interface to re-enable the appropriate port(s). A similar result can also be achieved via SCSI reservation fencing.
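As a sketch of how a power fence device is defined in a Pacemaker cluster, the example below creates an IPMI-based device with pcs. All values (device name, address, credentials, node name) are placeholders, and parameter names vary by agent and version; run pcs stonith describe fence_ipmilan to see the options your agent accepts:

```shell
# Create an IPMI power fence device for one node; every value
# shown here is a placeholder for illustration.
pcs stonith create ipmi-fence-node1 fence_ipmilan \
    ip=10.0.0.101 username=admin password='secret' lanplus=1 \
    pcmk_host_list=node1.example.com
```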
The following requirements should be observed when configuring a cluster:
- A fence device is required for all cluster nodes.
- The fence agent fence_manual is an unsupported fencing device that should only be used for testing.
- The cluster nodes must be able to access all fence devices regardless of the power state of the cluster node that will be fenced. For example, if the cluster node that will be fenced uses an onboard system management card (such as iLO, DRAC, or RSA), that fence device must have an external power source, so that the fence device still has power when the machine itself does not.
- For RHEL 6 and RHEL 7 High Availability clusters with Pacemaker, you must have the cluster property stonith-enabled set to true. Even with devices configured, if stonith-enabled is false, those devices will not be used and the configuration is functionally equivalent to not having any fence devices at all.
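The requirements above can be verified with pcs; the following is a minimal sketch of checking and setting the stonith-enabled property:

```shell
# Show the current value of stonith-enabled (true is the default)
pcs property list --all | grep stonith-enabled

# Ensure fencing is enabled so that configured devices are used
pcs property set stonith-enabled=true
```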
Fence Configuration Testing Overview
Fencing is a fundamental part of the Red Hat Cluster infrastructure. It is therefore important to validate that fencing is working properly, and that there is enough redundancy (such as multiple fence devices configured) and resiliency (a reliable and available fencing mechanism) in place. Red Hat recommends configuring a secondary fence device.
This section provides a checklist of general guidelines to follow to test your fence device configuration.
– Test your configuration as you go. After each configuration command, you should run a status command to ensure that you can connect to the device you created and that it is set up properly.
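For example, after each configuration step you might run the following status checks (these are standard pcs commands; the output format varies by release):

```shell
# Show each fence device and the node it is currently running on
pcs stonith status

# Show full cluster status, including any failed actions
pcs status
```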
– Check for any errors displayed by the pcs status command. It is common to see a timeout status; this can occur if you are configuring multiple devices at the same time, or if multiple nodes are starting at the same time, and it is often temporary. You should check the error code, however, to determine whether something more is wrong. If you check status as you go and see an error as soon as it occurs, then you know the following things:
- The error was for the fence device you have just configured.
- The error was for a start operation.
- The error was for the device running on this node.
From this information, you can sometimes determine how to address the error.
– After ensuring that your fence device is running, panic your machine and check whether the node was fenced.
# echo c > /proc/sysrq-trigger
– If this does not fence the node, check your fence configuration. For example, if you have used a host map you should ensure that the system can find the node using the hostname you have provided.
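For instance, when the names the fence device knows differ from the cluster node names, Pacemaker's pcmk_host_map parameter maps between them in node:port pairs separated by semicolons. The device name and values below are hypothetical:

```shell
# Map cluster node names to the ports/names the fence device uses;
# "ipmi-fence" and the mappings are placeholders for illustration.
pcs stonith update ipmi-fence \
    pcmk_host_map="node1.example.com:1;node2.example.com:2"
```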
– Take down a network to see what happens with your node. How you would take a network down will depend on your specific configuration.
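One simple way to simulate a network failure, assuming a dedicated cluster interface (the interface name here is a placeholder), is to take that interface down. Do this from the node's console, since you will lose connectivity over that interface:

```shell
# Drop the interface carrying cluster traffic (example name eth1);
# run this from the console, not over the same interface.
ip link set eth1 down
```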
If fencing is not successful, check for the following common configuration errors:
– Check whether the password and user name for the device include any special characters that could be misinterpreted by the bash shell. Making sure that you enter passwords and usernames surrounded by quotation marks could address this issue.
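A quick shell check (hypothetical password shown) demonstrates why quoting matters: unquoted, characters such as $ and ! are interpreted by bash before the command ever sees them, while single quotes keep every character literal.

```shell
# A password containing shell metacharacters; single quotes keep
# every character literal, so nothing is expanded by bash.
PASS='p@$$w0rd!'
echo "$PASS"    # the value survives intact: p@$$w0rd!
```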
– Check whether you can connect to the device using the exact IP address or hostname you specified in the pcs stonith command. For example, if you give the hostname in the stonith command but test by using the IP address, that is not a valid test.
– If the protocol that your fence device uses is accessible to you, use that protocol to try to connect to the device. For example, many agents use ssh or telnet. You should try to connect to the device with the credentials you provided when configuring the device, to see if you get a valid prompt and can log in to the device.
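For an IPMI-based device, for example, you can test the same address and credentials outside the cluster with ipmitool; the values below are placeholders:

```shell
# Query power status directly, using the same address and
# credentials that were given to the fence agent.
ipmitool -I lanplus -H 10.0.0.101 -U admin -P 'secret' chassis power status
```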
If you determine that all your parameters are appropriate but you still have trouble connecting to your fence device, you can check the logging on the fence device itself, if the device provides that; the log will show whether the user has connected and what commands the user issued. You can also search through the /var/log/messages file for instances of stonith and error, which can give some idea of what is transpiring; some agents can provide additional information as well.
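The log search described above can be done with grep (the log path assumes a RHEL system logging to /var/log/messages):

```shell
# Search the system log for fencing-related entries
grep -iE 'stonith|error' /var/log/messages
```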