Resources can fail for multiple reasons. An administrator might have used incorrect settings when defining the resource, a configuration file might have an error in it, the system might be trying to start a resource that does not exist, or some other unforeseen issue might occur.
Whenever a resource fails, the cluster increments the failcount for that resource. This count can be viewed with the command:
# pcs resource failcount show [RESOURCE]
Failure to start or stop will immediately set the failcount for a resource to INFINITY, forcing it to move to a different node. If fencing is enabled, a node that failed to stop a resource will also be fenced.
Resources can also be configured to relocate to a different node after N failures by setting the meta option migration-threshold=N when creating or modifying the resource. By default, resources will not migrate unless their failcount reaches INFINITY.
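As an illustration of setting the threshold on an existing resource, the following is a minimal sketch rather than a definitive procedure. The function name set_migration_threshold and the resource name webserver are hypothetical; the sketch assumes the resource already exists and that pcs resource update accepts the meta option as described above.

```shell
#!/bin/sh
# Sketch: set migration-threshold on an existing cluster resource.
# The function name and the "webserver" resource in the usage note are
# hypothetical; adjust them to the local cluster.
set_migration_threshold() {
    resource=$1
    threshold=$2
    # Reject anything that is empty, non-numeric, or zero.
    case $threshold in
        ''|*[!0-9]*|0)
            echo "threshold must be a positive integer" >&2
            return 1
            ;;
    esac
    pcs resource update "$resource" meta migration-threshold="$threshold"
}
```

For example, set_migration_threshold webserver 3 would ask the cluster to move the hypothetical webserver resource away from a node after three failures there.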
Troubleshooting resource failures
There are a number of steps an administrator can take when troubleshooting a resource failure:
- Inspect the log files for the affected resources.
- Inspect the cluster log files.
- Verify resource configuration.
- Verify configuration files.
- Attempt to manually start the resource with debug-start.
Inspect log files
One of the first actions an administrator will likely take is inspecting the log files, both those of the cluster itself and any that the affected resource might generate. Be careful when reading these log files: the actual error might be a single small warning, while the failure cascade that follows it might generate far more logging.
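Because the root cause is often an early, easily missed message, it can help to look at the first warnings and errors that mention the resource rather than the most recent ones. The helper below is a rough sketch: the function name first_resource_errors is hypothetical, and the log file path is passed in as an argument because its location differs between releases (for example, /var/log/cluster/corosync.log or /var/log/pacemaker/pacemaker.log; treat either path as an assumption to verify on the system at hand).

```shell
#!/bin/sh
# Sketch: print the earliest warning/error lines that mention a resource.
# Usage: first_resource_errors LOGFILE RESOURCE [COUNT]
# Each output line is prefixed with its line number in LOGFILE, so the
# surrounding context can be looked up afterwards.
first_resource_errors() {
    logfile=$1
    resource=$2
    count=${3:-5}
    # Match the resource name literally, then keep only lines that look
    # like warnings or errors, and stop after the first COUNT hits.
    grep -n -F -- "$resource" "$logfile" \
        | grep -i -E 'warn|error' \
        | head -n "$count"
}
```

The text filter is deliberately crude; a resource agent may report problems using other wording, so an empty result does not prove that the log is clean.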
Verify resource configuration
Using the command pcs resource show --full or pcs resource show [RESOURCE], inspect the cluster configuration for the failed resource. Small typos or missed options here might have a drastic effect.
Verify configuration files
If the affected resource has its own configuration file, like an apache resource, it might also have its own configuration file validation tool, like apachectl configtest. Confirming that the configuration is valid, and that the service can start with it, eliminates a whole slew of potential failures.
Manually starting a resource with debug-start
When a resource has reached a failcount of INFINITY on all nodes, the cluster will no longer attempt to start that resource automatically. An administrator can still attempt to start the resource with the command pcs resource debug-start [RESOURCE]. This will result in a short status output on both failure and success. Adding the --full option will generate full debugging output, which can assist troubleshooting.
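The full debugging output can be long, so it may be convenient to save it to a file for later review. The following sketch does that; the function name debug_start_logged and the default output path under /tmp are hypothetical choices, not part of pcs.

```shell
#!/bin/sh
# Sketch: run debug-start with full output and keep a copy of the result.
# Usage: debug_start_logged RESOURCE [OUTFILE]
debug_start_logged() {
    resource=$1
    outfile=${2:-/tmp/debug-start-$resource.log}
    # Capture both stdout and stderr of the debug run in OUTFILE.
    pcs resource debug-start "$resource" --full > "$outfile" 2>&1
    status=$?
    echo "debug-start for $resource exited with status $status; output in $outfile"
    # Propagate the exit status of pcs to the caller.
    return "$status"
}
```

Calling debug_start_logged webserver (with webserver standing in for a real resource name) would leave the debugging output in /tmp/debug-start-webserver.log.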
Fixing resource failures
Updating a cluster resource definition will automatically reset the failcount for that resource, enabling its use on the cluster again.
When the fix was made outside of the cluster configuration (for example, by updating a service configuration file), the failcount will remain, preventing the resource from being started. In those cases, an administrator can run the command pcs resource failcount reset [RESOURCE] to manually reset the failcount, enabling the resource.
# pcs resource failcount reset [RESOURCE]
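Resetting the failcount before the underlying problem is fixed only lets the resource fail again, so one possible workflow is to run a validation command first and reset only if it succeeds. The sketch below assumes such a validation command exists for the service (apachectl configtest in the apache example above); the function name reset_after_check is hypothetical.

```shell
#!/bin/sh
# Sketch: reset a resource's failcount only after a validation command passes.
# Usage: reset_after_check RESOURCE VALIDATION_COMMAND [ARGS...]
reset_after_check() {
    if [ "$#" -lt 2 ]; then
        echo "usage: reset_after_check RESOURCE VALIDATION_COMMAND [ARGS...]" >&2
        return 2
    fi
    resource=$1
    shift
    # Run the remaining arguments as the validation command.
    if "$@"; then
        pcs resource failcount reset "$resource"
    else
        echo "validation failed; failcount for $resource left unchanged" >&2
        return 1
    fi
}
```

For an apache resource with the hypothetical name webserver, this could be invoked as reset_after_check webserver apachectl configtest.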