RHEL 6 Cluster Suite

This article gives a basic, understandable overview of Red Hat Cluster Suite (RHCS) in the context of HA (failover) clusters, as provided by the High Availability Add-On for RHEL 6, along with a handful of troubleshooting and administration tips using the command-line utilities. I tried to stay away from unnecessary or overly detailed information.

Introduction

HA clusters provide highly available services by eliminating single points of failure (SPOFs) and by failing over services from one cluster node to another when a node becomes inoperative.

Red Hat’s High Availability Add-On consists of the following major components:

  • Cluster Infrastructure (provides fundamental functions such as configuration, membership, lock management and fencing)
  • Service Management (provides failover of services)
  • Administration Tools

Cluster Infrastructure

cman

CMAN manages cluster quorum and cluster membership. It runs on each node, tracking quorum by counting the cluster nodes and tracking membership by monitoring messages from the other nodes. Quorum is computed by simple majority: more than half of the nodes in the cluster must be online and in communication with each other for quorum to exist. If quorum does not exist, the cluster cannot provide its services. Quorum also guards against split-brain, a serious condition that can lead to file system inconsistency. For background, read about the Two Generals’ Problem thought experiment and the Paxos algorithm (the most widely known consensus algorithm, which tolerates the failure of fewer than half of the participating nodes). The important point is that quorum cannot literally prevent a split-brain from happening; rather, it decides which group of nodes is dominant and allowed to function as the cluster. Should a split-brain occur, quorum prevents the group without a majority from doing anything.
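
The simple-majority rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not cman's actual code; it assumes every node carries exactly one vote, and the function names are made up for this example.

```python
def quorum_threshold(total_votes):
    """Smallest number of votes that is more than half of total_votes."""
    return total_votes // 2 + 1


def is_quorate(online_votes, total_votes):
    """True when the online partition holds a simple majority of all votes."""
    return online_votes >= quorum_threshold(total_votes)


if __name__ == "__main__":
    # A 5-node cluster split 3/2: only the 3-node side keeps quorum.
    for online in (3, 2):
        state = "quorate" if is_quorate(online, 5) else "not quorate"
        print(online, "of 5 votes online:", state)
```

Note that in an even split (for example 2 of 4 votes on each side) neither partition reaches the threshold, which is one reason additional tie-breaking mechanisms exist.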

Quorum is determined by the exchange of messages among cluster nodes over Ethernet. Some additional checks can be made before deciding whether to fence a node that has lost its network connection (quorum disk, tie-breakers).

Corosync is used as the cluster communication layer (as opposed to openais from RHEL 5.x).

fenced

Fencing is the disconnection of a node from the cluster’s shared storage. It is the ultimate cure for the split-brain condition: fencing cuts off I/O from shared storage, thus ensuring data integrity. The cluster infrastructure performs fencing through the fence daemon, fenced. When CMAN determines that a node has failed, it notifies the other cluster-infrastructure components. fenced, when notified of the failure, fences the failed node. The other cluster-infrastructure components determine what actions to take, that is, they perform any recovery that needs to be done. For example, DLM and GFS2, when notified of a node failure, suspend activity until they detect that fenced has completed fencing the failed node.

Upon confirmation that the failed node is fenced, DLM and GFS2 perform recovery. DLM releases locks of the failed node; GFS2 recovers the journal of the failed node.

Two key elements in the cluster configuration file define a fencing method: fencing agent and fencing device. The fencing program makes a call to a fencing agent specified in the cluster configuration file. The fencing agent, in turn, fences the node via a fencing device. When fencing is complete, the fencing program notifies the cluster manager.
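
To make the agent/device relationship concrete, here is a small Python sketch that reads a cluster.conf-style fragment and resolves, for each node, which fence device and agent each method would use. The sample configuration (device name, IP address, credentials) is invented for illustration, and fence_plan is a hypothetical helper, not part of any cluster tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal cluster.conf fragment; all values are made up.
SAMPLE_CONF = """\
<cluster name="MY_CLUSTER" config_version="1">
  <clusternodes>
    <clusternode name="node01" nodeid="1">
      <fence>
        <method name="power">
          <device name="ipmi_node01"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="ipmi_node01" agent="fence_ipmilan"
                 ipaddr="192.0.2.11" login="admin" passwd="secret"/>
  </fencedevices>
</cluster>
"""


def fence_plan(conf_xml):
    """Map node name -> list of (method name, [(device name, agent), ...])."""
    root = ET.fromstring(conf_xml)
    # Fence devices are declared once and referenced by name from each node.
    agents = {d.get("name"): d.get("agent") for d in root.iter("fencedevice")}
    plan = {}
    for node in root.iter("clusternode"):
        methods = []
        for method in node.findall("./fence/method"):
            devices = [(dev.get("name"), agents.get(dev.get("name")))
                       for dev in method.findall("device")]
            methods.append((method.get("name"), devices))
        plan[node.get("name")] = methods
    return plan


if __name__ == "__main__":
    print(fence_plan(SAMPLE_CONF))
```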

The High Availability Add-On provides a variety of fencing methods:

  • Power fencing – a fencing method that uses a power controller to power off an inoperable node.
  • Fibre Channel switch fencing – a fencing method that disables the Fibre Channel port that connects storage to an inoperable node.
  • Other fencing – several other fencing methods that disable I/O or power of an inoperable node, including IBM Bladecenters, PAP, DRAC/MC, HP ILO, IPMI, IBM RSA II, and others.

The way in which a fencing method is specified depends on whether a node has dual power supplies or multiple paths to storage. If a node has dual power supplies, the fencing method for the node must specify at least two fencing devices, one for each power supply. Similarly, if a node has multiple paths to Fibre Channel storage, the fencing method must specify one fencing device for each path.

dlm_controld

Lock management is a common cluster-infrastructure service that provides a mechanism for other cluster infrastructure components to synchronize their access to shared resources. GFS2 and CLVM use locks from the lock manager. rgmanager uses DLM to synchronize service states.

Configuration Management

The cluster configuration file is located at /etc/cluster/cluster.conf. The configuration file is an XML file that describes the following cluster characteristics:

  • Cluster name – cluster name, cluster.conf revision level, basic fence timing properties
  • Cluster – each cluster node, with its node name, node ID, number of quorum votes and fencing method
  • Fence device – each fence device, with parameters such as IP address, user, password…
  • Managed resources – the resources required to create cluster services: failover domains, resources (an IP address, for example) and services
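
As a rough illustration of that layout, the following Python sketch parses a tiny cluster.conf-style document and pulls out the pieces listed above. The sample file is invented (names, version number and address are placeholders), summarize is a hypothetical helper, and the sketch assumes a node without a votes attribute counts as one vote.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real cluster.conf will contain more detail.
SAMPLE_CONF = """\
<cluster name="MY_CLUSTER" config_version="7">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="node01" nodeid="1" votes="1"/>
    <clusternode name="node02" nodeid="2"/>
  </clusternodes>
  <fencedevices/>
  <rm>
    <failoverdomains>
      <failoverdomain name="MY_DOMAIN">
        <failoverdomainnode name="node01"/>
        <failoverdomainnode name="node02"/>
      </failoverdomain>
    </failoverdomains>
    <service name="MY_SERVICE" domain="MY_DOMAIN">
      <ip address="192.0.2.100"/>
    </service>
  </rm>
</cluster>
"""


def summarize(conf_xml):
    """Return the cluster name, revision, per-node votes and service names."""
    root = ET.fromstring(conf_xml)
    return {
        "name": root.get("name"),
        "config_version": int(root.get("config_version")),
        # Assumption: a missing votes attribute means one vote.
        "nodes": {n.get("name"): int(n.get("votes", "1"))
                  for n in root.iter("clusternode")},
        "services": [s.get("name") for s in root.findall("./rm/service")],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_CONF))
```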

 

Service Management

rgmanager

rgmanager implements cold failover for off-the-shelf applications, that is, applications that do not need to be customized to run in a cluster. It allows administrators to define, configure, and monitor cluster services. A cluster service comprises cluster resources, which are building blocks that you create and manage in the cluster configuration file: for example, an IP address, an application initialization script, or a GFS2 shared partition. In the event of a node failure, rgmanager relocates the clustered service to another node with minimal service disruption.

There are various processes and agents that combine to make rgmanager work.

You can associate a cluster service with a failover domain. A failover domain is a subset of cluster nodes that are eligible to run a particular cluster service.
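
The idea of a failover domain as an eligibility filter can be sketched as follows. This is a deliberately simplified model, not rgmanager's real placement logic (it ignores domain options such as ordering or restriction), and both function names are hypothetical.

```python
def eligible_nodes(domain_members, online_nodes):
    """Domain members that are currently online, kept in domain order."""
    online = set(online_nodes)
    return [node for node in domain_members if node in online]


def pick_target(domain_members, online_nodes, current_owner=None):
    """Return the first eligible node other than the current owner, or None."""
    for node in eligible_nodes(domain_members, online_nodes):
        if node != current_owner:
            return node
    return None


if __name__ == "__main__":
    domain = ["node01", "node02"]
    # node01 has failed; node03 is online but is not part of the domain.
    print(pick_target(domain, ["node02", "node03"], current_owner="node01"))
```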

There are five service operations, available as options to the clusvcadm command, that an administrator can invoke:

  • enable (start the service; if the start fails, the service is relocated to another node)
  • disable (stop the service and place it in the disabled state; this is the only operation permitted when the service is in the failed state)
  • relocate (move the service to another node)
  • stop (stop the service and place it in the stopped state)
  • migrate (move a virtual machine service to another node)

A service can be in one of five states:

  • disabled (the service stays down until an administrator re-enables it)
  • failed (the service is presumed dead, usually because a stop operation failed; the administrator must check that no resources are still allocated, such as mounted file systems, before issuing a disable request)
  • stopped (a temporary state)
  • recovering (the cluster is trying to recover the service; recovery can be interrupted by disabling the service)
  • started
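
One rule in these lists is easy to get wrong under pressure: from the failed state, only disable is accepted. A wrapper script could guard against that with a check like the sketch below. It encodes only that single rule from this article; check_operation and the two tables are hypothetical names, and rgmanager itself enforces more rules than this.

```python
# The five operations listed above.
OPERATIONS = {"enable", "disable", "relocate", "stop", "migrate"}

# States from which exactly one operation is accepted (per this article).
ONLY_ALLOWED_FROM = {"failed": "disable"}


def check_operation(state, operation):
    """Raise ValueError if the operation is unknown or refused in this state."""
    if operation not in OPERATIONS:
        raise ValueError("unknown operation: %s" % operation)
    only = ONLY_ALLOWED_FROM.get(state)
    if only is not None and operation != only:
        raise ValueError("service is %s: only '%s' is permitted" % (state, only))
    return True


if __name__ == "__main__":
    print(check_operation("failed", "disable"))
    try:
        check_operation("failed", "enable")
    except ValueError as err:
        print("refused:", err)
```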

Administration Tools

Conga

Conga is a user interface for installing, configuring and managing clusters. It has two components:

  • luci – application server that provides the web interface
  • ricci – software daemon that manages the distribution of the cluster configuration. In RHEL 6, ricci replaces ccsd. Users define the configuration using the luci interface, and it is passed to corosync for distribution to the cluster nodes. ricci must run on every cluster node.

Command Line Tools

There are a few handy CLI tools at the administrator’s disposal for checking and managing the cluster.

Starting the cluster software on a node

Run commands in the following order:

# service cman start
# service rgmanager start

Stopping the cluster software on a node

Run commands in the following order:

# service rgmanager stop
# service cman stop

Stopping the cluster software on a node causes its HA services to fail over to another node. As an alternative, consider relocating or migrating the HA services to another node with the clusvcadm command before stopping the cluster software.

clustat

Use the clustat utility to display cluster-wide status, such as membership information, quorum and state of services:

# clustat -l
Cluster Status for MY_CLUSTER @ Mon Jun 20 15:59:48 2011
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 node01                                                             1 Online, Local, rgmanager
 node02                                                             2 Online, rgmanager

Service Information
------- -----------

Service Name      : service:MY_SERVICE
  Current State   : started (112)
  Owner           : node01
  Last Owner      : node01
  Last Transition : Tue Jun 14 09:46:25 2011

Service status can be one of the following:

  • Started
  • Recovering – The service is pending start on another node.
  • Disabled – a disabled service is never started automatically by the cluster.
  • Stopped – temporary state, the service will be evaluated for starting after the next service or node transition. You may disable or enable the service from this state.
  • Failed – the service is dead. A service is placed into this state whenever a resource’s stop operation fails. After a service is placed into this state, you must verify that there are no resources allocated (mounted file systems, for example) prior to issuing a disable request. The only operation that can take place when a service has entered this state is disable.
  • Uninitialized
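
For simple monitoring, the service section of the clustat -l output shown above can be parsed with a few lines of Python. This sketch works on the text format exactly as it appears in this article, embedded here as a sample string; parse_services and not_started are hypothetical helper names, and the parsing may need adjusting if your clustat version prints a different layout.

```python
# Service section copied from the clustat -l output shown above.
SAMPLE = """\
Service Name      : service:MY_SERVICE
  Current State   : started (112)
  Owner           : node01
  Last Owner      : node01
  Last Transition : Tue Jun 14 09:46:25 2011
"""


def parse_services(text):
    """Return one dict per service, keyed by the field labels clustat prints."""
    services = []
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if not sep:
            continue  # not a "label : value" line
        key, value = key.strip(), value.strip()
        if key == "Service Name":
            services.append({key: value})
        elif services:
            services[-1][key] = value
    return services


def not_started(services):
    """Names of services whose state is anything other than 'started'."""
    return [s["Service Name"] for s in services
            if s.get("Current State", "").split()[:1] != ["started"]]


if __name__ == "__main__":
    parsed = parse_services(SAMPLE)
    print(parsed[0]["Owner"], not_started(parsed))
```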

clusvcadm

clusvcadm is a cluster administration utility that lets an administrator enable, disable, relocate and restart user services in a cluster. Some of the useful clusvcadm commands are:

  • clusvcadm -e <service_name> -m <member> – enable (start) the service on the given member.
  • clusvcadm -d <service_name> – stop the service and place it in the disabled state. This is the only permissible operation when a service is in the failed state.
  • clusvcadm -r <service_name> -m <member> – relocate the service to another node. If no permissible target node in the cluster successfully starts the service, the relocation fails and the cluster attempts to restart the service on the original owner. If the original owner cannot restart the service, the service is placed in the stopped state.
  • clusvcadm -s <service_name> – stop the service and place in stopped state.
  • clusvcadm -R <service_name> – restart the service.
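
If you drive clusvcadm from a script, it helps to build the argument list in one place. The sketch below only assembles the command line from the flags listed above and does not run anything; clusvcadm_argv is a hypothetical helper, and restricting -m to enable and relocate simply mirrors how the commands are shown in this article.

```python
# Flags exactly as listed above.
FLAGS = {
    "enable": "-e",
    "disable": "-d",
    "relocate": "-r",
    "stop": "-s",
    "restart": "-R",
}


def clusvcadm_argv(operation, service, member=None):
    """Build the argument list for a clusvcadm call without executing it."""
    argv = ["clusvcadm", FLAGS[operation], service]
    if member is not None:
        if operation not in ("enable", "relocate"):
            raise ValueError("-m is only used with enable and relocate here")
        argv += ["-m", member]
    return argv


if __name__ == "__main__":
    print(" ".join(clusvcadm_argv("relocate", "MY_SERVICE", member="node02")))
```

On a real cluster node the resulting list could be handed to Python's subprocess module; whether the call succeeds still depends on the current service state.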

cman_tool

cman_tool manages the cluster management subsystem. You can use it to join a node to a cluster, leave the cluster, kill another cluster node and so on. Some useful cman_tool commands are:

  • cman_tool version -r – distribute the new cluster.conf version to all the nodes
  • cman_tool debug -d <value> – set the debug level of the running cman daemon. Debug output is sent to syslog at level LOG_DEBUG. The -d switch specifies the new logging level as a bitmask built from the following values:
    • 2 Barriers
    • 4 Membership messages
    • 8 Daemon operation, including command-line interaction
    • 16 Interaction with Corosync
    • 32 Startup debugging (cman_tool join operations only)
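
Assuming the values above combine as a bitmask (so several kinds of debug output are enabled by adding their values), the number to pass to -d can be worked out as in this Python sketch. The short labels in DEBUG_FLAGS are names chosen for this example only.

```python
# Values from the list above; the labels are hypothetical shorthand.
DEBUG_FLAGS = {
    "barriers": 2,
    "membership": 4,
    "daemon": 8,
    "corosync": 16,
    "startup": 32,
}


def debug_mask(*names):
    """Combine the named debug categories into a single -d value."""
    mask = 0
    for name in names:
        mask |= DEBUG_FLAGS[name]
    return mask


def decode_mask(mask):
    """List the categories that are switched on in a -d value."""
    return [name for name, bit in DEBUG_FLAGS.items() if mask & bit]


if __name__ == "__main__":
    value = debug_mask("membership", "corosync")
    print("cman_tool debug -d", value, decode_mask(value))
```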

fence_tool

fence_tool controls and queries the fenced daemon. Some useful commands are:

  • fence_tool ls – display internal fenced state
  • fence_tool dump – print the internal fenced debug buffer to stdout

fence_node

fence_node fences a given node using the fence agent and parameters configured for that node in cluster.conf.

Troubleshooting

A cluster can be difficult to troubleshoot and diagnose, but there are some common issues that system administrators are likely to encounter when administering one. Here is a small list that will hopefully expand over time:

  • The cluster uses multicast for communication between nodes, so make sure that multicast traffic is not blocked, delayed or otherwise interfered with
  • Ensure that firewall rules are not blocking the traffic
  • Ensure that the interfaces used for cluster communication are not using an exotic bonding mode (mode 0, round-robin, is fine) or VLAN tagging
  • Use tcpdump on each node to check network traffic
  • Cluster services may hang, and cluster nodes may have different views of cluster membership. Sometimes it is necessary to reboot the nodes to get the cluster up and running again. These conditions can be checked in the following ways:
    • A fence operation may have failed (check the logs for failed fence messages)
    • Verify that the network is up
    • If some nodes have left the cluster, verify that the cluster is still quorate (if it is not, services and storage will hang)
  • A common problem is unusual failover behaviour: services may refuse to start on failover. Make sure you understand how the features and conditions of your cluster service may affect failover.
  • The root cause of a fence is almost always a node losing the token, meaning that it lost communication with the rest of the cluster and stopped returning heartbeats.
    • If a node does not pass on the token within the token timeout, it is fenced. The default token timeout is 10 seconds, and it can be changed in cluster.conf
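
When chasing unexpected fences it is useful to confirm which token timeout a node is actually configured with. The sketch below assumes the timeout is set through a totem element with a token attribute in milliseconds directly under the cluster element, and falls back to the 10-second default quoted above when it is absent; token_timeout_ms is a hypothetical helper and the sample documents are invented.

```python
import xml.etree.ElementTree as ET

DEFAULT_TOKEN_MS = 10000  # the 10-second default mentioned above


def token_timeout_ms(conf_xml):
    """Return the configured token timeout in milliseconds, or the default."""
    root = ET.fromstring(conf_xml)
    totem = root.find("totem")
    if totem is None or totem.get("token") is None:
        return DEFAULT_TOKEN_MS
    return int(totem.get("token"))


if __name__ == "__main__":
    tuned = '<cluster name="MY_CLUSTER" config_version="2"><totem token="30000"/></cluster>'
    plain = '<cluster name="MY_CLUSTER" config_version="1"/>'
    print(token_timeout_ms(tuned), token_timeout_ms(plain))
```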

 
