Important note:
Heartbeat is now obsolete and has moved to a new stack available on Clusterlabs. For a simple high-availability project using a virtual IP, try keepalived, which handles monitoring and failover with a single, simple configuration file.
Since version 2, Heartbeat can manage more than 2 nodes and no longer needs the "mon" utility to monitor services.
This functionality is now implemented within Heartbeat itself.
As a consequence, this flexibility and these new features may make it harder to configure.
It is worth knowing that version 1 configuration files are still supported.
Installation
Heartbeat source files are available from the official site http://www.linux-ha.org. Red Hat Enterprise compatible rpms can be downloaded from the CentOS website (the link is given on the Heartbeat website). Packages follow the sources fairly closely: at the time this article was written, the source version was 2.1.0 while the rpm was 2.0.8-2. 3 rpms are needed:
heartbeat-pils
heartbeat-stonith
heartbeat
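A minimal installation sketch, assuming the three rpms have been downloaded into the current directory (the file names below are only examples, adapt them to the version you fetched):
rpm -ivh heartbeat-pils-2.0.8-2.*.rpm heartbeat-stonith-2.0.8-2.*.rpm heartbeat-2.0.8-2.*.rpm
If the packages are available in a configured yum repository, yum install heartbeat heartbeat-pils heartbeat-stonith resolves the dependencies automatically.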
We are going to monitor Apache but the configuration remains valid for any other service: mail, database, DNS, DHCP, file server, etc…
Diagrams
Failover or load-balancing
Heartbeat supports Active-Passive for failover mode and Active-Active for load-balancing.
Many other setups are possible by adding servers and/or services; the combinations are numerous.
We will focus on load-balancing; only a few lines need to be removed for failover.
Configuration
In this setup, the 2 servers are interconnected through their eth0 interfaces.
Application traffic arrives on eth1, which is configured as in the diagram, with addresses 192.168.0.4 and 192.168.0.5.
The addresses on eth0 have to be in a sub-network dedicated to Heartbeat.
These addresses must appear in /etc/hosts on every node.
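A possible /etc/hosts excerpt, assuming 10.0.0.1 and 10.0.0.2 as the dedicated Heartbeat addresses on eth0 (these two values are examples only, not taken from the diagram); the names must match the output of "uname -n" on each node:
# Dedicated Heartbeat sub-network (eth0) - example addresses
10.0.0.1    n1.domain.com    n1
10.0.0.2    n2.domain.com    n2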
3 files must be configured for Heartbeat 2. Again, they are identical on each node:
ha.cf and authkeys, in /etc/ha.d/
cib.xml, in /var/lib/heartbeat/crm/
ha.cf
ha.cf contains the main settings like cluster nodes or the communication topology.
use_logd on
# Heartbeat packets sending frequency
keepalive 500ms # unit is seconds by default - specify ms for times shorter than 1 second
# Period of time after which a node is declared "dead"
deadtime 2
# Send a warning message
# Important to adjust the deadtime value
warntime 1
# Identical to deadtime but when initializing
initdead 8
udpport 694
# Host to be tested to check the node is still online
# The default gateway most likely
ping 192.168.0.1
# Medium used for Heartbeat packets:
# use "serial" for a serial port,
# "mcast" and "ucast" for multicast and unicast
bcast eth0
# Resource switches back on the primary node when it's available
auto_failback on
# Cluster Nodes List
node n1.domain.com
node n2.domain.com
# Activate Heartbeat 2 Configuration
crm yes
# Allow new nodes to be added dynamically to the cluster
autojoin any
Other options are available, such as compression or bandwidth settings for communication over a serial cable. Check http://linux-ha.org/ha.cf
authkeys
authkeys defines the authentication keys.
Several types are available: crc, md5 and sha1. crc should only be used on a secured sub-network (VLAN isolation or cross-over cable).
sha1 offers a higher level of security but can consume a lot of CPU resources.
md5 sits in between; it is not a bad choice.
auth 1
1 md5 secret
The password is stored in clear text, so it is important to change the file permissions to 600.
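For instance:
chmod 600 /etc/ha.d/authkeys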
cib.xml
<cib>
<configuration>
<crm_config/>
<nodes/>
<resources>
<group id="server1">
<primitive class="ocf" id="IP1" provider="heartbeat" type="IPaddr">
<operations>
<op id="IP1_mon" interval="10s" name="monitor" timeout="5s"/>
</operations>
<instance_attributes id="IP1_inst_attr">
<attributes>
<nvpair id="IP1_attr_0" name="ip" value="192.168.0.2"/>
<nvpair id="IP1_attr_1" name="netmask" value="24"/>
<nvpair id="IP1_attr_2" name="nic" value="eth1"/>
</attributes>
</instance_attributes>
</primitive>
<primitive class="lsb" id="apache1" provider="heartbeat" type="apache">
<operations>
<op id="jboss1_mon" interval="30s" name="monitor" timeout="20s"/>
</operations>
</primitive>
</group>
<group id="server2">
<primitive class="ocf" id="IP2" provider="heartbeat" type="IPaddr">
<operations>
<op id="IP2_mon" interval="10s" name="monitor" timeout="5s"/>
</operations>
<instance_attributes id="IP2_inst_attr">
<attributes>
<nvpair id="IP2_attr_0" name="ip" value="192.168.0.3"/>
<nvpair id="IP2_attr_1" name="netmask" value="24"/>
<nvpair id="IP2_attr_2" name="nic" value="eth1"/>
</attributes>
</instance_attributes>
</primitive>
<primitive class="lsb" id="apache2" provider="heartbeat" type="apache">
<operations>
<op id="jboss2_mon" interval="30s" name="monitor" timeout="20s"/>
</operations>
</primitive>
</group>
</resources>
<constraints>
<rsc_location id="location_server1" rsc="server1">
<rule id="best_location_server1" score="100">
<expression_attribute="#uname" id="best_location_server1_expr" operation="eq"
value="n1.domain.com"/>
</rule>
</rsc_location>
<rsc_location id="location_server2" rsc="server2">
<rule id="best_location_server2" score="100">
<expression_attribute="#uname" id="best_location_server2_expr" operation="eq"
value="n2.domain.com"/>
</rule>
</rsc_location>
<rsc_location id="server1_connected" rsc="server1">
<rule id="server1_connected_rule" score="-INFINITY" boolean_op="or">
<expression id="server1_connected_undefined" attribute="pingd"
operation="not_defined"/>
<expression id="server1_connected_zero" attribute="pingd" operation="lte"
value="0"/>
</rule>
</rsc_location>
<rsc_location id="server2_connected" rsc="server2">
<rule id="server2_connected_rule" score="-INFINITY" boolean_op="or">
<expression id="server2_connected_undefined" attribute="pingd"
operation="not_defined"/>
<expression id="server2_connected_zero" attribute="pingd" operation="lte"
value="0"/>
</rule>
</rsc_location>
</constraints>
</configuration>
</cib>
It’s possible to generate the file from a Heartbeat 1 configuration file, haresources located in /etc/ha.d/, with the following command:
python /usr/lib/heartbeat/haresources2cib.py > /var/lib/heartbeat/crm/cib.xml
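Whether the file is generated or written by hand, it can be checked for errors before starting Heartbeat, for instance with:
crm_verify -x /var/lib/heartbeat/crm/cib.xml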
The file can be split into 2 parts: resources and constraints.
Resources
Resources are organized in groups (server1 and server2), each putting together a virtual IP address and a service: Apache.
Resources are declared with the <primitive> syntax within the group.
Groups are useful to gather several resources under the same constraints.
The IP1 primitive brings up virtual IP 1 and checks that it is reachable.
It executes the OCF IPaddr script.
OCF scripts are provided with Heartbeat in the rpm packages.
The virtual address, the network mask and the interface can also be specified.
The Apache resource type is LSB, meaning it calls a startup script located in the usual /etc/init.d.
The script's name is given in the type attribute: type="name".
In order to run with Heartbeat, the script must be LSB compliant, which means the script must:
- return its status with “script status”
- not fail “script start” on a service that is already running
- not fail stopping a service already stopped
All LSB specifications can be checked at http://www.linux-ha.org/LSBResourceAgent
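A minimal sketch of what such a script looks like, for a hypothetical "mydaemon" service (names and paths are illustrative only):
#!/bin/sh
# Minimal LSB-style init script sketch for a hypothetical "mydaemon" service
NAME=mydaemon
DAEMON=/usr/sbin/$NAME

case "$1" in
  start)
    # starting an already running service must not fail
    pidof $NAME >/dev/null && exit 0
    $DAEMON
    ;;
  stop)
    # stopping an already stopped service must not fail
    pidof $NAME >/dev/null || exit 0
    killall $NAME
    ;;
  status)
    # "script status" must report the real state
    if pidof $NAME >/dev/null; then
      echo "$NAME is running"
      exit 0
    else
      echo "$NAME is stopped"
      exit 3
    fi
    ;;
  *)
    echo "Usage: $0 {start|stop|status}"
    exit 1
    ;;
esac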
The following time values can be defined:
- interval: defines how often the resource’s status is checked
- timeout: defines the time period before considering a start, stop or status action failed
Constraints
2 constraints apply to each group of resources:
The favourite location where resources “should” run.
We give a score of 100 to n1.domain.com for the 1st group.
Hence, if n1.domain.com is active and the auto_failback option is set to "on", resources in this group will always come back to it.
The action taken depending on the ping result: if none of the gateways answers ping packets, resources move to another server and the node goes into standby status.
A score of -INFINITY means the node will never accept these resources while the gateways are unreachable.
Important Notes
Files Rights
The /etc/ha.d/ directory contains sensitive data. Permissions have to be changed so that files are accessible by their owner only, or the application will not start.
chmod 600 /etc/ha.d/ha.cf
The /var/lib/heartbeat/crm/cib.xml file has to belong to user hacluster and group haclient. It must not be accessible to other users.
chown hacluster:haclient /var/lib/heartbeat/crm/cib.xml
chmod 660 /var/lib/heartbeat/crm/cib.xml
cib.xml is written to by Heartbeat, and companion files are created alongside it.
If you need to edit it manually, stop Heartbeat on all nodes, remove cib.xml.sig in the same directory, edit cib.xml on all nodes, and restart Heartbeat.
It is advised to use crm_resource to make modifications (see the section below).
hosts files
The /etc/hosts file must contain all the cluster nodes' hostnames. These names must be identical to the output of the 'uname -n' command.
Services Startup
Heartbeat starts Apache when the service is inactive. It is better to disable Apache's automatic startup (with chkconfig, for instance): since the machine may take a while to boot, Heartbeat could otherwise start the service a second time.
chkconfig httpd off
There is a startup script in /etc/init.d/. To launch Heartbeat, run "/etc/init.d/heartbeat start". You can launch Heartbeat automatically at boot time:
chkconfig heartbeat on
Behaviour and Tests
The following actions simulate incidents that may occur at some stage, and alter the cluster’s health. We study Heartbeat’s behaviour in each case.
Apache shutdown
The local Heartbeat will restart Apache. If the restart fails, a warning is sent to the logs and Heartbeat will not launch the service again. The virtual IP address remains on the node; however, the address can be moved manually with the Heartbeat tools.
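For example, the whole group can be moved by hand (a hedged sketch; on this Heartbeat version the target host is given with -H):
crm_resource -M -r server1 -H n2.domain.com
# later, remove the constraint created by the migration so the group can move freely again
crm_resource -U -r server1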
Server’s Shutdown or Crash
Virtual addresses move to other servers.
Heartbeat’s Manual Shutdown on one Node
Stopping Heartbeat on a node stops its local resources; virtual addresses are moved to the other nodes. The normal procedure is to run crm_standby to turn the node into standby mode and migrate resources across.
Node to node cable disconnection
Each machine thinks it is on its own and takes the 2 virtual IPs. However, it is not really a problem, as the gateway keeps sending packets to the last entry in its ARP table. A ping to the broadcast address returns duplicate replies.
Gateway disconnection
This could happen in 2 cases:
– The gateway is unreachable. In this case, the 2 nodes remove their virtual IP address and stop Apache.
– The connection to one of the nodes is down. Addresses move to the other server which can normally reach the gateway.
All these simulations keep the cluster up, except when the gateway itself is gone. But that is not a cluster problem…
Tools
Heartbeat's health can be checked in several ways.
Logs
Heartbeat sends messages to the logd daemon that stores them in the /var/log/messages system file.
Unix Tools
Usual Unix commands can be used. Heartbeat creates sub-interfaces that you can check with "ifconfig" or "ip address show". Process status can be displayed with the startup scripts or with the "ps" command, for instance.
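For example (the eth1:0 label is the usual form of the alias created by IPaddr, but it may differ):
# the virtual IP appears as a sub-interface / secondary address on eth1
ip address show eth1
ifconfig eth1:0
# check the Heartbeat processes
ps -ef | grep heartbeat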
Heartbeat Commands
Heartbeat is provided with a set of commands. Here are the main ones:
- crmadmin: Controls the node managers on each machine of the cluster.
- crm_mon: Quick and useful; displays the status of nodes and resources.
- crm_resource: Queries and modifies resource/service related data. Can list, migrate, disable or delete resources.
- crm_verify: Reports the cluster's warnings and errors.
- crm_standby: Migrates all of a node's resources. Useful when upgrading.
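A few typical invocations (the exact options may vary slightly between versions):
# one-shot display of nodes and resources
crm_mon -1
# list the resources defined in the cluster
crm_resource -L
# check the live configuration for errors
crm_verify -L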
Maintenance
Service Shutdown
When upgrading or shutting Apache down, you may proceed as follows:
- Stop Apache on node 1 and migrate the virtual address on node 2:
crm_standby -U n1.domain.com -v true
- Start upgrading n1 and restart the node:
crm_standby -U n1.domain.com -v false
- Resources automatically fail back to n1 (if cib.xml is properly configured)
Proceed the same way on n2
Note: the previous commands can be run from any node of the cluster.
It is possible to switch both nodes to standby at the same time; Apache will then be stopped on the 2 machines and the virtual addresses removed until a node is set back to active mode.
Machine Reboot
It is not required to switch the node to standby mode before rebooting. However, it is better practice, as resources migrate faster when there is no failure-detection delay.
Resource Failure
Example: Apache crashed and does not restart anymore; all the group's resources are moved to the 2nd node.
When the issue is resolved and the resource is up again, the cluster’s health can be checked with:
crm_resource --reprobe (-P)
and the resource restarted:
crm_resource --cleanup --resource apache1 (or crm_resource -C -r apache1).
It will move automatically back to the original server.
Adding a New Node to the cluster
In Standby
If the cluster currently contains 2 nodes connected with a cross-over cable, you will need a switch for the Heartbeat network interfaces.
You first need to add the new node's information. Edit the /etc/hosts files on the current nodes and add the new node's hostname, then copy the /etc/hosts contents across to the new node. Configure Heartbeat in the same way as on the other nodes.
ha.cf files must contain the “autojoin any” setting to accept new nodes on the fly.
On the new host, start Heartbeat; it should join the cluster automatically.
If not, run the following on one of the 1st nodes:
/usr/lib/heartbeat/hb_addnode n3.domain.com
The new node only acts as a failover node; no service is associated with it. If n1.domain.com goes into standby, resources will move to n3. They will come back to the original server as soon as it comes back up (n1 being the favourite server).
With a New Service
To add a 3rd IP (along with a 3rd Apache), you need to follow the above procedure and then:
– Either stop Heartbeat on the 3 servers and edit the cib.xml files
– Or build files similar to cib.xml, containing the new resources and constraints to be added, and add them to the running cluster. This is the preferred method. Create the following files on one of the nodes:
newGroup.xml
<group id="server3">
<primitive class="ocf" provider="heartbeat" type="IPaddr" id="IP3">
<operations>
<op id="IP3_mon" interval="5s" name="monitor" timeout="2s"/>
</operations>
<instance_attributes id="IP3_attr">
<attributes>
<nvpair id="IP3_attr_0" name="ip" value="192.168.0.28"/>
<nvpair id="IP3_attr_1" name="netmask" value="29"/>
<nvpair id="IP3_attr_2" name="nic" value="eth1"/>
</attributes>
</instance_attributes>
</primitive>
<primitive class="lsb" provider="heartbeat" type="apache" id="apache3">
<operations>
<op id="apache3_mon" interval="5s" name="monitor" timeout="5s"/>
</operations>
</primitive>
</group>
newLocationConstraint.xml
<rsc_location id="location_server3" rsc="server3">
<rule id="best_location_server3" score="100">
<expression attribute="#uname" id="best_location_server3_expr" operation="eq"
value="n3.domain.com"/>
</rule>
</rsc_location>
newPingConstraint.xml
<rsc_location id="server3_connected" rsc="server3">
<rule id="server3_connected_rule" score="-INFINITY" boolean_op="or">
<expression id="server3_connected_undefined" attribute="pingd"
operation="not_defined"/>
<expression id="server3_connected_zero" attribute="pingd" operation="lte"
value="0"/>
</rule>
</rsc_location>
Add the n3 constraints:
cibadmin -C -o constraints -x newLocationConstraint.xml
cibadmin -C -o constraints -x newPingConstraint.xml
Add the n3 resources:
cibadmin -C -o resources -x newGroup.xml
n3 resources should start right away.
Note: constraints can only be added one at a time. If you try to add 2 constraints from within the same file, only the first will be set.
Tags: cluster, high availability, linux