This article focuses on one method of debugging heartbeat. It covers the following key points:
- Some basic thoughts on how to diagnose a network bug in a cluster
- How to handle a layer 2 network problem
- Some frequently used commands or tools: tcpdump, arp, etc.
Prerequisites
First of all, assume that we have two nodes with heartbeat installed. If you are not familiar with heartbeat, here are some useful resources:
- Heartbeat Official User Guide
- Rakuten DBA HeartBeat Setting manual
Environment
As mentioned above, we have two nodes with heartbeat already installed. Let us assume their IP addresses are as listed below:
192.168.40.64
Before starting them, we need to set up their configuration files. Generally speaking, these include:
ha.cf
For ha.cf, the configuration is listed below:
Most fields in ha.cf are easy to understand, so we only pick out a few to describe. ‘crm off’ means the Cluster Resource Manager (crm) is off. ‘auto_failback off’ means that once the standby has taken over, it keeps the resources whether or not the master recovers. ‘node node-0’ declares a node that takes part in the cluster; every node that connects through heartbeat must be listed, and ‘node-0’ is the server name, which you can get with ‘uname -n’.
The haresources file configures a virtual IP address so that the heartbeat nodes act as one server. Below is a configuration example:
‘192.168.40.1’ is the virtual IP. The most important point here: when the real IPs of your nodes and the virtual IP are not in the same subnet, add the netmask, which is ‘24’ in this configuration file.
The authkeys file stores how the heartbeat nodes authenticate each other. Generally speaking, there are three authentication types: ‘crc’, ‘md5’ and ‘sha1’. If your network is secure, ‘crc’ is enough. If not, use ‘sha1’. Here is an example:
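As a rough illustration of the authkeys layout described above, the following Python sketch renders the file content for one signing method. The helper name `render_authkeys`, the key index 1 and the randomly generated secret are assumptions made for this example, not values from the original setup.

```python
import secrets

VALID_METHODS = {"crc", "md5", "sha1"}


def render_authkeys(method: str, key: str = "", index: int = 1) -> str:
    """Return authkeys file content for a single signing method (hypothetical helper)."""
    if method not in VALID_METHODS:
        raise ValueError(f"unsupported auth method: {method}")
    if method == "crc":
        # crc takes no key: it detects corruption but does not authenticate
        entry = f"{index} crc"
    else:
        if not key:
            raise ValueError(f"{method} requires a key")
        entry = f"{index} {method} {key}"
    # "auth N" selects which numbered entry signs outgoing packets
    return f"auth {index}\n{entry}\n"


if __name__ == "__main__":
    # The secret here is random; in a real cluster every node needs the same key.
    print(render_authkeys("sha1", secrets.token_hex(16)), end="")
```

The same content has to be present on every node, and heartbeat expects the file to be readable only by its owner (mode 600).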
Some useful commands for starting and stopping heartbeat as a service:
sudo service heartbeat start
But always remember that this is only useful on Ubuntu with heartbeat installed as a service.
Debug
Frankly speaking, if you set up your configuration files as in the examples above, you may well configure heartbeat successfully without any debugging. But sometimes it just does not work, so we still need to learn how to debug heartbeat.
Logs
Every mature piece of software or service should have logs, and heartbeat is no exception. The log locations are specified right in the ha.cf config file.
tail -f /var/log/ha-debug
Whether you use ‘tail -f’ or ‘vim’ is up to you.
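Since the log locations come from ha.cf, a small Python sketch can pull them out before you start tailing. It assumes the ‘debugfile’ and ‘logfile’ directives are used; the sample excerpt, including the /var/log/ha-log path, is hypothetical, and `log_paths` is a name invented for this example.

```python
def log_paths(ha_cf_text: str) -> dict:
    """Collect the debugfile / logfile directives from ha.cf content."""
    paths = {}
    for raw in ha_cf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("debugfile", "logfile"):
            paths[parts[0]] = parts[1]
    return paths


# Hypothetical ha.cf excerpt, used only to exercise the parser.
SAMPLE = """\
# log settings
debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 2
"""

if __name__ == "__main__":
    for name, path in log_paths(SAMPLE).items():
        print(f"{name}: {path}")
```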
Virtual IP Interface
When heartbeat starts, it tries to set up a virtual IP address, which is specified in the haresources file. We can easily observe this with the ‘ifconfig’ command. If it reports an error, here is a useful command for debugging:
sudo ifconfig eth0:100 192.168.40.1 netmask 255.255.255.0 broadcast 192.168.40.255
This command sets up a virtual IP interface with the specified IP 192.168.40.1, along with its netmask and broadcast address.
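The netmask and broadcast values in that command follow directly from the virtual IP and its ‘24’ prefix. A minimal Python sketch using the standard `ipaddress` module shows the derivation and also checks whether a node IP shares the subnet; the function names are made up for this example.

```python
import ipaddress


def alias_params(vip_with_prefix: str) -> dict:
    """Derive the address, netmask and broadcast for a virtual IP in CIDR form."""
    iface = ipaddress.ip_interface(vip_with_prefix)
    return {
        "address": str(iface.ip),
        "netmask": str(iface.network.netmask),
        "broadcast": str(iface.network.broadcast_address),
    }


def same_subnet(vip_with_prefix: str, node_ip: str) -> bool:
    """True if the node's real IP lies inside the virtual IP's network."""
    network = ipaddress.ip_interface(vip_with_prefix).network
    return ipaddress.ip_address(node_ip) in network


if __name__ == "__main__":
    print(alias_params("192.168.40.1/24"))
    print(same_subnet("192.168.40.1/24", "192.168.40.64"))
```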
ARP
Using arp, we can find out whether the heartbeat nodes are connected with each other. The command looks like this:
arp -a
Every network interface has an ARP table, which stores mappings between IP addresses and MAC addresses. If the nodes are connected, we get a record like this:
node-0 (192.168.40.64) at 00:50:56:cc:ee:ff [ether] on eth0
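To check such records for several nodes at once, the output can be parsed. The Python sketch below assumes the Linux ‘arp -a’ line format shown above; the second sample line (an unresolved entry for 192.168.40.65) is hypothetical, as is the name `parse_arp_table`.

```python
import re

# Matches lines such as: node-0 (192.168.40.64) at 00:50:56:cc:ee:ff [ether] on eth0
ARP_LINE = re.compile(
    r"^(?P<host>\S+) \((?P<ip>[\d.]+)\) at (?P<mac>[0-9a-fA-F:]{17}) "
    r"\[(?P<hwtype>\w+)\] on (?P<iface>\S+)$"
)


def parse_arp_table(output: str) -> dict:
    """Map IP address -> MAC address for every resolved entry."""
    table = {}
    for line in output.splitlines():
        match = ARP_LINE.match(line.strip())
        if match:  # unresolved entries show "<incomplete>" and do not match
            table[match.group("ip")] = match.group("mac").lower()
    return table


SAMPLE = """\
node-0 (192.168.40.64) at 00:50:56:cc:ee:ff [ether] on eth0
? (192.168.40.65) at <incomplete> on eth0
"""

if __name__ == "__main__":
    print(parse_arp_table(SAMPLE))
```

A peer that is missing from the result, or only appears as incomplete, is a hint that the two nodes cannot reach each other at layer 2.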
SEND_ARP & TCPDUMP
Use send_arp to mock a virtual IP, and use tcpdump to detect whether any ARP requests appear on the network. Some commands are listed below:
send_arp 192.168.40.4 00:11:22:aa:bb:cc 192.168.40.4 ffffffffffff
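To make clearer what such an announcement looks like on the wire, here is a Python sketch that builds an ARP request frame from the same four values (source IP, source MAC, target IP, broadcast target MAC). It only constructs the bytes and does not send anything, and it is an illustration of the ARP layout rather than a reproduction of what send_arp does internally. The function name is an assumption.

```python
import socket
import struct


def build_gratuitous_arp(ip: str, mac: str) -> bytes:
    """Build an Ethernet + ARP request frame announcing `ip` at `mac`."""
    src_mac = bytes.fromhex(mac.replace(":", ""))
    if len(src_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    broadcast = b"\xff" * 6
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    ethernet = broadcast + src_mac + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, then target MAC/IP; sender IP == target IP makes it gratuitous
    arp += src_mac + ip_bytes + broadcast + ip_bytes
    return ethernet + arp


if __name__ == "__main__":
    frame = build_gratuitous_arp("192.168.40.4", "00:11:22:aa:bb:cc")
    print(len(frame), frame.hex())
```

Actually transmitting such a frame would need a raw socket and root privileges, which is why the command-line tool is the more practical choice while debugging.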
Use ‘tcpdump’ to capture every ARP request and response. Here is an example command:
sudo tcpdump -i eth0 arp
An example result is listed below:
07:46:07.651105 ARP, Request who-has 127.0.1.1 tell 127.0.1.1, length 46
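In a capture like this, a request whose ‘who-has’ and ‘tell’ addresses are identical is a gratuitous ARP, which is the kind of announcement expected when a virtual IP is brought up. The Python sketch below filters saved tcpdump text for such lines; the second and third sample lines are hypothetical, and the function name is invented for this example.

```python
import re

# tcpdump may print the target hardware address in parentheses after the target IP
REQUEST = re.compile(
    r"ARP, Request who-has (?P<target>[\d.]+)(?: \([0-9a-fA-F:]+\))? "
    r"tell (?P<sender>[\d.]+), length \d+"
)


def gratuitous_announcements(capture: str) -> list:
    """Return the IPs announced via gratuitous ARP (target equals sender)."""
    found = []
    for line in capture.splitlines():
        match = REQUEST.search(line)
        if match and match.group("target") == match.group("sender"):
            found.append(match.group("target"))
    return found


SAMPLE = """\
07:46:07.651105 ARP, Request who-has 127.0.1.1 tell 127.0.1.1, length 46
07:46:08.000000 ARP, Request who-has 192.168.40.1 tell 192.168.40.64, length 46
07:46:09.000000 ARP, Request who-has 192.168.40.4 (ff:ff:ff:ff:ff:ff) tell 192.168.40.4, length 46
"""

if __name__ == "__main__":
    print(gratuitous_announcements(SAMPLE))
```

If the virtual IP never shows up in this list on the peer node, the announcement is probably not reaching it.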
Nats Server working with heartbeat
In our team, heartbeat works with the NATS server to ensure that nats-server is always alive. For the NATS server, there is one useful command, ‘nats-top’, to check whether it is working.
cd nats-installation-directory
Conclusion
Debugging heartbeat is not only about heartbeat itself; it also requires knowing the environment and understanding the ARP protocol. In this sense, what we do is based entirely on what we learned before, such as ARP and IP interfaces. But we need to apply what we learn in real work, try things out, and fix problems as quickly as possible, to benefit our team and even our company.