Example RDQM HA configurations and errors
An example RDQM HA configuration, complete with example errors and information on how to resolve them.
The example RDQM HA group consists of three nodes:
- mqhavm13.gamsworthwilliam.com (referred to as vm13)
- mqhavm14.gamsworthwilliam.com (referred to as vm14)
- mqhavm15.gamsworthwilliam.com (referred to as vm15)
Three RDQM HA queue managers have been created:
- HAQM1 (created on vm13)
- HAQM2 (created on vm14)
- HAQM3 (created on vm15)
Initial conditions
The initial condition on each of the nodes is given in the following listings:
- vm13

[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM1
Node:                       mqhavm13.gamsworthwilliam.com
Queue manager status:       Running
CPU:                        0.00%
Memory:                     135MB
Queue manager file system:  51MB used, 1.0GB allocated [5%]
HA role:                    Primary
HA status:                  Normal
HA control:                 Enabled
HA current location:        This node
HA preferred location:      This node
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm14.gamsworthwilliam.com
HA status:                  Normal

Node:                       mqhavm15.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM2
Node:                       mqhavm13.gamsworthwilliam.com
Queue manager status:       Running elsewhere
HA role:                    Secondary
HA status:                  Normal
HA control:                 Enabled
HA current location:        mqhavm14.gamsworthwilliam.com
HA preferred location:      mqhavm14.gamsworthwilliam.com
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm14.gamsworthwilliam.com
HA status:                  Normal

Node:                       mqhavm15.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM3
Node:                       mqhavm13.gamsworthwilliam.com
Queue manager status:       Running elsewhere
HA role:                    Secondary
HA status:                  Normal
HA control:                 Enabled
HA current location:        mqhavm15.gamsworthwilliam.com
HA preferred location:      mqhavm15.gamsworthwilliam.com
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm14.gamsworthwilliam.com
HA status:                  Normal

Node:                       mqhavm15.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.
- vm14

[midtownjojo@mqhavm14 ~]$ rdqmstatus -m HAQM1
Node:                       mqhavm14.gamsworthwilliam.com
Queue manager status:       Running elsewhere
HA role:                    Secondary
HA status:                  Normal
HA control:                 Enabled
HA current location:        mqhavm13.gamsworthwilliam.com
HA preferred location:      mqhavm13.gamsworthwilliam.com
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm13.gamsworthwilliam.com
HA status:                  Normal

Node:                       mqhavm15.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm14 ~]$ rdqmstatus -m HAQM2
Node:                       mqhavm14.gamsworthwilliam.com
Queue manager status:       Running
CPU:                        0.00%
Memory:                     135MB
Queue manager file system:  51MB used, 1.0GB allocated [5%]
HA role:                    Primary
HA status:                  Normal
HA control:                 Enabled
HA current location:        This node
HA preferred location:      This node
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm13.gamsworthwilliam.com
HA status:                  Normal

Node:                       mqhavm15.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm14 ~]$ rdqmstatus -m HAQM3
Node:                       mqhavm14.gamsworthwilliam.com
Queue manager status:       Running elsewhere
HA role:                    Secondary
HA status:                  Normal
HA control:                 Enabled
HA current location:        mqhavm15.gamsworthwilliam.com
HA preferred location:      mqhavm15.gamsworthwilliam.com
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm13.gamsworthwilliam.com
HA status:                  Normal

Node:                       mqhavm15.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.
- vm15

[midtownjojo@mqhavm15 ~]$ rdqmstatus -m HAQM1
Node:                       mqhavm15.gamsworthwilliam.com
Queue manager status:       Running elsewhere
HA role:                    Secondary
HA status:                  Normal
HA control:                 Enabled
HA current location:        mqhavm13.gamsworthwilliam.com
HA preferred location:      mqhavm13.gamsworthwilliam.com
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm13.gamsworthwilliam.com
HA status:                  Normal

Node:                       mqhavm14.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm15 ~]$ rdqmstatus -m HAQM2
Node:                       mqhavm15.gamsworthwilliam.com
Queue manager status:       Running elsewhere
HA role:                    Secondary
HA status:                  Normal
HA control:                 Enabled
HA current location:        mqhavm14.gamsworthwilliam.com
HA preferred location:      mqhavm14.gamsworthwilliam.com
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm13.gamsworthwilliam.com
HA status:                  Normal

Node:                       mqhavm14.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm15 ~]$ rdqmstatus -m HAQM3
Node:                       mqhavm15.gamsworthwilliam.com
Queue manager status:       Running
CPU:                        0.02%
Memory:                     135MB
Queue manager file system:  51MB used, 1.0GB allocated [5%]
HA role:                    Primary
HA status:                  Normal
HA control:                 Enabled
HA current location:        This node
HA preferred location:      This node
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm13.gamsworthwilliam.com
HA status:                  Normal

Node:                       mqhavm14.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.
DRBD scenarios
RDQM HA configurations use DRBD for data replication. The following scenarios illustrate possible problems with DRBD:
- Loss of DRBD quorum
- Loss of a single DRBD connection
- Synchronization stuck
DRBD Scenario 1: Loss of DRBD quorum
If the node running an RDQM HA queue manager loses the DRBD quorum for the DRBD resource corresponding to the queue manager, DRBD immediately starts returning errors from I/O operations, which causes the queue manager to produce FDCs and eventually stop.
If the remaining two nodes have a DRBD quorum for the DRBD resource, Pacemaker chooses one of those two nodes to start the queue manager. Because no updates occurred on the original node after the quorum was lost, it is safe to start the queue manager somewhere else.
The two main ways to monitor for a loss of DRBD quorum are:
- By using the rdqmstatus command.
- By monitoring the syslog of the node where the RDQM HA queue manager is initially running.
If you use the rdqmstatus command and the node vm13 loses DRBD quorum for the DRBD resource for HAQM1, you might see status similar to the following example:
[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM1
Node:                       mqhavm13.gamsworthwilliam.com
Queue manager status:       Running elsewhere
HA role:                    Secondary
HA status:                  Remote unavailable
HA control:                 Enabled
HA current location:        mqhavm14.gamsworthwilliam.com
HA preferred location:      This node
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm14.gamsworthwilliam.com
HA status:                  Remote unavailable
HA out of sync data:        0KB

Node:                       mqhavm15.gamsworthwilliam.com
HA status:                  Remote unavailable
HA out of sync data:        0KB

Command '/opt/mqm/bin/rdqmstatus' run with sudo.
Notice that the HA status has changed to Remote unavailable, which indicates that both DRBD connections to the other nodes have been lost.
In this case the other two nodes have DRBD quorum for the DRBD resource so the RDQM is running somewhere else, on mqhavm14.gamsworthwilliam.com as shown as the value of HA current location.
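Because the rdqmstatus output is made up of regular "label: value" lines, it lends itself to scripted monitoring. The following is a minimal sketch, in Python for illustration only: the helper names are hypothetical, and in practice you would capture the output of rdqmstatus -m <qmname> yourself rather than use the embedded sample.

```python
def parse_rdqmstatus(text):
    """Parse 'Label: value' lines from rdqmstatus output.

    Returns the fields for the local node (everything before the first
    remote 'Node:' entry) plus a list of remote-node field dicts.
    """
    local, remotes = {}, []
    current = local
    for line in text.splitlines():
        if ":" not in line:
            continue
        label, _, value = line.partition(":")
        label, value = label.strip(), value.strip()
        if label == "Node" and local:   # start of a remote-node section
            current = {}
            remotes.append(current)
        current[label] = value
    return local, remotes

def quorum_lost(remotes):
    """Heuristic: both DRBD connections lost => every remote is unavailable."""
    return bool(remotes) and all(
        r.get("HA status") == "Remote unavailable" for r in remotes
    )

# Sample trimmed from the listing above.
sample = """\
Node:                 mqhavm13.gamsworthwilliam.com
Queue manager status: Running elsewhere
HA role:              Secondary
HA status:            Remote unavailable
Node:                 mqhavm14.gamsworthwilliam.com
HA status:            Remote unavailable
Node:                 mqhavm15.gamsworthwilliam.com
HA status:            Remote unavailable
"""
local, remotes = parse_rdqmstatus(sample)
print(quorum_lost(remotes))  # True: both remote nodes report Remote unavailable
```

A monitoring script built this way can alert on the transition to Remote unavailable rather than requiring someone to run rdqmstatus by hand.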
Monitoring syslog
If you monitor syslog, you will see that DRBD logs a message when it loses quorum for a resource:
Jul 30 09:38:36 mqhavm13 kernel: drbd haqm1/0 drbd100: quorum( yes -> no )
When quorum is restored, a similar message is logged:
Jul 30 10:27:32 mqhavm13 kernel: drbd haqm1/0 drbd100: quorum( no -> yes )
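These kernel log lines follow a fixed pattern, so a log watcher can pick out quorum transitions per resource. The following is a minimal sketch, assuming the message format shown above; the function name and the embedded sample log lines are for illustration only.

```python
import re

# Matches e.g. "drbd haqm1/0 drbd100: quorum( yes -> no )"
QUORUM_RE = re.compile(r"drbd (\S+)/\d+ drbd\d+: quorum\( (yes|no) -> (yes|no) \)")

def quorum_events(lines):
    """Yield (resource, 'lost' | 'restored') for each quorum transition."""
    for line in lines:
        m = QUORUM_RE.search(line)
        if m:
            resource, _, new = m.groups()
            yield resource, "lost" if new == "no" else "restored"

log = [
    "Jul 30 09:38:36 mqhavm13 kernel: drbd haqm1/0 drbd100: quorum( yes -> no )",
    "Jul 30 10:27:32 mqhavm13 kernel: drbd haqm1/0 drbd100: quorum( no -> yes )",
]
for resource, event in quorum_events(log):
    print(resource, event)   # haqm1 lost, then haqm1 restored
```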
DRBD Scenario 2: Loss of a single DRBD connection
If only one of the two DRBD connections from a node running an RDQM HA queue manager is lost then the queue manager does not move.
Starting from the same initial conditions as in the first scenario, after blocking just one of the DRBD replication links, the status reported by rdqmstatus on vm13 is similar to the following example:
Node:                       mqhavm13.gamsworthwilliam.com
Queue manager status:       Running
CPU:                        0.01%
Memory:                     133MB
Queue manager file system:  52MB used, 1.0GB allocated [5%]
HA role:                    Primary
HA status:                  Mixed
HA control:                 Enabled
HA current location:        This node
HA preferred location:      This node
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm14.gamsworthwilliam.com
HA status:                  Remote unavailable
HA out of sync data:        0KB

Node:                       mqhavm15.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.
DRBD Scenario 3: Synchronization stuck
Some versions of DRBD had an issue where a synchronization would appear to be stuck, which prevented an RDQM HA queue manager from failing over to a node while the sync to that node was still in progress.
One way to see this is to use the drbdadm status command. When operating normally, a response similar to the following example is output:
[midtownjojo@mqhavm13 ~]$ drbdadm status
haqm1 role:Primary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate

haqm2 role:Secondary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Primary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate

haqm3 role:Secondary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Primary
    peer-disk:UpToDate
If synchronization gets stuck, the response is similar to the following example:
[midtownjojo@mqhavm13 ~]$ drbdadm status
haqm1 role:Primary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:90.91

haqm2 role:Secondary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Primary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate

haqm3 role:Secondary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Primary
    peer-disk:UpToDate
In this case the RDQM HA queue manager HAQM1 cannot move to vm15 as the disk on vm15 is Inconsistent.
The done value is the percentage complete. If that value is not increasing, you can try disconnecting the replica and then connecting it again with the following commands (run as root) on vm13:
drbdadm disconnect haqm1:mqhavm15.gamsworthwilliam.com
drbdadm connect haqm1:mqhavm15.gamsworthwilliam.com
If the replication to both Secondary nodes is stuck, you can run the disconnect and connect commands without specifying a node, which disconnects both connections:
drbdadm disconnect haqm1
drbdadm connect haqm1
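Whether a done value is increasing can only be judged by comparing two samples of drbdadm status taken some time apart. The following is a minimal sketch of that comparison, in Python for illustration; the parsing assumes the status format shown above, the function names are hypothetical, and actually running drbdadm (or the reconnect) is left out.

```python
import re

def sync_progress(status_text):
    """Map peer name -> 'done' percentage for peers still synchronizing."""
    progress, peer = {}, None
    for line in status_text.splitlines():
        m = re.match(r"\s*(\S+)\s+role:", line)
        if m:
            peer = m.group(1)          # resource or peer-node heading line
        m = re.search(r"done:([\d.]+)", line)
        if m and peer is not None:
            progress[peer] = float(m.group(1))
    return progress

def stuck_peers(earlier, later):
    """Peers whose 'done' percentage did not increase between two samples."""
    a, b = sync_progress(earlier), sync_progress(later)
    return [peer for peer in b if peer in a and b[peer] <= a[peer]]

# Sample trimmed from the stuck-synchronization listing above.
sample = """\
haqm1 role:Primary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:90.91
"""
# Two identical samples taken some time apart: the sync has not progressed.
print(stuck_peers(sample, sample))  # ['mqhavm15.gamsworthwilliam.com']
```

A peer reported by stuck_peers would be the candidate for the disconnect and connect commands shown above.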
Pacemaker scenarios
RDQM HA configurations use Pacemaker to determine where an RDQM HA queue manager runs. The following scenarios illustrate possible problems that involve Pacemaker:
- Corosync main process not scheduled
- RDQM HA queue manager not running where it should
Pacemaker scenario 1: Corosync main process not scheduled
If you see a message in the syslog similar to the following example, it indicates that the system is either too busy to schedule CPU time for the main Corosync process or, more commonly, that the system is a virtual machine and the hypervisor has not scheduled any CPU time for the entire VM:
corosync[10800]: [MAIN ] Corosync main process was not scheduled for 2787.0891 ms (threshold is 1320.0000 ms). Consider token timeout increase.
Both Corosync (and therefore Pacemaker) and DRBD have timers that are used to detect loss of quorum, so messages like the example indicate that the node did not run for so long that it would have been dropped from the quorum. The Corosync timeout is 1.65 seconds, and the threshold of 1.32 seconds is 80% of that, so the message shown in the example is printed when the delay in the scheduling of the main Corosync process reaches 80% of the timeout. In the example, the process was not scheduled for nearly three seconds. Whatever is causing such a problem must be resolved. One thing that might help in a similar situation is to reduce the requirements of the VM, for example, by reducing the number of vCPUs, as this makes it easier for the hypervisor to schedule the VM.
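The 80% relationship between the Corosync token timeout and the warning threshold can be checked with simple arithmetic, using the values from the example message (in milliseconds):

```python
# Values taken from the example syslog message above.
token_timeout_ms = 1650.0            # Corosync token timeout (1.65 seconds)
threshold_ms = 0.8 * token_timeout_ms
print(threshold_ms)                  # 1320.0, matching "threshold is 1320.0000 ms"

delay_ms = 2787.0891                 # delay reported in the example message
print(delay_ms > token_timeout_ms)   # True: the delay exceeded the full timeout,
                                     # so the node would have been dropped
```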
Pacemaker scenario 2: An RDQM HA queue manager is not running where it should be
The main tool to help troubleshooting in this scenario is the crm status command. The following example shows a response for the configuration when everything is working as expected:
Stack: corosync
Current DC: mqhavm13.gamsworthwilliam.com (version 1.1.20.linbit-1+20190404+eab6a2092b71.el7.2-eab6a2092b) - partition with quorum
Last updated: Tue Jul 30 09:11:29 2019
Last change: Tue Jul 30 09:10:34 2019 by root via crm_attribute on mqhavm14.gamsworthwilliam.com

3 nodes configured
18 resources configured

Online: [ mqhavm13.gamsworthwilliam.com mqhavm14.gamsworthwilliam.com mqhavm15.gamsworthwilliam.com ]

Full list of resources:

 Master/Slave Set: ms_drbd_haqm1 [p_drbd_haqm1]
     Masters: [ mqhavm13.gamsworthwilliam.com ]
     Slaves: [ mqhavm14.gamsworthwilliam.com mqhavm15.gamsworthwilliam.com ]
 p_fs_haqm1     (ocf::heartbeat:Filesystem):   Started mqhavm13.gamsworthwilliam.com
 p_rdqmx_haqm1  (ocf::ibm:rdqmx):              Started mqhavm13.gamsworthwilliam.com
 haqm1          (ocf::ibm:rdqm):               Started mqhavm13.gamsworthwilliam.com
 Master/Slave Set: ms_drbd_haqm2 [p_drbd_haqm2]
     Masters: [ mqhavm14.gamsworthwilliam.com ]
     Slaves: [ mqhavm13.gamsworthwilliam.com mqhavm15.gamsworthwilliam.com ]
 p_fs_haqm2     (ocf::heartbeat:Filesystem):   Started mqhavm14.gamsworthwilliam.com
 p_rdqmx_haqm2  (ocf::ibm:rdqmx):              Started mqhavm14.gamsworthwilliam.com
 haqm2          (ocf::ibm:rdqm):               Started mqhavm14.gamsworthwilliam.com
 Master/Slave Set: ms_drbd_haqm3 [p_drbd_haqm3]
     Masters: [ mqhavm15.gamsworthwilliam.com ]
     Slaves: [ mqhavm13.gamsworthwilliam.com mqhavm14.gamsworthwilliam.com ]
 p_fs_haqm3     (ocf::heartbeat:Filesystem):   Started mqhavm15.gamsworthwilliam.com
 p_rdqmx_haqm3  (ocf::ibm:rdqmx):              Started mqhavm15.gamsworthwilliam.com
 haqm3          (ocf::ibm:rdqm):               Started mqhavm15.gamsworthwilliam.com
Note the following points:
- All three nodes are shown as Online.
- Each RDQM HA queue manager is running on the node where it was created, for example, HAQM1 is running on vm13 and so on.
This scenario is constructed by preventing HAQM1 from running on vm14, and then attempting to move HAQM1 to vm14. HAQM1 cannot run on vm14 because the file /var/mqm/mqs.ini on vm14 has an invalid value for the Directory of queue manager HAQM1.
The preferred location for HAQM1 is changed to vm14 by running the following command on vm13:
rdqmadm -m HAQM1 -n mqhavm14.gamsworthwilliam.com -p
This command would normally cause HAQM1 to move to vm14, but in this case checking the status on vm13 returns the following information:
[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM1
Node:                       mqhavm13.gamsworthwilliam.com
Queue manager status:       Running
CPU:                        0.15%
Memory:                     133MB
Queue manager file system:  52MB used, 1.0GB allocated [5%]
HA role:                    Primary
HA status:                  Normal
HA control:                 Enabled
HA current location:        This node
HA preferred location:      mqhavm14.gamsworthwilliam.com
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm14.gamsworthwilliam.com
HA status:                  Normal

Node:                       mqhavm15.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.
HAQM1 is still running on vm13; it has not moved to vm14 as requested, and the cause needs investigating. Examining the Pacemaker status gives the following response:
[midtownjojo@mqhavm13 ~]$ crm status
Stack: corosync
Current DC: mqhavm13.gamsworthwilliam.com (version 1.1.20.linbit-1+20190404+eab6a2092b71.el7.2-eab6a2092b) - partition with quorum
Last updated: Thu Aug 1 14:16:40 2019
Last change: Thu Aug 1 14:16:35 2019 by hacluster via crmd on mqhavm14.gamsworthwilliam.com

3 nodes configured
18 resources configured

Online: [ mqhavm13.gamsworthwilliam.com mqhavm14.gamsworthwilliam.com mqhavm15.gamsworthwilliam.com ]

Full list of resources:

 Master/Slave Set: ms_drbd_haqm1 [p_drbd_haqm1]
     Masters: [ mqhavm13.gamsworthwilliam.com ]
     Slaves: [ mqhavm14.gamsworthwilliam.com mqhavm15.gamsworthwilliam.com ]
 p_fs_haqm1     (ocf::heartbeat:Filesystem):   Started mqhavm13.gamsworthwilliam.com
 p_rdqmx_haqm1  (ocf::ibm:rdqmx):              Started mqhavm13.gamsworthwilliam.com
 haqm1          (ocf::ibm:rdqm):               Started mqhavm13.gamsworthwilliam.com
 Master/Slave Set: ms_drbd_haqm2 [p_drbd_haqm2]
     Masters: [ mqhavm14.gamsworthwilliam.com ]
     Slaves: [ mqhavm13.gamsworthwilliam.com mqhavm15.gamsworthwilliam.com ]
 p_fs_haqm2     (ocf::heartbeat:Filesystem):   Started mqhavm14.gamsworthwilliam.com
 p_rdqmx_haqm2  (ocf::ibm:rdqmx):              Started mqhavm14.gamsworthwilliam.com
 haqm2          (ocf::ibm:rdqm):               Started mqhavm14.gamsworthwilliam.com
 Master/Slave Set: ms_drbd_haqm3 [p_drbd_haqm3]
     Masters: [ mqhavm15.gamsworthwilliam.com ]
     Slaves: [ mqhavm13.gamsworthwilliam.com mqhavm14.gamsworthwilliam.com ]
 p_fs_haqm3     (ocf::heartbeat:Filesystem):   Started mqhavm15.gamsworthwilliam.com
 p_rdqmx_haqm3  (ocf::ibm:rdqmx):              Started mqhavm15.gamsworthwilliam.com
 haqm3          (ocf::ibm:rdqm):               Started mqhavm15.gamsworthwilliam.com

Failed Resource Actions:
* haqm1_monitor_0 on mqhavm14.gamsworthwilliam.com 'not installed' (5): call=372, status=complete, exitreason='', last-rc-change='Thu Aug 1 14:16:37 2019', queued=0ms, exec=17ms
Take note of the Failed Resource Actions section that has appeared.
The name of the action, haqm1_monitor_0, tells us that a monitor action for the RDQM HAQM1 failed, and that it failed on mqhavm14.gamsworthwilliam.com. So it looks like Pacemaker tried to do what was expected and start HAQM1 on vm14, but for some reason it could not.
You can see when Pacemaker tried to do this by looking at the value of the last-rc-change parameter.
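The fields of a Failed Resource Actions entry (resource, action, node, and reason) can be extracted programmatically, which is useful when scripting checks across several nodes. The following is a minimal sketch, assuming the entry format shown above; the regular expression and names are for illustration only.

```python
import re

# Matches e.g. "haqm1_monitor_0 on mqhavm14.gamsworthwilliam.com 'not installed'"
FAILED_RE = re.compile(
    r"(?P<resource>\w+)_(?P<action>\w+?)_\d+ on (?P<node>\S+) '(?P<reason>[^']+)'"
)

entry = ("haqm1_monitor_0 on mqhavm14.gamsworthwilliam.com 'not installed' (5): "
         "call=372, status=complete, exitreason='', "
         "last-rc-change='Thu Aug 1 14:16:37 2019', queued=0ms, exec=17ms")

m = FAILED_RE.search(entry)
# Tells you which resource failed, doing what, where, and why.
print(m.group("resource"), m.group("action"), m.group("node"), m.group("reason"))
```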
Understand the failure
To understand the failure, we need to look at the syslog for vm14 at the time of the failure:
Aug 1 14:16:37 mqhavm14 crmd[26377]: notice: Result of probe operation for haqm1 on mqhavm14.gamsworthwilliam.com: 5 (not installed)
The entry shows that when Pacemaker tried to check the state of haqm1 on vm14, it received an error because haqm1 is not configured there, as a result of the deliberate misconfiguration in /var/mqm/mqs.ini.
Correcting the failure
To correct the failure, we must correct the underlying problem (in this case, restoring the correct directory value for haqm1 in /var/mqm/mqs.ini on vm14). Then we must clear the failed action by using the command crm resource cleanup on the appropriate resource, which in this case is haqm1 as that is the resource mentioned in the failed action. For example:
[midtownjojo@mqhavm13 ~]$ crm resource cleanup haqm1
Cleaned up haqm1 on mqhavm15.gamsworthwilliam.com
Cleaned up haqm1 on mqhavm14.gamsworthwilliam.com
Cleaned up haqm1 on mqhavm13.gamsworthwilliam.com
Then check the Pacemaker status again:
[midtownjojo@mqhavm13 ~]$ crm status
Stack: corosync
Current DC: mqhavm13.gamsworthwilliam.com (version 1.1.20.linbit-1+20190404+eab6a2092b71.el7.2-eab6a2092b) - partition with quorum
Last updated: Thu Aug 1 14:23:17 2019
Last change: Thu Aug 1 14:23:03 2019 by hacluster via crmd on mqhavm13.gamsworthwilliam.com

3 nodes configured
18 resources configured

Online: [ mqhavm13.gamsworthwilliam.com mqhavm14.gamsworthwilliam.com mqhavm15.gamsworthwilliam.com ]

Full list of resources:

 Master/Slave Set: ms_drbd_haqm1 [p_drbd_haqm1]
     Masters: [ mqhavm14.gamsworthwilliam.com ]
     Slaves: [ mqhavm13.gamsworthwilliam.com mqhavm15.gamsworthwilliam.com ]
 p_fs_haqm1     (ocf::heartbeat:Filesystem):   Started mqhavm14.gamsworthwilliam.com
 p_rdqmx_haqm1  (ocf::ibm:rdqmx):              Started mqhavm14.gamsworthwilliam.com
 haqm1          (ocf::ibm:rdqm):               Started mqhavm14.gamsworthwilliam.com
 Master/Slave Set: ms_drbd_haqm2 [p_drbd_haqm2]
     Masters: [ mqhavm14.gamsworthwilliam.com ]
     Slaves: [ mqhavm13.gamsworthwilliam.com mqhavm15.gamsworthwilliam.com ]
 p_fs_haqm2     (ocf::heartbeat:Filesystem):   Started mqhavm14.gamsworthwilliam.com
 p_rdqmx_haqm2  (ocf::ibm:rdqmx):              Started mqhavm14.gamsworthwilliam.com
 haqm2          (ocf::ibm:rdqm):               Started mqhavm14.gamsworthwilliam.com
 Master/Slave Set: ms_drbd_haqm3 [p_drbd_haqm3]
     Masters: [ mqhavm15.gamsworthwilliam.com ]
     Slaves: [ mqhavm13.gamsworthwilliam.com mqhavm14.gamsworthwilliam.com ]
 p_fs_haqm3     (ocf::heartbeat:Filesystem):   Started mqhavm15.gamsworthwilliam.com
 p_rdqmx_haqm3  (ocf::ibm:rdqmx):              Started mqhavm15.gamsworthwilliam.com
 haqm3          (ocf::ibm:rdqm):               Started mqhavm15.gamsworthwilliam.com
The failed action has disappeared and HAQM1 is now running on vm14 as expected. The following example shows the RDQM status:
[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM1
Node:                       mqhavm13.gamsworthwilliam.com
Queue manager status:       Running elsewhere
HA role:                    Secondary
HA status:                  Normal
HA control:                 Enabled
HA current location:        mqhavm14.gamsworthwilliam.com
HA preferred location:      mqhavm14.gamsworthwilliam.com
HA floating IP interface:   None
HA floating IP address:     None

Node:                       mqhavm14.gamsworthwilliam.com
HA status:                  Normal

Node:                       mqhavm15.gamsworthwilliam.com
HA status:                  Normal

Command '/opt/mqm/bin/rdqmstatus' run with sudo.