Verifying shared file system behavior on Multiplatforms
Run amqmfsck to check whether a shared file system on UNIX
and IBM® i systems meets the
requirements for storing the queue manager data of a multi-instance queue manager. Run the IBM MQ MQI client sample program amqsfhac in
parallel with amqmfsck to demonstrate that a queue manager maintains message
integrity during a failure.
Before you begin
You need a server with networked storage, and two other servers connected to it that have
IBM MQ installed. You must have administrator (root)
authority to configure the file system, and be an IBM MQ Administrator to run amqmfsck.
Failover of a multi-instance queue manager can be triggered by hardware or software failures,
including networking problems which prevent the queue manager writing to its data or log files.
You are mainly interested in causing failures on the file server, but you must also cause the
IBM MQ servers to fail, to test that any locks are
successfully released. To be confident in a shared file system, test all of the following failures,
and any other failures that are specific to your environment:
Shutting down the operating system on the file server including syncing the disks.
Halting the operating system on the file server without syncing the disks.
Pressing the reset button on each of the servers.
Pulling the network cable out of each of the servers.
Pulling the power cable out of each of the servers.
Switching off each of the servers.
Create the directory on the networked storage that you are going to use to share queue manager
data and logs. The directory owner must be an IBM MQ Administrator, or in other words, a member of the mqm group on UNIX. The user who runs the tests must have IBM MQ Administrator authority.
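The ownership requirement above can be checked before you start testing. The following is a minimal sketch, not part of the product: the function name owner_in_group is hypothetical, and it assumes that ls -ld and id -Gn behave in the usual way on your platform.

```sh
# Succeeds if the owner of directory $1 is a member of group $2.
# Sketch only: it parses `ls -ld` output, so unusual user names may confuse it.
owner_in_group() {
    dir=$1
    group=$2
    # Field 3 of `ls -ld` output is the owning user.
    owner=$(ls -ld "$dir" | awk '{print $3}')
    # id -Gn prints every group the user belongs to, separated by spaces.
    for g in $(id -Gn "$owner"); do
        if [ "$g" = "$group" ]; then
            return 0
        fi
    done
    return 1
}

# Hypothetical usage with the directory and group named in this task:
#   owner_in_group /shared/qmdata mqm && echo "owner is a member of mqm"
```

Run the check on the server where the directory is created; user and group names might map differently on the servers that mount it.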
Procedure
In each of the checks, cause all the failures in the previous list while the file
system checker is running. If you intend to run amqsfhac at the same time as
amqmfsck, perform the task Running amqsfhac to test message integrity in parallel with this task.
Mount the exported directory on the two IBM MQ servers.
On the file system server create a shared directory shared, and a
subdirectory to save the data for multi-instance queue managers, qmdata. For an
example of setting up a shared directory for multi-instance queue managers on Linux, see the example in Create a multi-instance queue manager on
Linux.
Check basic file system behavior.
On one IBM MQ server, run the file system checker
with no parameters.
Check concurrently writing to the same directory from both IBM MQ servers.
On both IBM MQ servers, run the file system checker
at the same time with the -c option.
Check waiting for and releasing locks on both IBM MQ servers.
On both IBM MQ servers, run the file system checker at
the same time with the -w option.
Check for data integrity.
Format the test file.
Create a large file in the directory being tested. The file is formatted so that the subsequent
phases can complete successfully. The file must be large enough that there is sufficient time to
interrupt the second phase to simulate the failover. Try the default value of 262144 pages (1 GB).
The program automatically reduces this default on slow file systems so that formatting completes in
about 60 seconds.
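The default of 262144 pages for 1 GB implies a page size of 4 KB. The sketch below shows that arithmetic and converts a target file size into a page count. The function name pages_for_mib is hypothetical. Whether your version of amqmfsck accepts a page count (for example through a -p option on the formatting phase) is an assumption to check against the command reference.

```sh
# Page size implied by the default: 1 GB / 262144 pages = 4096 bytes.
PAGE_BYTES=4096

# Convert a test-file size in MiB to a number of pages.
pages_for_mib() {
    echo $(( $1 * 1024 * 1024 / PAGE_BYTES ))
}

pages_for_mib 1024   # prints 262144
```

Doubling the size to 2048 MiB gives 524288 pages, which roughly doubles the time available to introduce a failure in the write phase.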
Write data into the test file using the file system checker while causing a failure.
Run the test program on two servers at the same time. Start the test program on the server which
is going to experience the failure, then start the test program on the server that is going to
survive the failure. Cause the failure you are investigating.
The first test program stops with an error message. The second test program obtains the lock on
the test file and writes data into the test file starting where the first test program left off. Let
the second test program run to completion.
Table 1. Running the data integrity check on two servers at the same time
IBM MQ server 1:
amqmfsck -a /shared/qmdata
Please start this program on a second machine
with the same parameters.
File lock acquired.
Start a second copy of this program
with the same parameters on another server.
Writing data into test file.
To increase the effectiveness of the test,
interrupt the writing by ending the process,
temporarily breaking the network connection
to the networked storage,
rebooting the server or turning off the power.
[ Turn the power off here. ]
IBM MQ server 2:
amqmfsck -a /shared/qmdata
Waiting for lock...
Waiting for lock...
Waiting for lock...
Waiting for lock...
Waiting for lock...
Waiting for lock...
File lock acquired.
Reading test file
Checking the integrity of the data read.
Appending data into the test file
after data already found.
The test file is full of data.
It is ready to be inspected for data integrity.
The timing of the test depends on the behavior of the file system. For example, it typically
takes 30 - 90 seconds for a file system to release the file locks obtained by the first program
following a power outage. If you have too little time to introduce the failure before the first test
program fills the file, use the -x option of amqmfsck to
delete the test file, then repeat the test from the start with a larger test file.
Verify the integrity of the data in the test file.
Delete the test files.
Run amqmfsck with the -x option. The server responds with the message:
Test files deleted.
Results
The program returns an exit code of zero if the tests complete successfully, and non-zero
otherwise.
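Because the result is reported through the exit code, a script can act on it directly. The following is a minimal sketch of a wrapper. The function name run_check is hypothetical, and it works with any command, not only amqmfsck.

```sh
# Run a command and report the result based on its exit code.
run_check() {
    if "$@"; then
        echo "PASS: $*"
    else
        rc=$?   # exit code of the command that was just run
        echo "FAIL (exit code $rc): $*"
        return "$rc"
    fi
}

# Hypothetical usage with the directory from this task:
#   run_check amqmfsck /shared/qmdata
```

The wrapper returns the original exit code on failure, so a calling script can stop at the first failed check.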
Examples
The first set of three examples shows the command producing minimal output.
Successful test of basic file locking on one server
> amqmfsck /shared/qmdata
The tests on the directory completed successfully.
Successful test of locking on two servers
IBM MQ server 1:
> amqmfsck -w /shared/qmdata
Please start this program on a second
machine with the same parameters.
Lock acquired.
Press Return
or terminate the program to release the lock.
[ Return pressed ]
Lock released.
IBM MQ server 2:
> amqmfsck -w /shared/qmdata
Waiting for lock...
Lock acquired.
The tests on the directory completed successfully
The second set of three examples shows the same commands using verbose mode.
Successful test of basic file locking on one server
> amqmfsck -v /shared/qmdata
System call: stat("/shared/qmdata")
System call: fd = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fchmod(fd, 0666)
System call: fstat(fd)
System call: fcntl(fd, F_SETLK, F_WRLCK)
System call: write(fd)
System call: close(fd)
System call: fd = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fd, F_SETLK, F_WRLCK)
System call: close(fd)
System call: fd1 = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fd1, F_SETLK, F_RDLCK)
System call: fd2 = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fd2, F_SETLK, F_RDLCK)
System call: close(fd2)
System call: write(fd1)
System call: close(fd1)
The tests on the directory completed successfully.
Failed test of basic file locking on one server
> amqmfsck -v /shared/qmdata
System call: stat("/shared/qmdata")
System call: fd = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fchmod(fd, 0666)
System call: fstat(fd)
System call: fcntl(fd, F_SETLK, F_WRLCK)
System call: write(fd)
System call: close(fd)
System call: fd = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fd, F_SETLK, F_WRLCK)
System call: close(fd)
System call: fd = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fd, F_SETLK, F_RDLCK)
System call: fdSameFile = open("/shared/qmdata/amqmfsck.lck", O_RDWR, 0666)
System call: fcntl(fdSameFile, F_SETLK, F_RDLCK)
System call: close(fdSameFile)
System call: write(fd)
AMQxxxx: Error calling 'write()[2]' on file '/shared/qmdata/amqmfsck.lck', errno 2
(Permission denied).
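When a verbose run fails, the call that failed is named in the line containing "Error calling". The following is a minimal sketch that extracts that name from captured output. The function name failed_call is hypothetical, and it assumes that the message keeps the quoted form shown in the failed example.

```sh
# Read verbose checker output on stdin and print the name of the first
# call reported in an "Error calling '...'" message. Prints nothing if
# no such message is present.
failed_call() {
    sed -n "s/.*Error calling '\([^']*\)'.*/\1/p" | head -n 1
}

# Hypothetical usage:
#   amqmfsck -v /shared/qmdata 2>&1 | failed_call
```

Applied to the failed example, the sketch prints write()[2].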
Successful test of locking on two servers
Table 3. Successful locking on two servers - verbose mode
IBM MQ server 1:
> amqmfsck -wv /shared/qmdata
Calling 'stat("/shared/qmdata")'
Calling 'fd = open("/shared/qmdata/amqmfsck.lkw", O_EXCL | O_CREAT | O_RDWR, 0666)'
Calling 'fchmod(fd, 0666)'
Calling 'fstat(fd)'
Please start this program on a second
machine with the same parameters.
Calling 'fcntl(fd, F_SETLK, F_WRLCK)'
Lock acquired.
Press Return
or terminate the program to release the lock.
IBM MQ server 2:
Calling 'fcntl(fd, F_SETLK, F_WRLCK)'
Lock acquired.
The tests on the directory completed successfully
Running amqsfhac to test message integrity
amqsfhac checks that a queue manager using networked storage maintains data
integrity following a failure.
Before you begin
You require four servers for this test: two for the multi-instance queue manager, one for
the file system, and one for running amqsfhac as an IBM MQ MQI client application.
Follow step 1 in Procedure to set
up the file system for a multi-instance queue manager.
About this task
Procedure
Create a multi-instance queue manager, QM1, on another server, using the file
system that you created in step 1 of Procedure.
Start the queue manager on both servers, making it highly available.
On server 1:
strmqm -x QM1
On server 2:
strmqm -x QM1
Set up the client connection to run amqsfhac.
To set up a client connection, use the procedure in Verifying an IBM MQ installation for the platform, or platforms, that your enterprise uses,
or the example scripts in Reconnectable client samples.
Modify the client channel to have two IP addresses, corresponding to the two servers running
QM1.
In the example script, modify:
Where server1 and server2 are the host names of the two servers,
and 2345 is the port that the channel listener is listening on. The listener port usually
defaults to 1414, so you can use 1414 with the default listener
configuration.
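A client channel that names both servers uses a comma-separated connection name list, with each entry written as host(port). The following is a minimal sketch that builds such a list. The function name conname_list is hypothetical, and the host names and port are the ones used in this step.

```sh
# Build a comma-separated connection name list: host1(port),host2(port),...
# Arguments: the listener port, followed by one or more host names.
conname_list() {
    port=$1
    shift
    list=""
    for host in "$@"; do
        # Prefix a comma only when the list already has an entry.
        list="${list:+$list,}$host($port)"
    done
    echo "$list"
}

conname_list 2345 server1 server2   # prints server1(2345),server2(2345)
```

The resulting string is the value to place in the CONNAME attribute of the client-connection channel definition.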
Create two local queues on QM1 for the test.
Run the following MQSC script:
If you stop the active queue manager instance, amqsfhac reconnects to the
other queue manager instance once it has become active. Restart the stopped queue manager instance
so that you can reverse the failure in your next test. You will probably need to increase the
number of iterations, based on experimentation with your environment, so that the test program runs
long enough for the failover to occur.
Results
An example of running amqsfhac in step 6 is shown in
Figure 9. The test is a success.
If the test detected a problem, the output would report the failure. In some test runs,
MQRC_CALL_INTERRUPTED might report Resolving to backed out.
This makes no difference to the result. The outcome depends on whether the write to disk was committed
by the networked file storage before or after the failure took place.