Enabling process restart on failure

In a distributed environment, we can use the health management feature to monitor the status of application servers, nodes, clusters, dynamic clusters, on demand routers, and cells so that we can sense and respond to problem areas before an outage occurs. We can manage the health of an application serving environment with a policy-driven approach that triggers specific actions when monitored criteria are met. For example, when the memory usage of an application server exceeds a percentage of the heap size for a specified time, health policy actions can run to correct the situation. The following list shows some of the predefined health policy actions that are applicable to excessive memory usage:

Take thread dumps
Take JVM heap dumps
Generate an SNMP trap
Place server in maintenance mode
Place server in maintenance mode and break affinity
Place server out of maintenance mode
Restart server

All of the listed actions can be grouped and used in a custom sequence to help detect and correct the problem. We can use the dmgr console to set health policies by clicking Operational policies | Health policies.
For example, you might set a sequence of these actions to run when your server exceeds 90 percent of the JVM heap size for a period of two minutes.
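
As a rough illustration of the kind of condition such a policy expresses, the following sketch checks whether heap usage has stayed above a threshold for a given time. It is not how the health controller is implemented. The function name breach_time and the "epoch_seconds heap_percent" input format are assumptions made up for this example.

```sh
#!/bin/sh
# Hypothetical sketch: NOT the WebSphere health controller.
# Reads "epoch_seconds heap_percent" samples on stdin and prints the
# timestamp at which the condition "heap above THRESHOLD percent for
# at least HOLD seconds" is first met. Prints nothing if never met.
breach_time() {
    threshold=$1   # percent of heap, for example 90
    hold=$2        # seconds the excess must persist, for example 120
    start=""       # time at which the current run of high samples began
    while read -r ts pct; do
        if [ "$pct" -gt "$threshold" ]; then
            # Remember when usage first went above the threshold.
            if [ -z "$start" ]; then
                start=$ts
            fi
            if [ $((ts - start)) -ge "$hold" ]; then
                echo "$ts"
                return 0
            fi
        else
            start=""   # usage dropped back, so reset the timer
        fi
    done
    return 1
}
```

A brief dip below the threshold resets the timer, so only a sustained excess counts. In a real policy, the listed actions (thread dump, heap dump, restart, and so on) would then run in the configured order.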

The two reaction modes for the health management monitor are:

Supervise: When the health condition is reached, a task is submitted with a suggested plan of action, which is carried out automatically if the task is approved.
Automatic: When the health condition is reached, the actions are automatically carried out in the order you previously defined.

We can define a large number of custom health conditions, along with the actions to take when those conditions are breached. Intelligent management features help you recover from the most common operational issues, but there is also a more general way to restart server processes: we can use native operating system functionality to restart a failed process.
The following sections describe how to set this up on each operating system.

Windows

The administrator can choose to register one or more of the WAS processes on a machine as a Windows service during profile creation. It can also be done after profile creation using the WASService command. When a process is registered this way, Windows can automatically attempt to restart the service if it fails during use.

Syntax

Enter WASService.exe with no arguments to get a list of the valid formats.

WASService command format
Usage: WASService.exe -add <service name>
           -serverName <Server>
           -profilePath <Server's Profile Directory>
           [-wasHome <WebSphere Install Directory>]
           [-configRoot <Config Repository Directory>]
           [-startArgs <additional start arguments>]
           [-stopArgs <additional stop arguments>]
           [-userid <execution id> -password <password>]
           [-logFile <service log file>]
           [-logRoot <server's log directory>]
           [-encodeParams]
           [-restart <true | false>]
           [-startType <automatic | manual | disabled>]
       || -remove <service name>
       || -start <service name> [optional startServer.bat parameters]
       || -stop <service name> [optional stopServer.bat parameters]
       || -status <service name>
       || -encodeParams <service name>

Considerations...

When adding a new service, the -serverName argument is mandatory. The serverName is the process name. If in doubt, use the serverStatus -all command to display the processes. For a deployment manager, the server name is dmgr. For a node agent, the server name is nodeagent, and for an application server, it is the name of that server.
The -profilePath argument is mandatory. It specifies the home directory for the profile.

Use unique service names. The services are listed in the Windows Services control window as:
IBM WAS V8.5 - <service name>

The convention used by the Profile Management Tool is to use the node name as the service name for a node agent. For a deployment manager, it uses the node name of the deployment manager node concatenated with dmgr as the service name.

Registering a deployment manager as a Windows 7 service

$ runas /user:IBM-CMierlea\admin "/WAS/AppServer/bin\WASService -add "dmgr" -servername dmgr -profilePath "D:\was85\IBM\WebSphere\AppServr_85_01" -restart true"
Enter the password for IBM-CMierlea\admin:
Attempting to start /WAS/AppServer/bin\WASService -add dmgr -servername dmgr -profilePath /WAS/AppServer/profiles\Dmgr_85_01 -restart true as user "IBM-CM ..
/WAS/AppServer/bin$

The service name added will be IBM WAS V8.5, concatenated with the name you specified for the service name. We can set recovery actions in case of failure using the Recovery tab under the Properties of the new service.

If you remove the service using the WASService -remove command, specify only the latter portion of the name, that is, the service name you passed to -add without the product prefix.

/WAS/AppServer/bin$ runas /user:IBM-CMierlea\admin "/WAS/AppServer/bin\WASService -remove "dmgr""
Enter the password for IBM-CMierlea\admin:
Attempting to start /WAS/AppServer/bin\WASService -remove dmgr as user "IBM-CMierlea\admin" ...
/WAS/AppServer/bin$

UNIX and Linux

The administrator can choose to include entries in inittab for one or more of the WAS processes on a machine. Each such process is then automatically restarted if it has failed.
Inittab contents for process restart

On deployment manager machine:
ws1:23:respawn:/usr/WebSphere/DeploymentManager/bin/startManager.sh

On node machine:
ws1:23:respawn:/usr/WebSphere/AppServer/bin/startNode.sh
ws2:23:respawn:/usr/WebSphere/AppServer/bin/startServer.sh nodename_server1
ws3:23:respawn:/usr/WebSphere/AppServer/bin/startServer.sh nodename_server2
ws4:23:respawn:/usr/WebSphere/AppServer/bin/startServer.sh nodename_server3

When setting the action for startServer.sh to respawn in /etc/inittab, be aware that init always restarts the process, even if you intended for it to remain stopped. As an alternative, we can use the rc.was script located in ${WAS_HOME}/bin, which allows you to limit the number of retries.
The best solution is to use a monitoring product that implements notification of outages and logic for automatic restart.
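
To make the idea of limiting retries concrete, here is a minimal sketch of a bounded-restart wrapper. It is not the rc.was script shipped with the product, and it does not reproduce that script's options. The function name run_with_retries and the RETRY_DELAY variable are hypothetical, and the sketch assumes the supervised command stays in the foreground for as long as the process is healthy.

```sh
#!/bin/sh
# Hypothetical sketch: NOT the rc.was script shipped with the product.
# Runs a command in the foreground and restarts it when it exits with a
# nonzero status, giving up after max_retries restarts.
run_with_retries() {
    max_retries=$1
    shift
    attempts=0   # number of failures seen so far
    while :; do
        # A clean exit is treated as an intentional stop: do not restart.
        if "$@"; then
            return 0
        fi
        attempts=$((attempts + 1))
        if [ "$attempts" -gt "$max_retries" ]; then
            echo "giving up after $max_retries restarts" >&2
            return 1
        fi
        echo "process failed; restart $attempts of $max_retries" >&2
        # Pause between restarts; RETRY_DELAY is an assumed tunable.
        sleep "${RETRY_DELAY:-5}"
    done
}
```

Unlike an inittab respawn entry, this wrapper leaves the process stopped after a clean exit and stops trying after the configured number of restarts, so a server that fails repeatedly at startup does not loop forever.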

z/OS

WebSphere for z/OS takes advantage of the z/OS Automatic Restart Management (ARM) to recover application servers. Each application server running on a z/OS system (including servers you create for the business applications) is automatically registered with an ARM group. Each registration uses a special element type called SYSCB. ARM treats SYSCB as restart level 3, ensuring that RRS (a z/OS facility that provides two-phase sync point support across participating resource managers) restarts before any application server.
If an application is critical to the business, you need facilities to manage failures. z/OS provides rich automation interfaces, such as automatic restart management, which we can use to detect and recover from failures. Automatic restart management handles the restarting of servers when failures occur.
Some important things to consider when using automatic restart management:

If ARM is enabled on your system, you might want to disable ARM for the WAS for z/OS address spaces before you install and customize WAS for z/OS. During customization, job errors might cause unnecessary restarts of the WAS for z/OS address spaces. After installation and customization, consider enabling ARM.
If ARM is enabled and you cancel a server with the ARMRESTART option, the server restarts in place.
It is a good idea to set up an ARM policy for the deployment manager and node agents.
If you start the location service daemon on a system that already has one, it will terminate.
Every other server comes up on a dynamic port unless the configuration has a fixed port. Therefore, the fixed ports must be unique in a sysplex.
If you issue STOP, CANCEL, or MODIFY commands against server instances, be aware of how automatic restart management behaves regarding WAS for z/OS server instances.

ARM Behavior and WAS for z/OS server instances

When you issue                             ARM behavior
STOP address_space                         It does not restart the address space.
CANCEL address_space                       It does not restart the address space.
CANCEL address_space,ARMRESTART            It does restart the address space.
MODIFY address_space,CANCEL                It does not restart the address space.
MODIFY address_space,CANCEL,ARMRESTART     It restarts the address space.
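
The behavior listed above can be encoded as a small lookup, for example to annotate an operations runbook. This is only a sketch of that listing: the function name arm_restarts is hypothetical, it handles only the spelled-out command names (not abbreviations), and it does not query ARM itself.

```sh
#!/bin/sh
# Hypothetical helper: maps a console command string to the ARM behavior
# listed above. Prints "yes" if ARM restarts the address space, "no" if
# it does not, and "unknown" for commands outside the listing.
arm_restarts() {
    # Normalize: uppercase and drop spaces so "MODIFY x, CANCEL" matches.
    cmd=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr -d ' ')
    case $cmd in
        CANCEL*,ARMRESTART|MODIFY*,CANCEL,ARMRESTART) echo yes ;;
        STOP*|CANCEL*|MODIFY*,CANCEL) echo no ;;
        *) echo unknown ;;
    esac
}
```

The patterns with ARMRESTART are tested first, so a plain CANCEL falls through to the "no" branch.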

If you activated ARM and want to check the status of address spaces registered for automatic restart management:

Initialize all servers.
Issue one or both of the following commands.

Displaying the status of address spaces registered for ARM
To display all registered address spaces (including the address spaces of server instances), issue the command:
d xcf,armstatus,detail

To display the status of a particular server instance, use the display command and identify the job name. For example, to display the status of the Daemon server instance (job BBODMN), issue the following command:
d xcf,armstatus,jobname=bbodmn,detail
