Troubleshoot health management

Operating Systems: AIX, HP-UX, Linux, Solaris, Windows, z/OS

You can look for the following problems when health management is not working, or not working the way you expect.

Finding the right logs

The health controller is a distributed resource that is managed by the high availability (HA) manager. It exists within all node agent and deployment manager processes and is active within one of these processes. If a process fails, the controller becomes active on another node agent or deployment manager process.
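To determine where the health controller is running, click Runtime Operations > Extended Deployment > Core components in the administrative console. The location and stability status of the health controller are displayed. The health controller logs messages to the node agent or deployment manager process that is indicated by the current location, so look in the logs of that process.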

Performance advisor is enabled with the default memory leak health policy
The default memory leak health policy uses the performance advisor functionality, so the performance advisor is enabled when this policy has members assigned. To disable the performance advisor, remove this health policy or narrow the membership of the health policy. To preserve the health policy for future use, consider keeping the default memory leak policy, but removing all of the members. To change the members, click Operational policies > Health policies > Default_Memory_Leak. You can edit the health policy memberships by adding and removing specific members from the policy.

Health controller settings
The following list contains issues that are encountered as a result of the health controller settings:

Health controller is disabled

Verify the setting in the administrative console by clicking Operational policies > Autonomic controllers > Health controller and selecting both the Configuration and Runtime tabs. The health controller is enabled by default.

Restarts are prohibited at this time

Verify the prohibited restart times in the administrative console by clicking Operational policies > Autonomic controllers > Health controller and by selecting the Prohibited restart field. By default, no times are prohibited.

Restarting too soon after the previous restart

To check the minimum restart interval in the administrative console, click Operational policies > Autonomic controllers > Health controller, and modify the Minimum Restart Interval field if necessary. No minimum interval is defined by default.

Control cycle is too long

To check the control cycle length in the administrative console, click Operational policies > Autonomic controllers > Health controller and adjust the value if necessary. The health controller checks for policy violations periodically. If its control cycle length is too long, it might not restart servers quickly enough.

The server has been restarted X times consecutively, and the health condition continues to be violated

In this case, X indicates the maximum consecutive restart parameter of the health controller. The health controller concludes that restarts are not fixing the problem, and disables the restarts for the server. The following message displays in the log:
WXDH0011W: Server servername exceeded maximum verification failures: disabling restarts.
The health controller continues to monitor the server and displays messages in the log if the health policy is violated:
WXDH0012W: Server servername with restarts disabled failed health check.
You can enable restarts for the server by performing any of the following actions:

Disable and then enable the health controller.
Adjust the Maximum Consecutive Restarts controller setting.
Run the following command from the prompt:
wsadmin -profile HmmControllerProcs.jacl enableServer servername
This script is available in the <install_root>\bin directory on the node agent or deployment manager nodes. This script requires a running deployment manager.
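The controller settings in this section act as a series of gates that a restart must pass. The following Python sketch is a simplified illustration of that reasoning, not the product's implementation. The names ControllerSettings and restart_blockers are hypothetical, the value 3 for the maximum consecutive restarts is an assumed example, and prohibited windows are assumed not to cross midnight.

```python
from dataclasses import dataclass, field
from datetime import datetime, time, timedelta
from typing import List, Optional, Tuple


@dataclass
class ControllerSettings:
    """Hypothetical mirror of the health controller settings named in this section."""
    enabled: bool = True                                   # enabled by default
    prohibited_windows: List[Tuple[time, time]] = field(default_factory=list)  # none by default
    min_restart_interval: timedelta = timedelta(0)         # no minimum by default
    max_consecutive_restarts: int = 3                      # assumed example value


def restart_blockers(
    settings: ControllerSettings,
    now: datetime,
    last_restart: Optional[datetime],
    consecutive_restarts: int,
) -> List[str]:
    """Return the reasons, if any, that would keep a server from being restarted."""
    reasons = []
    if not settings.enabled:
        reasons.append("Health controller is disabled")
    # Windows are (start, end) pairs within one day; midnight-crossing windows are not handled.
    if any(start <= now.time() < end for start, end in settings.prohibited_windows):
        reasons.append("Restarts are prohibited at this time")
    if last_restart is not None and now - last_restart < settings.min_restart_interval:
        reasons.append("Restarting too soon after the previous restart")
    if consecutive_restarts >= settings.max_consecutive_restarts:
        reasons.append("Maximum consecutive restarts reached")
    return reasons
```

An empty list means that none of the controller settings would block the restart, so the cause is more likely one of the policy, placement, or sensor cases in the sections that follow.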

Health policy settings
The following issues are encountered as a result of the health policy settings:

The server is not part of a health policy

Verify that the health policy memberships apply to your server in the administrative console by clicking Operational policies > Health policies.

The reaction mode of a policy containing the server is supervised

Check the administrative console by clicking Runtime Operations > Task Management > Runtime tasks to find approval requests for a restart action for a policy in Supervised mode. Servers are restarted automatically when you set Automatic as the reaction mode. The following message is written to the log for the supervised condition:
WXDH0024I: Server server name has violated the health policy health condition, reaction mode is supervised.

The server is a member of a static cluster and is the only cluster member running

The health policy does not bring down all members of a cluster at the same time. If a cluster has only one member, or only one cluster member is running, that member is not restarted.

The server is a member of a dynamic cluster, the number of running instances does not exceed the minimum value, and the placement controller is disabled

Check the minimum number of instances required for the dynamic cluster by clicking Servers > Dynamic clusters in the administrative console. In this case, health management treats the dynamic cluster like a static cluster, using the minimum number of instances parameter.

The health controller has not received the policy

The health controller might not run on the deployment manager where the health policies are created. If the deployment manager is restarted after the health controller started, the health controller might not have the new policy. You can alleviate this problem by performing the following steps:

1. Disable the health controller. In the administrative console, click Operational policies > Autonomic managers > Health controller.
2. Synchronize the configuration repositories with the back-end nodes. In the administrative console, click System Administration > Nodes. Select the nodes to synchronize, and click Synchronize.
3. Restart the health controller. In the administrative console, click Operational policies > Autonomic managers > Health controller.
4. Synchronize the configuration repositories with the back-end nodes again. In the administrative console, click System Administration > Nodes. Select the nodes to synchronize, and click Synchronize.
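Apart from a policy that has not reached the health controller, the cases in this section depend on a few facts about the server and its cluster. The following Python sketch restates those cases as a checklist. It is only an illustration of the rules described here: the function policy_blockers and its parameter names are hypothetical and do not correspond to any product API.

```python
from typing import List, Optional


def policy_blockers(
    in_policy: bool,
    reaction_mode: str,              # "automatic" or "supervised"
    cluster_type: Optional[str],     # "static", "dynamic", or None for a non-clustered server
    running_members: int = 1,
    min_instances: int = 1,          # minimum number of instances of a dynamic cluster
    placement_enabled: bool = True,  # state of the application placement controller
) -> List[str]:
    """Return the policy-level reasons that would keep a server from being restarted."""
    reasons = []
    if not in_policy:
        reasons.append("The server is not part of a health policy")
    if reaction_mode == "supervised":
        reasons.append("Supervised mode: a restart waits for approval in Runtime tasks")
    if cluster_type == "static" and running_members <= 1:
        reasons.append("Only running member of a static cluster")
    # With placement disabled, a dynamic cluster is treated like a static one,
    # using the minimum number of instances as the threshold.
    if cluster_type == "dynamic" and not placement_enabled and running_members <= min_instances:
        reasons.append("Running instances do not exceed the dynamic cluster minimum")
    return reasons
```

For example, a member of a dynamic cluster with two running instances, a minimum of two instances, and the placement controller disabled yields a single reason, which points to the dynamic cluster case in this section.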
Application placement controller interactions
The following list contains issues that are encountered as a result of the health management and application placement controller interactions:

The server is a member of a dynamic cluster, but the placement controller cannot be contacted

For dynamic cluster members, health monitoring checks with the application placement controller to determine whether a server can be restarted. If the application placement controller is enabled, but cannot be contacted, the following message displays in the log:
WXDH1018E: Could not contact the placement controller
Verify that the placement controller is running. To determine where the health controller is running, click Runtime Operations > Extended Deployment > Core components in the administrative console. The location and stability status of the health controller displays. The health controller logs messages to the particular node agent or deployment manager indicated by the current location.

The server is a member of a dynamic cluster, the placement controller is running, and the placement controller instructs health management not to restart the server

The placement controller might require the server instance to remain running.

The server is stopped, but not started.

In a dynamic cluster, a restart can take one of several forms:

Restart in place (stop server, start server).
Start a server instance on another node, and stop the failing one.
Stop the failing server only, assuming that the remaining application instances can satisfy demand.

The placement controller determines which form a restart takes, and if necessary, where to start the new instance. After a restart is performed in a dynamic cluster, health management issues a request to the placement controller to recompute its placement.

Sensor problems
The following list contains issues that are encountered as a result of problems with the sensors that supply data to health management:

No sensor data is received for the server.

Health management cannot detect a policy violation if it receives no data from the sensors that are required by the policy. If sensor data is not received during the control cycle, health management prints the following log message:
WXDH3001E: No sensor data received during control cycle from server server_name for health class healthpolicy.
For response time conditions, health management receives data from the on demand router (ODR). No data is generated for these conditions until requests are sent through the ODR.
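The WXDH messages quoted in this topic can serve as a quick index into the cases above. The following Python sketch groups log lines by those message IDs. It assumes only that each message appears in the log as its ID followed by a colon, as shown in this topic. The names WXDH_HINTS and summarize_wxdh are hypothetical, and the location of the log file is not covered here.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Message IDs quoted in this topic and the case that each one points to.
WXDH_HINTS = {
    "WXDH0011W": "restarts disabled after repeated unsuccessful restarts",
    "WXDH0012W": "health check failed while restarts are disabled",
    "WXDH0024I": "policy violated, reaction mode is supervised",
    "WXDH1018E": "placement controller could not be contacted",
    "WXDH3001E": "no sensor data received during the control cycle",
}

# Assumed form: "WXDH", four digits, a severity letter, then a colon.
_ID_PATTERN = re.compile(r"\b(WXDH\d{4}[IWE]):")


def summarize_wxdh(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group log lines by the WXDH message ID that they contain."""
    found = defaultdict(list)
    for line in lines:
        match = _ID_PATTERN.search(line)
        if match and match.group(1) in WXDH_HINTS:
            found[match.group(1)].append(line.strip())
    return dict(found)
```

Passing an open log file from the process where the health controller is currently active to summarize_wxdh returns the matching lines for each ID, and WXDH_HINTS maps each ID back to the relevant case in this topic.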

Related concepts

Health management

Related tasks

Configure health management