Health management
The health management subsystem continuously monitors the state of servers and the work performed by the servers in the environment. The health management subsystem consists of two main elements:
- Health controller
Autonomic manager that acts on our health policies. The health controller is a distributed resource that is managed by the high availability manager and exists within every node agent and deployment manager (dmgr) process. The health controller is active in only one of these processes at a time. If the active process fails, the health controller can become active on another node agent or dmgr process. The health controller runs on a control cycle, which defines the amount of time between environment checks. At the end of each control cycle, the health controller checks the environment and generates runtime tasks to resolve any breaches of the health conditions.
- Health policies
Define the health conditions to monitor in the environment and the health actions to take when those conditions are breached. We can disable or enable health management using the health controller while still keeping multiple health policies defined on the system. We can also limit the server restart frequency or prohibit restarts during certain periods.
The health management subsystem functions when Intelligent Management is in automatic or supervised operating mode. When the reaction mode on the policy is set to automatic, the health management system takes action when a health policy violation is detected. In supervised mode, the health management system creates a runtime task that offers one or more reactions. The system administrator can approve or deny the proposed actions.
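Beyond the administrative console, the health controller can also be inspected and tuned with wsadmin scripting. The following Jython sketch is only an illustration: the configuration type name HealthController and the attribute name controlCycleLength are assumptions about how the settings are exposed, so confirm the real names with AdminConfig.types() and AdminConfig.showall() in your own cell first.

```
# Run with: wsadmin -lang jython -f health_controller.py
# Assumption: the health controller is exposed as a cell-scoped configuration
# object of type 'HealthController'; confirm with AdminConfig.types() first.
hcEntries = AdminConfig.list('HealthController').splitlines()
if not hcEntries:
    print 'No HealthController configuration object was found in this cell.'
else:
    hc = hcEntries[0]
    # Print every attribute so the actual attribute names for this release are visible
    print AdminConfig.showall(hc)
    # Assumed attribute: controlCycleLength, the number of minutes between environment checks
    AdminConfig.modify(hc, [['controlCycleLength', '5']])
    AdminConfig.save()
```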
Health conditions
Health conditions define the variables to monitor in the environment. Several categories of health policy conditions exist.
- Age-based condition
- Tracks the amount of time that the server is running. If the amount of time exceeds the defined threshold, the health actions run.
- Excessive request timeout condition
- Tracks the percentage of HTTP requests that time out. When the percentage of timed-out requests exceeds the defined value, the health actions run.
- Excessive response time condition
- Tracks the amount of time that requests take to complete. If the time exceeds the defined response time threshold, the health actions run. Requests that exceed the timeout threshold are not included in the excessive response time calculations. For example, if the default timeout value of 60 seconds is in effect, then any requests that exceed that threshold and time out are not included in the calculations for excessive response time. This restriction applies even if we do not have the excessive request timeout health condition defined in the environment.
- Memory condition: excessive memory usage
- Tracks the memory usage for a member. When the memory usage exceeds a percentage of the heap size for a specified time, health actions run to correct this situation.
- Memory condition: memory leak
- Tracks consistent downward trends in free memory available to a server in the Java heap. When the Java heap approaches the maximum configured size, we can perform either heap dumps or server restarts.
- Storm drain condition
- Tracks requests that have a significantly decreased response time. This policy relies on change point detection on given time series data.
- Workload condition
- Tracks the number of requests that are serviced before the policy members restart to clean out memory and cached data.
- Garbage collection percentage condition
- Monitors a JVM or set of JVMs to determine whether they spend more than a defined percentage of time in garbage collection during a specified time period.
For more information about these conditions, click the help icon on the Define health policy general properties panel in the administrative console.
The predefined health policy conditions are optimized to distribute the data that they need efficiently, to minimize the impact of monitoring, and to enforce the health policy in the environment.
We can also define custom conditions for our health policy if the predefined health conditions do not fit our needs. A custom condition is defined as a subexpression that is tested against metrics in the environment. When defining a custom condition, consider the cost of collecting the data, analyzing the data, and, if needed, enforcing the health policy. This cost can increase with the amount of traffic and the number of servers in the network. Analyze the performance of our custom health conditions before we use them in production. Example:
PMIMetric_FromServerStart$systemModule$cpuUtilization > 90L
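Before enabling a custom condition like the CPU utilization example above in production, it can help to look at the live PMI data that the metric comes from to judge whether the threshold is realistic. This wsadmin Jython sketch dumps the PMI statistics for one server; server1 and node01 are placeholder names, and it assumes PMI is enabled on that server.

```
# Run with: wsadmin -lang jython -f check_pmi.py
# Prints the PMI statistics (including the systemModule CPU utilization used in
# the custom-condition example) for the placeholder server 'server1' on 'node01'.
import java.lang

perfStr = AdminControl.completeObjectName('type=Perf,process=server1,node=node01,*')
srvrStr = AdminControl.completeObjectName('type=Server,name=server1,node=node01,*')
if not perfStr or not srvrStr:
    print 'Perf or Server MBean not found; is the server started with PMI enabled?'
else:
    perfOName = AdminControl.makeObjectName(perfStr)
    srvrOName = AdminControl.makeObjectName(srvrStr)
    params = [srvrOName, java.lang.Boolean('true')]
    sigs = ['javax.management.ObjectName', 'java.lang.Boolean']
    # getStatsString returns a textual dump of all enabled PMI statistics for the server
    print AdminControl.invoke_jmx(perfOName, 'getStatsString', params, sigs)
```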
Health actions
Health actions define the process to use when a health condition is not met. Depending on the conditions defined, the actions can vary. The following table lists the health actions supported in various server environments:
| Health action | WebSphere application servers that run in the same Intelligent Management cell | Other middleware servers (including external WebSphere application servers) |
| --- | --- | --- |
| Restart server | Supported | Supported |
| Take thread dumps | Supported | Not supported |
| Take JVM heap dumps | Supported for servers running on the IBM SDK | Not supported |
| Put server into maintenance mode | Supported | Supported |
| Put server into maintenance mode and break HTTP and SIP request affinity to the server | Supported | Supported |
| Take server out of maintenance mode | Supported | Supported |
| Generate an SNMP trap | Supported | Supported |

In a dynamic cluster, a restart can take one of several forms:
- Restart in place (stop server, start server). This restart always occurs when a dynamic cluster is in manual mode.
- Start a server instance on another node, and stop the failing one.
- Stop the failing server only, assuming that the remaining application instances can satisfy demand.
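The thread dump and JVM heap dump actions in the preceding table correspond to operations on the server JVM MBean, which we can also invoke manually when diagnosing a problem. A minimal wsadmin Jython sketch follows; server1 and node01 are placeholders, and heap dumps require a server running on the IBM SDK.

```
# Run with: wsadmin -lang jython -f take_dumps.py
# Manually collect a thread dump (javacore) and a heap dump from a placeholder server.
jvm = AdminControl.completeObjectName('type=JVM,process=server1,node=node01,*')
if not jvm:
    print 'JVM MBean not found; is server1 on node01 running?'
else:
    # Writes a javacore file to the server profile directory
    AdminControl.invoke(jvm, 'dumpThreads')
    # Requires the IBM SDK; prints the location of the generated heap dump
    print AdminControl.invoke(jvm, 'generateHeapDump')
```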
We can also define a custom action. With a custom action, we define an executable file to run when the health condition is breached. We must define custom actions before we create the health policy that contains them.
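As a hedged illustration of a custom-action executable, the following standalone Python script simply records which server breached a health condition; the script name, the node and server arguments, and the log location are all hypothetical and depend on the arguments we supply when we define the custom action.

```
#!/usr/bin/env python
# Hypothetical custom-action executable: health_action.py <nodeName> <serverName>
# The arguments are whatever we configure when defining the custom action.
import sys
import time

def main():
    if len(sys.argv) < 3:
        sys.stderr.write('usage: health_action.py <nodeName> <serverName>\n')
        return 1
    node, server = sys.argv[1], sys.argv[2]
    stamp = time.strftime('%Y-%m-%d %H:%M:%S')
    # Append a record of the breach; a real action might also gather diagnostics here
    with open('/tmp/health_action.log', 'a') as log:
        log.write('%s health condition breached on %s/%s\n' % (stamp, node, server))
    return 0

if __name__ == '__main__':
    sys.exit(main())
```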
Health policy targets
Health policy targets can be a single server, each of the servers in a cluster or dynamic cluster, the on demand router, or each of the servers in a cell. We can define multiple health policies to monitor the same set of servers. If we are using predefined health conditions, the support varies depending on the server type. Certain middleware servers do not support all of the policy types. The following table summarizes the health policy support, by server type:
| Predefined health policy | Servers that run in the same Intelligent Management cell | Other middleware servers (including external WebSphere application servers) |
| --- | --- | --- |
| Age-based policy | Supported | Supported |
| Workload policy | Supported | Supported |
| Memory leak detection | Supported | Not supported |
| Excessive memory usage | Supported | Supported for WAS Community Edition servers; not supported for other middleware server types |
| Excessive request timeout | Supported | Supported for other middleware servers to which the ODR routes requests |
| Excessive response time | Supported | Supported |
| Storm drain detection | Supported | Supported |
| Garbage collection percentage | Supported | Not supported |
Default health policies
Create default health policies using predefined health conditions installed with the product.
To create a default health policy, click...
Operational policies | Health policies | New
Select one of the predefined health conditions.
Because the default health policies monitor each server in supervised mode, we can use these policies to help prevent health problems. In addition to the default policies, we can define policies with more detailed settings, or with automatic mode operation, for particular servers or collections of servers. The following list shows the default cell-wide health policies that we can create using the predefined health conditions:
- Default memory leak
- Default standard detection level. The default memory leak health policy uses the performance advisor function. The performance advisor is enabled when this policy is enabled. To disable the performance advisor, remove this health policy or narrow the membership of the health policy. To preserve the health policy for future use, keep the default memory leak policy, but remove all members. We can edit the health policy memberships by adding and removing members from the policy. To change the members, click...
Operational policies | Health policies | Default_Memory_Leak
- Default excessive memory usage
- Set to 95 percent of the JVM heap size for 15 minutes.
- Default excessive request timeout
- Set for 5 percent of the requests timing out.
- Default excessive response time
- Set to 120 seconds.
- Default storm drain
- Default standard detection level.
- Garbage collection percentage
- Set to 10 percent. The default sampling time is 2 minutes.
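To see which health policies, default or custom, already exist in the cell, a wsadmin Jython sketch like the following can be used; the configuration type name HealthPolicy and the name attribute are assumptions, so verify them with AdminConfig.types() before relying on the script.

```
# Run with: wsadmin -lang jython -f list_health_policies.py
# Assumption: health policies are stored as configuration objects of type 'HealthPolicy'.
policies = AdminConfig.list('HealthPolicy').splitlines()
if not policies:
    print 'No health policies are defined in this cell.'
for p in policies:
    # Assumed attribute: name
    print AdminConfig.showAttribute(p, 'name')
```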
To view the recommendations made by default health policies and to take actions on these recommendations, click...
System administration > Task management > Runtime tasks.
Related:
- Excessive request timeout health policy target timeout value
- Configure health management
- Create health policies
- Set maintenance mode
- Create health policy custom actions
- Manage runtime tasks