Introduction and considerations

IBM HACMP for AIX is clustering software for building a highly available computing environment. It detects failures in hardware, software, and networks, and automatically fails over applications and resources from one system to another after such a failure. By removing single points of failure, it reduces downtime.

We fully tested the WebSphere V5.1 end-to-end high availability solution shown in Figure 12-1. The tests used HACMP 4.5 and 4.4.x with DB2 7.2.1, Oracle 8i, and Oracle 9i for the administrative repository, enterprise application data, persistent session data, log files, JMS providers, and LDAP data. We also tested HACMP/ES for these HA solutions. The major benefits of HACMP/ES arise from the technologies that underlie its heartbeat mechanism, supplied by Reliable Scalable Cluster Technology (RSCT), and from its application monitor.

The unit of failover in HACMP or HACMP/ES is a resource group. A resource group must contain all the processes and resources needed to deliver a highly available service, and ideally only those processes and resources. HACMP for AIX software supports up to 20 resource groups per cluster. Three types of resource groups are supported:
Cascading, where a resource may be taken over by one or more nodes in a resource chain according to the takeover priority assigned to each node. The available node with the highest priority within the cluster owns the resource. You can also set a flag so that a cascading resource does not fall back to a higher priority owner node when that node reintegrates with the cluster.
Rotating, where a resource is associated with a group of nodes and rotates among these nodes. When a node fails, the next available node on its boot address and listed first in the resource chain acquires the resource group.
Concurrent access, where a resource that can be managed by the HACMP for AIX cluster lock manager may be shared simultaneously among multiple applications residing on different nodes.
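
The cascading policy described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HACMP code: the class CascadingGroup, its fields, and the node names are hypothetical, and a real cluster applies this policy through its HACMP configuration rather than through application code. The sketch only shows how the takeover priority and the optional no-fallback flag decide which node owns the group.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class CascadingGroup:
    """Toy model of a cascading resource group (hypothetical, not an HACMP API)."""
    name: str
    chain: List[str]              # node names, highest takeover priority first
    no_fallback: bool = False     # if True, do not move back to a rejoining node
    owner: Optional[str] = None   # node that currently owns the group

    def update_owner(self, active_nodes: Set[str]) -> Optional[str]:
        """Recompute the owner after a node joins or leaves the cluster."""
        candidates = [n for n in self.chain if n in active_nodes]
        if not candidates:
            self.owner = None            # no active node can host the group
        elif self.no_fallback and self.owner in candidates:
            pass                         # current owner is still up: stay put
        else:
            self.owner = candidates[0]   # highest-priority active node
        return self.owner


if __name__ == "__main__":
    rg = CascadingGroup("db_rg", ["nodeA", "nodeB"])
    print(rg.update_owner({"nodeA", "nodeB"}))  # nodeA
    print(rg.update_owner({"nodeB"}))           # nodeB (nodeA failed)
    print(rg.update_owner({"nodeA", "nodeB"}))  # nodeA (falls back)
```

With no_fallback=True, the third call would leave the group on nodeB, which corresponds to the flag that prevents a cascading resource from falling back when the higher priority node reintegrates.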

The cascading resources type provides the best performance, because it ensures that an application is owned by a particular node whenever that node is active in the cluster. This ownership allows the node to cache data the application uses frequently. Rotating resources may minimize the downtime for failover.

An HACMP cluster has public and private networks:
A public network connects multiple nodes and allows clients to access these nodes.
A private network is a point-to-point connection that links two or more nodes directly.

A network adapter (interface) connects a node to a network. A node typically is configured with at least two network interfaces for each network to which it connects:
A service interface that handles cluster traffic
One or more standby interfaces

The maximum number of network interfaces per node is 24. Using a standby adapter eliminates a network adapter as a single point of failure. Place standby adapters on a separate subnet.

IP address takeover is an AIX facility that allows one node to acquire the network address of another node in the cluster. To enable IP address takeover, a boot address must be assigned to the service adapter on each cluster node. Nodes use the boot address after a system reboot and before the HACMP for AIX software is started. When the HACMP for AIX software is started on a node, the node's service adapter is reconfigured to use the service address instead of the boot address. If the node fails, a takeover node acquires the failed node's service address on its standby adapter, making the failure transparent to clients using that service address.
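
The address movement in IP address takeover can be modeled as a small state change. The following Python sketch is illustrative only and makes several assumptions: the Node class, the take_over and locate helpers, and all addresses (taken from the reserved documentation ranges) are hypothetical, and nothing here calls HACMP or AIX. It shows the three steps from the text: the service adapter starts on the boot address, switches to the service address when cluster services start, and the survivor's standby adapter picks up the failed node's service address.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy model of a cluster node's adapters (hypothetical, not an HACMP API)."""
    name: str
    boot_addr: str      # address on the service adapter before HACMP starts
    service_addr: str   # address that clients connect to
    standby_addr: str   # address of the standby adapter (separate subnet)
    up: bool = True

    def __post_init__(self):
        # After a reboot, the service adapter carries the boot address.
        self.service_adapter = self.boot_addr
        self.standby_adapter = self.standby_addr

    def start_cluster_services(self):
        # Starting cluster services swaps the boot address for the service address.
        self.service_adapter = self.service_addr


def take_over(failed, survivor):
    """Move the failed node's service address to the survivor's standby adapter."""
    failed.up = False
    failed.service_adapter = None
    survivor.standby_adapter = failed.service_addr


def locate(address, nodes):
    """Return the name of the live node currently answering on the address."""
    for node in nodes:
        if node.up and address in (node.service_adapter, node.standby_adapter):
            return node.name
    return None


if __name__ == "__main__":
    a = Node("nodeA", "192.0.2.11", "192.0.2.1", "198.51.100.11")
    b = Node("nodeB", "192.0.2.12", "192.0.2.2", "198.51.100.12")
    a.start_cluster_services()
    b.start_cluster_services()
    print(locate("192.0.2.1", [a, b]))  # nodeA
    take_over(a, b)
    print(locate("192.0.2.1", [a, b]))  # nodeB
```

Clients keep using the same service address throughout; only the node behind it changes, which is why the failure appears transparent to them.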

You can also set up a two-node mutual takeover configuration, where both nodes run different instances of the same application (application, administrative, session, and LDAP databases, for example) and stand by for one another. The takeover node must know the location of specific control files and must be able to access them to perform startup after a failover. It is good practice to put the control files into the local file system of each node. Lay out the application and its data so that only the data resides on shared external disks. This arrangement can prevent software license violations, and it also simplifies failure recovery.

If one node is not available because of a software or hardware upgrade, application failure, or hardware failure, the database server on the failed node is automatically restarted on another node that has access to the same disks that contain the databases. The database on the failing node may be in a transactionally inconsistent state. When the database starts up on the surviving machine, it must go through a crash recovery phase to bring the database back to a consistent state. To maintain data integrity and transactional consistency, the database will not be completely available until the crash recovery has completed.

Failover is the process of moving cluster resources from an unavailable node to an available node. A related process, fallback (or fail back), occurs when resources are restored to the failed node after it is back online. Although you can set up automatic fallback once the primary node is available, we suggest that you use manual fallback to minimize the WebSphere service interruption.

In our WebSphere high availability solution, we eliminated single points of failure from the front to the back: network, HA firewalls, HA load balancers, multiple HTTP servers, HTTP plug-in redirection, multiple Web containers, multiple EJB servers, HA LDAP servers and database, HA administrative repository, HA enterprise application data, and HA persistent session data. We integrated, configured, and tested the WebSphere end-to-end high availability system shown in Figure 12-1. It includes HA firewalls for the DMZ, HA WebSphere Edge Components' Load Balancer with a hot standby, multiple HTTP servers with the HTTP plug-in directing requests to Web containers, multiple appservers, HA LDAP servers and database, and finally HA databases with a hot standby database node.

Figure 12-1 WebSphere and HACMP test environment

We tested four variations of this topology:
A single WebSphere administrative domain (cell).
Two WebSphere administrative domains (cells), with one appserver and its cluster members on each node, for each domain (cell).
Two WebSphere administrative domains (cells), but one for EJB servers and the other for servlet servers.
Four WebSphere administrative domains (cells), two for EJB servers and two for servlet servers (for this test, we used two more machines).

We used Mercury LoadRunner to simulate customer site conditions and to drive heavy load in two-day tests, and we initiated failures in all components until we achieved satisfactory availability results. We tested administrative actions, data source lookup, servlets, JSPs, session manager, session EJBs, CMP EJBs, and BMP EJBs individually for each failure point in each component. We also tested comprehensive benchmark applications such as Trade3, Big3, and others.

We initiated temporary failures in the network, Load Balancers, HTTP servers, appservers (both within a node and in different nodes), administrative servers (Node Agent and Deployment Manager), and databases. We determined that failed client requests were a very small percentage of total client requests; the exact percentage depends on how often failures are initiated. In WebSphere V5, the percentage of failed client requests has been significantly reduced by HA enhancements in many components such as WLM, session management, naming, and system management. Because the previous chapters discussed the WebSphere configuration, we describe here the integration of hardware clustering and WebSphere clustering, and focus on HA data management and its impact on the whole WebSphere system.

WebSphere, IBM, and AIX are trademarks of the IBM Corporation in the United States, other countries, or both.