1.2.4 Possible single points of failure in the WebSphere system

Table 1-2 lists potential single points of failure in the WebSphere system and possible solutions.


Failure point	Possible solutions
Client access	Multiple ISPs.
Firewalls	Firewall clustering, firewall sprayer, HA firewall.
Caching Proxy	Backup Caching Proxy system.
HTTP sprayer (such as WebSphere Edge Components' Load Balancer)	The vendor's HA solution, for example a backup Load Balancer server.
Web server	Multiple Web servers with network sprayer, hardware-based clustering.
WebSphere master repository data, log files	HA shared file system, Network File System (NFS), hardware-based clustering.
WAS	WAS ND appserver clustering: horizontal, vertical, or a combination of both. Additionally for EJBs: backup cluster.
WebSphere Node Agent	Multiple Node Agents in the cluster, OS service, hardware-based clustering. The Node Agent is not considered a single point of failure in WebSphere V6. The Node Agent must be running when the appserver on that node is started so that the appserver can register with the Location Service Daemon (LSD). In WebSphere V6, the LSD is HAManager enabled, so only one running Node Agent is needed in the cluster to provide the LSD when the appservers are started on the node. The Node Agent must also be running when you change security-related configuration; otherwise you might no longer be able to synchronize with the Deployment Manager later. Refer to Chapter 3, WebSphere administrative process failures, for more information.
WebSphere Deployment Manager	OS service, hardware-based clustering, backup WebSphere cell. The Deployment Manager is not considered a single point of failure in WebSphere V6. You need it to configure your WebSphere environment, to monitor performance using the Tivoli Performance Viewer, or to use backup cluster support. Unless these functions are needed, you can run a production environment without an active Deployment Manager. Refer to Chapter 3, WebSphere administrative process failures, for more information.
Entity EJBs, application DB	HA DBs, parallel DBs. Make sure your application catches StaleConnectionException and retries. See 15.4, Database server, for more information.
Default messaging provider	WebSphere appserver clustering: HAManager provides failover.
Default messaging provider data store	Clustering, data replication, parallel database.
Application database	Clustering, data replication, parallel database.
Session database	Memory-to-memory replication, DB clustering, parallel database.
Transaction logs	WebSphere appserver clustering: HAManager provides failover, shared file system with horizontal clustering.
WebSphere MQ	WebSphere MQ cluster, combination of WebSphere MQ cluster and clustering.
LDAP	Master-replica, sprayer, HA LDAP (clustering).
Internal network	Dual internal networks.
Hubs	Multiple interconnected network paths.
Disk failures, disk bus failure, disk controller failure	Disk mirroring, RAID-5, multiple buses, multiple disk controllers.
Network service failures (DNS, ARP, DHCP, and so forth)	Multiple network services.
OS or other software crashes	Clustering, switch automatically to a healthy node.
Host dies	WebSphere appserver clustering, hardware-based clustering: automatically switch to a healthy node.
Power outages	UPS, dual-power systems.
Room/floor disaster (fire, flood, and so forth)	Systems in different rooms/different floors.
Building disasters (fire, flood, tornado, and so forth)	Systems in different buildings.
City disasters (earthquake, flood, and so forth)	Remote mirror, replication, geographical clustering.
Region disasters	Place two data centers far apart, with geographical clustering or remote mirroring.
Human error	Train people, simplify system management, use clustering and redundant hardware/software.
Software and hardware upgrades	Rolling upgrades with clustering or WLM for 7x24x365, planned maintenance for others.
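
The table's advice to catch StaleConnectionException and retry can be sketched as follows. This is a minimal illustration, not the product's prescribed pattern: the local StaleConnectionException class is a hypothetical stand-in so that the example compiles outside WebSphere (in a WebSphere application you would catch com.ibm.websphere.ce.cm.StaleConnectionException instead), and the names SqlWork, StaleConnectionRetry, and withRetry are assumptions made for this example. The unit of work passed in is assumed to obtain a fresh connection from the data source on every attempt and to repeat the whole transaction.

```java
import java.sql.SQLException;

// Hypothetical stand-in so this sketch compiles without WebSphere libraries.
// In a WebSphere application, catch
// com.ibm.websphere.ce.cm.StaleConnectionException instead.
class StaleConnectionException extends SQLException {
    StaleConnectionException(String message) {
        super(message);
    }
}

// Hypothetical functional interface for one complete unit of database work.
// Each call is assumed to get a new connection and redo the full transaction.
interface SqlWork<T> {
    T run() throws SQLException;
}

class StaleConnectionRetry {
    // Runs the work once and, if the connection turns out to be stale,
    // repeats it up to maxRetries more times. Any other SQLException,
    // or a stale connection after the last retry, is passed to the caller.
    static <T> T withRetry(SqlWork<T> work, int maxRetries) throws SQLException {
        int retries = 0;
        while (true) {
            try {
                return work.run();
            } catch (StaleConnectionException e) {
                if (retries >= maxRetries) {
                    throw e; // retry budget used up, give up
                }
                retries++;
            }
        }
    }
}
```

Keeping the retry count bounded matters here: if the database is down for longer than a failover takes, the application should report the failure rather than loop indefinitely.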
