Recovering Failed Servers
Contents
- WebLogic Server Failure Recovery Features
- Backing Up Configuration and Security Data
- Restarting Failed Server Instances
Overview
A variety of events can lead to the failure of a server instance. Often one failure condition leads to another. Loss of power, hardware malfunction, operating system crashes, network partitions, and unexpected application behavior can all contribute to the failure of a server instance.
Depending on availability requirements, you may implement a clustered architecture to minimize the impact of failure events. However, even in a clustered environment, server instances may fail periodically, and it is important to be prepared for the recovery process.
WebLogic Server Failure Recovery Features
Automatic Restart for Managed Servers
Selected subsystems within each WebLogic Server instance monitor their health status. For example, the JMS subsystem monitors the condition of the JMS thread pool while the core server subsystem monitors default and user-defined execute queue statistics. If an individual subsystem determines that it can no longer operate in a consistent and reliable manner, it registers its health state as "failed" with the host server.
Each WLS instance, in turn, checks the health state of its registered subsystems to determine its overall viability. If one or more of its critical subsystems have reached the FAILED state, the server instance marks its own health state FAILED to indicate that it cannot reliably host an application.
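The aggregation rule described above can be sketched in a few lines. This is only an illustration of the logic, not WLS code: the subsystem names, the "OK" state, and the is_critical flag are assumptions made for the example; only the FAILED state name comes from the text.

```python
FAILED = "FAILED"  # state name taken from the text above


def server_health(subsystems):
    """Return FAILED if any critical subsystem reports FAILED, else "OK".

    `subsystems` maps a subsystem name to a (state, is_critical) pair.
    The "OK" state and the is_critical flag are illustrative assumptions.
    """
    for state, is_critical in subsystems.values():
        if is_critical and state == FAILED:
            return FAILED
    return "OK"
```

For example, server_health({"JMS": ("FAILED", True)}) yields FAILED, while a FAILED subsystem that is not marked critical leaves the server state unchanged.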
When used in combination with Node Manager, server self-health monitoring enables you to automatically reboot servers that have failed. This improves the overall reliability of a domain, and requires no direct intervention from an administrator.
To configure Node Manager and automatic restart behaviors, see Configuring Node Manager.
Managed Server Independence Mode
When a Managed Server starts, it tries to contact the Administration Server to retrieve its configuration information. If a Managed Server cannot connect to the Administration Server during startup, it can retrieve its configuration by reading configuration and security files directly. A Managed Server that starts in this way is running in Managed Server Independence (MSI) mode. By default, MSI mode is enabled. For information about disabling MSI mode, see "Disabling Managed Server Independence" in Administration Console Online Help.
In Managed Server Independence mode, a Managed Server looks in its root directory for the following files:
- msi-config.xml - a replica of the domain's config.xml. (Even if the domain's configuration file is named something other than config.xml, a Managed Server in MSI mode always looks for a configuration file named msi-config.xml.)
- SerializedSystemIni.dat
- boot.properties - an optional file that contains an encrypted version of your username and password. For more information, see "Boot Identity Files" in Administration Console Online Help.
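Before attempting an MSI-mode startup, it can help to confirm that the files listed above are present. The following is a minimal sketch, assuming you run it on the Managed Server's host; the function name is hypothetical and is not part of any WLS tool.

```python
from pathlib import Path

# File names are taken from the list above; boot.properties is optional.
REQUIRED_MSI_FILES = ("msi-config.xml", "SerializedSystemIni.dat")
OPTIONAL_MSI_FILES = ("boot.properties",)


def missing_msi_files(root_dir):
    """Return the required MSI files that are absent from root_dir."""
    root = Path(root_dir)
    return [name for name in REQUIRED_MSI_FILES if not (root / name).is_file()]
```

An empty list means both required files were found in the directory you passed in.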
MSI Mode and the Managed Server's Root Directory
By default, a server instance assumes that its root directory is the directory from which it was started. For more information about a server's root directory, refer to A Server's Root Directory.
If you enable replication of configuration data, as described in Backing Up Security Data, and if you have started the Managed Server at least once while the Administration Server was running, msi-config.xml and SerializedSystemIni.dat will already be in the server's root directory. The boot.properties file is not replicated. If it is not already in the Managed Server's root directory, create one. For more information, see "Boot Identity Files" in Administration Console Online Help.
If msi-config.xml and SerializedSystemIni.dat are not in the root directory, you can either:
- Copy config.xml and SerializedSystemIni.dat from the Administration Server's root directory (or from a backup) to the Managed Server's root directory. Then, rename the configuration file to msi-config.xml, or
- Use the -Dweblogic.RootDirectory=path startup option to specify a directory that already contains these files.
MSI Mode and the Security Realm
A Managed Server must have access to a security realm to complete its startup process.
If you use the security realm that WLS installs, then the Administration Server maintains an LDAP server to store the domain's security data. All Managed Servers replicate this LDAP server. If the Administration Server fails, Managed Servers running in MSI mode use the replicated LDAP server for security services.
If you use a third party security provider, then the Managed Server must be able to access the security data before it can complete its startup process.
MSI Mode and SSL
If you set up SSL for your servers, each server requires its own set of certificate files, key files, and other SSL-related files. Managed Servers do not retrieve SSL-related files from the Administration Server (though the domain's configuration file does store the pathnames to those files for each server). Starting in MSI Mode does not require you to copy or move the SSL-related files unless they are located on a machine that is inaccessible.
MSI Mode and Deployment
A Managed Server that starts in MSI mode deploys its applications from its staging directory: serverroot/stage/appName.
MSI Mode and Managed Server Configuration Changes
If you start a Managed Server in MSI mode, you cannot change its configuration until it restores communication with the Administration Server.
MSI Mode and Node Manager
You cannot use Node Manager to start a server instance in MSI mode, only to restart it. For a routine startup, Node Manager requires access to the Administration Server. If the Administration Server is unavailable, log on to the Managed Server's host machine to start the Managed Server.
MSI Mode and Configuration File Replication
Managed Server Independence mode includes an option that copies the required configuration files into the Managed Server's root directory every 5 minutes. This option does not replicate a boot identity file. (For more information about boot identity files, see "Boot Identity Files" in Administration Console Online Help.)
By default, a Managed Server does not replicate these files. Depending on your backup schemes and the frequency with which you update your domain's configuration, this option might not be worth the performance cost of copying potentially large files across a network.
To enable a Managed Server to replicate the domain's configuration files, see "Replicating a Domain's Configuration Files for Managed Server Independence" in Administration Console Online Help.
MSI Mode and Restored Communication with an Administration Server
When the Administration Server starts, it can detect the presence of running Managed Servers (if -Dweblogic.management.discover=true, which is the default setting for this property).
Upon startup, the Administration Server looks at a persisted copy of the file running-managed-servers.xml and notifies all the Managed Servers listed in the file of its presence.
Managed Servers that were started in Managed Server Independence Mode while the Administration Server was unavailable will not appear in running-managed-servers.xml. To re-establish a connection between the Administration Server and such Managed Servers, use the weblogic.Admin DISCOVERMANAGEDSERVER command. See "DISCOVERMANAGEDSERVER" in WLS Command Reference.
When an Administration Server starts up and contacts a Managed Server running in MSI mode, the Managed Server deactivates MSI mode and registers itself to the Administration Server for future configuration change notifications.
Backing Up Configuration and Security Data
Recovery from the failure of a server instance requires access to the domain's configuration and security data. This section describes file backups that WLS performs automatically, and recommended backup procedures that an administrator should perform.
Backing up config.xml
By default, an Administration Server stores a domain's configuration data in a file called domain_name\config.xml, where domain_name is the root directory of the domain.
Back up config.xml to a secure location in case a failure of the Administration Server renders the original copy unavailable. If an Administration Server fails, you can copy the backup version to a different machine and restart the Administration Server on the new machine.
WLS Archives Previous Versions of config.xml
By default, the Administration Server archives up to 5 previous versions of config.xml in the domain-name\configArchive directory.
When you save a change to a domain's configuration, the Administration Server saves the previous configuration in domain-name\configArchive\config.xml#n. Each time the Administration Server saves a file in the configArchive directory, it increments the value of the #n suffix, up to a configurable number of copies - 5 by default. Thereafter, each time you change the domain configuration:
- The archived files are rotated so that the newest file has a suffix with the highest number,
- The previous archived files are renamed with a lower number, and
- The oldest file is deleted.
Example of Archived config.xml Naming and Rotation
In the MedRec domain, the current configuration file used by the MedRecServer is WL_HOME\samples\domains\medrec\config.xml. If you add a server instance using the Administration Console, when you click the Create button, MedRecServer saves the old config.xml file as WL_HOME\samples\domains\medrec\configArchive\config.xml#2.
The new file, WL_HOME\samples\domains\medrec\config.xml, represents the MedRec domain with the new server instance. The previous file, WL_HOME\samples\domains\medrec\configArchive\config.xml#2, contains the MedRec domain configuration as it was prior to creation of the new server instance.
The next time you change the configuration, MedRecServer saves the current config.xml file as config.xml#3. After four changes to the domain, the configArchive directory contains four files: config.xml#2, config.xml#3, config.xml#4, config.xml#5. The next time you change the configuration, MedRecServer saves the old config.xml as config.xml#5. The previous config.xml#5 is renamed as config.xml#4, and so on. The old config.xml#2 is deleted.
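The naming and rotation in this example can be modeled as a small simulation. This sketch only mirrors the MedRec walkthrough above (suffixes start at #2 and stop at #5); it does not touch any files, and the function and parameter names are hypothetical.

```python
def archive_config(archive, current, max_suffix=5, first_suffix=2):
    """Return a new archive (suffix -> contents) after saving `current`.

    Models the example above: suffixes grow from #2 up to #max_suffix;
    after that the oldest entry is dropped, the rest shift down by one,
    and the newest copy always takes the highest suffix.
    """
    archive = dict(archive)
    next_suffix = max(archive, default=first_suffix - 1) + 1
    if next_suffix <= max_suffix:
        archive[next_suffix] = current
        return archive
    # Archive is full: discard the lowest suffix and renumber the others.
    rotated = {n - 1: archive[n] for n in sorted(archive) if n > first_suffix}
    rotated[max_suffix] = current
    return rotated
```

Starting from an empty archive and saving four successive configurations fills suffixes 2 through 5; the fifth save drops the original #2 and stores the newest copy as #5.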
Configuring the Number of Archived config.xml Versions
To configure how many previous versions of the domain configuration are archived:
- In the left pane of the Administration Console, click on the name of the domain.
- In the right pane, click the Configuration->General tab.
- In the Advanced Options bar, click Show.
- In the Archive Configuration Count box, enter the number of versions to save.
- Click Apply.
WLS Archives config.xml during Server Startup
In addition to the files in domain-name\configArchive, the Administration Server creates two other files that back up the domain's configuration at key points during the startup process:
- domain-name\config-file.xml.original - The configuration file just before the Administration Server parses it and adds subsystem data.
- domain-name\config-file.xml.booted - The configuration file just after the Administration Server successfully boots. If the config.xml becomes corrupted, you can boot the Administration Server with this file.
Example of Archives of config.xml During Startup
If your domain configuration is stored in config.xml, when you start the domain's Administration Server, the Administration Server:
- Copies config.xml to config.xml.original.
- Parses config.xml. Depending on the domain configuration, some WebLogic subsystems add configuration information to config.xml. For example, the Security service adds MBeans and encrypted data for SSL communication.
- Copies the parsed and modified config.xml to config.xml.booted.
The Administration Server uses the parsed and modified config.xml. When you update the domain's configuration, it copies the old config.xml to domain-name\configArchive\config.xml#2.
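If the configuration file becomes corrupted, one way to boot from the .booted copy is to put it back in place of the damaged file before starting the Administration Server. The following is a minimal sketch under that assumption; the function name and the .corrupt suffix used to preserve the damaged file are hypothetical, and you should run it only while the Administration Server is stopped.

```python
import shutil
from pathlib import Path


def restore_from_booted(domain_root, config_name="config.xml"):
    """Replace the configuration file with its .booted copy.

    Keeps the damaged file as <config_name>.corrupt (a hypothetical
    name chosen for this sketch) so nothing is lost.
    """
    root = Path(domain_root)
    config = root / config_name
    booted = root / (config_name + ".booted")
    if not booted.is_file():
        raise FileNotFoundError(booted)
    if config.exists():
        shutil.copy2(config, root / (config_name + ".corrupt"))
    shutil.copy2(booted, config)
    return config
```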
Backing Up Security Data
The WebLogic Security service stores its configuration data in the config.xml file, and also in an LDAP repository and other files.
Backing Up the WebLogic LDAP Repository
The default Authentication, Authorization, Role Mapper, and Credential Mapper providers that are installed with WLS store their data in an LDAP server. Each WLS instance contains an embedded LDAP server. The Administration Server contains the master LDAP server, which is replicated on all Managed Servers. If any of your security realms use these installed providers, you should maintain an up-to-date backup of the following directory tree:
domain_name\adminServer\ldap
where domain_name is the domain's root directory and adminServer is the directory in which the Administration Server stores runtime and security data.
Each WebLogic Server instance has an LDAP directory, but you only need to back up the LDAP data on the Administration Server: the master LDAP server replicates the LDAP data to each Managed Server when updates to security data are made. WebLogic security providers cannot modify security data while the domain's Administration Server is unavailable. The LDAP repositories on Managed Servers are replicas and cannot be modified.
The ldap/ldapfiles subdirectory contains the data files for the LDAP server. The files in this directory contain user, group, group membership, policies, and role information. Other subdirectories under the ldap directory contain LDAP server message logs and data about replicated LDAP servers.
Do not update the configuration of a security provider while a backup of LDAP data is in progress. If a change is made - for instance, if an administrator adds a user - while you are backing up the ldap directory tree, the backups in the ldapfiles subdirectory could become inconsistent. If this does occur, consistent, but potentially out-of-date, LDAP backups are available, as described in WLS Backs Up LDAP Files.
WLS Backs Up LDAP Files
Once a day, a server suspends write operations and creates its own backup of the LDAP data. It archives this backup in a ZIP file below the ldap\backup directory and then resumes write operations. This backup is guaranteed to be consistent, but it might not contain the latest security data.
For information about configuring the LDAP backup, see "Configuring Backups for the Embedded LDAP Server" in Administration Console Online Help.
Backing Up SerializedSystemIni.dat and Security Certificates
All servers create a file named SerializedSystemIni.dat and locate it in the server's root directory. This file contains encrypted security data that must be present to boot the server. You must back up this file.
If you configured a server to use SSL, also back up the security certificates and keys. The location of these files is user-configurable.
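The backups recommended in this section can be gathered by a small script. The following is a minimal sketch, assuming the Administration Server's root directory is the domain root and that the configuration file is named config.xml; the function name and arguments are hypothetical. It does not copy SSL certificates or keys, because their location is user-configurable.

```python
import shutil
from pathlib import Path


def back_up_domain(domain_root, admin_server_dir, backup_dir):
    """Copy config.xml, SerializedSystemIni.dat and the ldap tree.

    Returns the names of the items that were found and copied.
    """
    root = Path(domain_root)
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in ("config.xml", "SerializedSystemIni.dat"):
        src = root / name
        if src.is_file():
            shutil.copy2(src, dest / name)
            copied.append(name)
    # domain_name\adminServer\ldap, as described above
    ldap = root / admin_server_dir / "ldap"
    if ldap.is_dir():
        shutil.copytree(ldap, dest / "ldap", dirs_exist_ok=True)  # Python 3.8+
        copied.append("ldap")
    return copied
```

As noted above, avoid changing security provider configuration while the ldap tree is being copied, or the copied ldapfiles data could be inconsistent.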
Restarting Failed Server Instances
The nature of your applications and user demand determine the steps you take to restore application service. In particular, these factors influence the recovery process:
- Was the failed server instance an Administration Server or a Managed Server?
- Can you restart the failed server instance on the same machine upon which it was running when it failed?
- What are the network conditions when you restart the server instance? Can the server instance you are restarting establish communications with its Administration Server?
- Was the server instance that failed the active host for a migratable service in a WLS cluster?
- Were any changes made to the domain configuration while the failed server instance was down?
- Was the domain configuration corrupted?
Restarting an Administration Server
The following sections describe how to start an Administration Server after a failure.
Restarting an Administration Server When Managed Servers Not Running
If no Managed Servers in the domain are running when you restart a failed Administration Server, no special steps are required. Start the Administration Server as you normally do. See "Starting and Stopping Servers" in Administration Console Online Help.
Restarting an Administration Server When Managed Servers Are Running
If the Administration Server shuts down while Managed Servers continue to run, you do not need to restart the Managed Servers that are already running in order to recover management of the domain. The procedure for recovering management of an active domain depends upon whether you can restart the Administration Server on the same machine it was running on when the domain was started.
Restarting an Administration Server on the Same Machine
If you restart the WebLogic Administration Server while Managed Servers continue to run, by default the Administration Server can discover the presence of the running Managed Servers.
Note: Make sure that the startup command or startup script does not include -Dweblogic.management.discover=false, which prevents an Administration Server from discovering its running Managed Servers. For more information about -Dweblogic.management.discover, see "Server Communication" in weblogic.Server Command-Line Reference.
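A quick way to apply this check is to scan the arguments of the startup command for the property. The following is a minimal sketch; the function name is hypothetical, and the check is deliberately conservative in that it reports a problem if the property is set to false anywhere in the argument list.

```python
DISCOVER_PROPERTY = "-Dweblogic.management.discover"


def discovery_disabled(command_args):
    """Return True if any argument sets the discover property to false."""
    for arg in command_args:
        name, sep, value = arg.partition("=")
        if name == DISCOVER_PROPERTY and sep and value.strip().lower() == "false":
            return True
    return False
```

For example, passing the list ["java", "-Dweblogic.management.discover=false", "weblogic.Server"] returns True, which signals that the option should be removed before restarting the Administration Server.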
The root directory for the domain contains a file running-managed-servers.xml which contains a list of the Managed Servers in the domain and whether they are running or not. When the Administration Server restarts, it checks this file to determine which Managed Servers were under its control before it stopped running.
When a Managed Server is gracefully or forcefully shut down, its status in running-managed-servers.xml is updated to "not-running". When an Administration Server restarts, it does not try to discover Managed Servers with the "not-running" status. A Managed Server that stops running because of a system crash, or that was stopped by killing the JVM or the command prompt (shell) in which it was running, still has the status "running" in running-managed-servers.xml. The Administration Server attempts to discover such servers, and throws an exception when it determines that a Managed Server is no longer running.
Restarting the Administration Server does not cause Managed Servers to update the configuration of static attributes. Static attributes are those that a server refers to only during its startup process. Server instances must be restarted to take account of changes to static configuration attributes. Discovery of the Managed Servers only enables the Administration Server to monitor the Managed Servers or make runtime changes in attributes that can be configured while a server is running (dynamic attributes).
Restarting an Administration Server on Another Machine
If a machine crash prevents you from restarting the Administration Server on the same machine, you can recover management of the running Managed Servers as follows:
- Install the WLS software on the new administration machine (if this has not already been done).
- Make your application files available to the new Administration Server by copying them from backups or by using a shared disk. Your application files should be available in the same relative location on the new file system as on the file system of the original Administration Server.
- Make your configuration and security data available to the new administration machine by copying them from backups or by using a shared disk. For more information, refer to Backing Up Configuration and Security Data.
- Restart the Administration Server on the new machine.
Make sure that the startup command or startup script does not include -Dweblogic.management.discover=false, which prevents an Administration Server from discovering its running Managed Servers. For more information about -Dweblogic.management.discover, see "Server Communication" in weblogic.Server Command-Line Reference.
When the Administration Server starts, it communicates with the Managed Servers and informs them that the Administration Server is now running on a different IP address.
Restarting Managed Servers
The following sections describe how to start Managed Servers after failure. For recovery considerations related to transactions and JMS, see Additional Failure Topics.
Starting a Managed Server When the Administration Server Is Accessible
If the Administration Server is reachable by the Managed Server that failed, you can:
- Restart it manually or automatically using Node Manager - You must configure Node Manager and the Managed Server to support this behavior. For details, see Configure Monitoring, Shutdown, and Restart for Managed Servers.
- Start it manually with a command or script - For instructions, see "Starting and Stopping Servers" in Administration Console Online Help.
Starting a Managed Server When the Administration Server Is Not Accessible
If a Managed Server cannot connect to the Administration Server during startup, it can retrieve its configuration by reading locally cached configuration data. A Managed Server that starts in this way is running in Managed Server Independence (MSI) mode. For a description of MSI mode, and the files that a Managed Server must access to start up in MSI mode, see Managed Server Independence Mode.
Note: If the Managed Server that failed was a clustered Managed Server that was the active server for a migratable service at the time of failure, perform the steps described in "Migrating When the Currently Active Host is Unavailable" in Using WLS Clusters. Do not start the Managed Server in MSI mode.
To start up a Managed Server in MSI mode:
- Ensure that the following files are available in the Managed Server's root directory:
- msi-config.xml
- SerializedSystemIni.dat
- boot.properties
If these files are not in the Managed Server's root directory:
- Copy the config.xml and SerializedSystemIni.dat files from the Administration Server's root directory (or from a backup) to the Managed Server's root directory.
- Rename the configuration file to msi-config.xml. When you start the server, it will use the copied configuration files.
Note: Alternatively, you can use the -Dweblogic.RootDirectory=path startup option to specify a root directory that already contains these files.
- Start the Managed Server at the command line or using a script.
The Managed Server will run in MSI mode until it is contacted by its Administration Server. For information about restarting the Administration Server in this scenario, see Restarting an Administration Server When Managed Servers Are Running.
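The file-preparation part of the procedure above (copying the two files and renaming the configuration file) can be sketched as follows. This is a minimal illustration, assuming the source directory holds files named config.xml and SerializedSystemIni.dat; the function name is hypothetical. It does not create boot.properties, which must be created separately if you want one.

```python
import shutil
from pathlib import Path


def prepare_msi_root(source_root, managed_root):
    """Copy the files a Managed Server needs to start in MSI mode.

    source_root is the Administration Server's root directory or a
    backup of it. config.xml is copied under the name msi-config.xml.
    """
    src, dst = Path(source_root), Path(managed_root)
    dst.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src / "config.xml", dst / "msi-config.xml")
    shutil.copy2(src / "SerializedSystemIni.dat", dst / "SerializedSystemIni.dat")
    return ["msi-config.xml", "SerializedSystemIni.dat"]
```

After the files are in place, start the Managed Server at the command line or with a script as described in the last step.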
Additional Failure Topics
For information related to recovering JMS data from a failed server instance, see "Configuring JMS Migratable Targets" in Programming WebLogic JMS.
For information about transaction recovery after failure, see "Moving a Server to Another Machine" and "Transaction Recovery After a Server Fails" in Administration Console Online Help.