9.1.5 Enabling instance and HADR with TSA

As the base for automation, the components involved must first be described in a set of RSCT-defined resources. Because resources have diverse characteristics, there are various RSCT resource classes to accommodate the differences. In a TSA cluster, a resource is any piece of hardware or software that has been defined to IBM Resource Monitoring and Control (RMC). In this case, the DB2 instance and the HADR database pair are both resources in the cluster, and both are configured and registered with TSA for automation management. As explained above, every application must be defined as a resource so that it can be managed and automated by TSA. Application resources are usually defined in the generic resource class IBM.Application. This resource class has several attributes that define a resource, but at least three of them are application-specific:

  • StartCommand
  • StopCommand
  • MonitorCommand

These commands may be scripts or binary executables. You must ensure that the scripts are well tested and produce the desired effects within a reasonable period of time. This is necessary because these commands are the only interface between TSA and the application.
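To illustrate the idea, an IBM.Application resource could be defined manually with the RSCT mkrsrc command along these lines. This is only a sketch: the resource name, script paths, and node names are placeholders, and in the configuration described in this section the regdb2salin and reghadrsalin scripts create the actual resource definitions for you.

# Sketch only: regdb2salin/reghadrsalin normally create these definitions.
# The resource name, script paths, and node names below are placeholders.
mkrsrc IBM.Application \
      Name="db2_db2inst1_0-rs" \
      StartCommand="/path/to/db2_start.ksh db2inst1" \
      StopCommand="/path/to/db2_stop.ksh db2inst1" \
      MonitorCommand="/path/to/db2_monitor.ksh db2inst1" \
      UserName="root" \
      NodeNameList="{'rayden2','salmon'}"

The StartCommand, StopCommand, and MonitorCommand attributes point to exactly the kind of scripts described next.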

The automation package shipped with DB2 9 includes several scripts that control the behavior of the DB2 resources defined in a TSA cluster environment. Here is a description of the scripts:

  • For the DB2 instance:
regdb2salin: This script registers the DB2 instance with the TSA cluster environment as a resource.
db2_start.ksh, db2_stop.ksh, db2_monitor.ksh: These three scripts are registered as part of the TSA resource automation policy. TSA refers to this policy when monitoring DB2 instances and when responding to predefined events, such as restarting a DB2 instance when TSA detects that it has terminated. Example 9-10 shows a portion of the sample start script for the DB2 instance.

Example 9-10 db2_start.ksh

......
function activateDatabase
{
   Resource=db2hadr_${db?}-rs
   hn=$(hostname)
   NodeRG1=$(lsrsrc-api -s IBM.Application::'Name="'${Resource?}'" '::NodeNameList | grep -v $hn | tr "{" " " | tr "}" " " | tr "." " " | awk '{print $1}' | tail -1)
   if [[ ! -z "$NodeRG1" ]]; then
      # HADR database ...
      OpState=$(lsrsrc-api -s IBM.Application::'Name="'${Resource?}'"&& NodeNameList={"'${NodeRG1}'"} '::OpState 2> /dev/null)
      if [[ $OpState == 1 ]]; then
         # HADR is Primary on the other side, start as standby here
         su - ${DB2INSTANCE?} -c "db2 start hadr on db ${db?} as standby"
      else
         su - ${DB2INSTANCE?} -c "db2 activate database ${db?}"
      fi
   else
      # Not HADR database
      su - ${DB2INSTANCE?} -c "db2 restart database ${db?}" &
      sleep 1
   fi
}
......

  • For the HADR database pair:
reghadrsalin: This script registers the DB2 HADR pair with the TSA environment.

Take the following steps to register the DB2 instance and the HADR database pair with TSA:

1. Register the DB2 instance as a resource that can be managed by the TSA cluster. Run the command on the primary (P) and the standby (S) node:

(P)# regdb2salin -a db2inst1 -r -l rayden2
(S)# regdb2salin -a db2inst1 -r -l salmon

2. Check the status of the resource group for the DB2 instance. You should see output similar to Figure 9-6.

Figure 9-6 Resource group for DB2 instance in TSA cluster domain
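One way to produce this kind of status listing is the RSCT lsrg command. The resource group name used below, db2_db2inst1_0-rg, is only an assumption based on the naming that regdb2salin typically generates; run lsrg with no arguments first to list the actual group names in your domain:

# lsrg -m -g db2_db2inst1_0-rg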

3. Register the DB2 HADR database pair with the TSA cluster as a resource. Run the command on the primary (P) node:

(P)# reghadrsalin -a db2inst1 -b db2inst1 -d rmall

4. Check the status of the HADR resource group. You should see output similar to Figure 9-7.

Figure 9-7 Resource group for HADR in TSA cluster domain
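The HADR resource group can be checked the same way. Based on the resource naming shown in Example 9-10 (db2hadr_<database>-rs), the group name is assumed here to be db2hadr_rmall-rg for the rmall database; confirm the real name with lsrg before querying it:

# lsrg -m -g db2hadr_rmall-rg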

5. Check the status of the entire HADR cluster. Figure 9-8 shows the status of the cluster.

Figure 9-8 Status of the entire cluster
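The state of every resource group and member resource in the domain, which is essentially what this figure shows, can also be displayed with the lssam command shipped with TSA. The lsrpdomain and lsrpnode commands additionally report the state of the peer domain and of its nodes:

# lssam
# lsrpdomain
# lsrpnode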

hadr_start.ksh, hadr_stop.ksh, hadr_monitor.ksh: These three scripts are registered as part of the TSA automation policy so that TSA can monitor and control the behavior of the HADR database pair. Example 9-11 shows a portion of hadr_start.ksh:

Example 9-11 hadr_start.ksh

......
###########################################################
# starthadr()
###########################################################
starthadr()
{
   set_candidate_P_instance
   instance_to_start=${candidate_P_instance}
   HADR_partner_node_state

   $SVC_PROBE ${DB2HADRINSTANCE1?} ${DB2HADRINSTANCE2?} ${DB2HADRDBNAME?} ${VERBOSE?} S
   rc=$?

   if [[ $remote_node_alive == "Online" ]]; then
      # Bring up HADR as Primary on this node
      if [ $rc -eq 1 ]; then
         # already primary
         rc=0
      elif [ $rc -eq 2 ]; then
         # currently standby, peer
         # takeover (no force)
         logger -i  -p notice -t $0 "su - ${instance_to_start?} -c db2 takeover hadr on db ${DB2HADRDBNAME?}"
         su - ${instance_to_start?} -c "db2 takeover hadr on db ${DB2HADRDBNAME?}" 
         $SVC_PROBE ${DB2HADRINSTANCE1?} ${DB2HADRINSTANCE2?} ${DB2HADRDBNAME?} ${VERBOSE?}
         rc1=$?
         if [ $rc1 -ne 1 ]; then
            :
            logger -i  -p err -t $0 "*** Database ${DB2HADRDBNAME} is in Peer State, TAKEOVER FAILED"
            # Old primary node is still online, offline instance to prevent split-brain
            # Uncomment following 3 lines to allow takeover by force
            #chrg -o Offline -s "Name = '${forceRGOfflineInCaseOfByForce?}'"
            #su - ${instance_to_start?} -c "db2 takeover hadr on db ${DB2HADRDBNAME?} by force"
            #logger -i  -p notice -t $0 "NOTICE: Takeover by force issued, old primary instance offlined to prevent split brain"
         fi

      elif [ $rc -eq 40 ]; then
         :
         logger -i  -p err -t $0 "*** Database ${DB2HADRDBNAME} is not in Peer State, old Primary machine still Online"
         # Uncomment following 3 lines to allow takeover even in case of non Peer Standby w/ old Primary machine Online
         #chrg -o Offline -s "Name = '${forceRGOfflineInCaseOfByForce?}'"
         #su - ${instance_to_start?} -c "db2 takeover hadr on db ${DB2HADRDBNAME?} by force "
         #logger -i  -p notice -t $0 "NOTICE: Takeover by force issued, old primary instance offlined to prevent split brain"
      else
         # current state of HADR is unknown
         # eg. If instance has just gone down, wait until it's 2*monitor period 
         # so that instance can be restarted and db ACTIVATEd
         sleep 20
      fi # Bring up HADR as Primary on this machine

   else
      # Old primary machine is offline
      if [ $rc -eq 2 ]; then
         # Standby is currently in Peer State
         #
         # To bring up standby, will now do a TAKEOVER BY FORCE
         # No need to block until resource group is offline, we have verified
         # that the node is down already
         :
         logger -i  -p notice -t $0 "su - ${instance_to_start?} -c db2 takeover hadr on db ${DB2HADRDBNAME?} by force"
         su - ${instance_to_start?} -c "db2 takeover hadr on db ${DB2HADRDBNAME?} by force "
         logger -i  -p notice -t $0 "NOTICE: Takeover by force issued"
      elif [ $rc -eq 40 ]; then
         # Standby is currently not in Peer State
         :
         logger -i  -p err -t $0 "*** Database ${DB2HADRDBNAME} is not in Peer State, old Primary machine Offline"
         
         # Uncomment following 3 lines to allow takeover even in case of non Peer Standby w/ old Primary machine Offline
         #logger -i  -p notice -t $0 "su - ${instance_to_start?} -c db2 takeover hadr on db ${DB2HADRDBNAME?} by force"
         #su - ${instance_to_start?} -c "db2 takeover hadr on db ${DB2HADRDBNAME?} by force "
         #logger -i  -p notice -t $0 "NOTICE: Takeover by force issued"
      fi
   fi # Bring up HADR on this machine

   # Return state 
   $SVC_PROBE ${DB2HADRINSTANCE1?} ${DB2HADRINSTANCE2?} ${DB2HADRDBNAME?} ${VERBOSE?} S
   rcs=$?

   # A successful Online must return 0, while the monitor returns 1
   # for Primary in Peer State and 3 for Primary not in Peer State
   if [ $rcs -eq 1 ]; then
      rc=0
   elif [ $rcs -eq 3 ]; then
      rc=0
      # Anything else, map directly from monitor
   else
      rc=$rcs
   fi

   return $rc
}
......
