


Use zones for replica placement


Zone support allows sophisticated configurations for replica placement across data centers. With this capability, grids of thousands of partitions can be managed by using a handful of optional placement rules. A data center can be a different floor of a building, a different building, or even a different city, or some other distinction that you configure with zone rules.


Flexibility of zones

You can place shards into zones. This function gives you more control over how eXtreme Scale places shards on a grid. Java™ virtual machines that host an eXtreme Scale server can be tagged with a zone identifier. The deployment file can include one or more zone rules, and these zone rules are associated with shard types. The best way to explain this capability is with some examples, followed by more detail.

Placement zones control how eXtreme Scale assigns primaries and replicas, which lets you configure advanced topologies.

A Java virtual machine can have multiple containers but only one server. A container can host multiple shards from a single ObjectGrid.

This capability is useful to make sure that replicas and primaries are placed in different locations or zones for better high availability. Normally, eXtreme Scale does not place a primary and replica shard on Java virtual machines with the same IP address. This simple rule normally prevents two eXtreme Scale servers from being placed on the same physical computer. However, you might require a more flexible mechanism. For example, you might be using two blade chassis and want the primaries striped across both chassis, with the replica for each primary placed on the chassis opposite its primary.

Striping the primaries means that primaries are placed into each zone, and the replica for each primary is located in the opposite zone. For example, primary 0 would be in zoneA and synchronous replica 0 would be in zoneB; primary 1 would be in zoneB and synchronous replica 1 would be in zoneA.

The chassis name would be the zone name in this case. Alternatively, you might name zones after floors in a building and use zones to make sure that primaries and replicas for the same data are on different floors. Buildings and data centers are also possible. Testing has been done across data centers using zones as a mechanism to ensure the data is adequately replicated between the data centers. If you are using the HTTP Session Manager for eXtreme Scale, you can also use zones. With this feature, you can deploy a single Web application across three data centers and ensure that HTTP sessions for users are replicated across data centers so that the sessions can be recovered even if an entire data center fails.

WebSphere eXtreme Scale is aware of the need to manage a large grid over multiple data centers. It can make sure that backups and primaries for the same partition are located in different data centers, if that is required. It can place all primaries in data center 1 and all replicas in data center 2, or it can round-robin primaries and replicas between both data centers. The rules are flexible, so many scenarios are possible. eXtreme Scale can also manage thousands of servers, which, together with completely automatic, data center-aware placement, makes such large grids affordable from an administrative point of view. Administrators can specify what they want to do simply and efficiently.

As an administrator, use placement zones to control where primary and replica shards are placed, which allows you to set up advanced high-performance and highly available topologies. You can define a zone as any logical grouping of eXtreme Scale processes. These zones can correspond to physical locations such as a data center, a floor of a data center, or a blade chassis. You can stripe data across zones, which provides increased availability, or you can split the primaries and replicas into separate zones when a hot standby is required.


Associate an eXtreme Scale server with a zone when not using WebSphere Extended Deployment

If eXtreme Scale is used with Java Standard Edition or with an application server that is not based on WebSphere Extended Deployment v6.1, a JVM that hosts shard containers can be associated with a zone by using the following techniques.


Applications using the startOgServer script

The startOgServer script starts an eXtreme Scale server when it is not embedded in an existing application server. The -zone parameter specifies the zone to use for all containers within the server.
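
For example, a command such as the following starts a container server whose containers all belong to zone ALPHA. The server name, XML file names, and catalog service endpoint shown here are illustrative placeholders, not required values:

startOgServer.sh server1 -objectgridFile objectgrid.xml -deploymentPolicyFile deployment.xml -catalogServiceEndPoints cataloghost:2809 -zone ALPHA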


Specify the zone when starting a container using APIs

The zone name for a container can be specified as described in the embedded server API documentation.
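
A minimal sketch of this approach follows, assuming the embedded server API classes in the com.ibm.websphere.objectgrid.server and com.ibm.websphere.objectgrid.deployment packages. The server name, zone name, and descriptor file names are illustrative, and the exact method names and argument order should be verified against the embedded server API documentation.

import java.net.URL;

import com.ibm.websphere.objectgrid.deployment.DeploymentPolicy;
import com.ibm.websphere.objectgrid.deployment.DeploymentPolicyFactory;
import com.ibm.websphere.objectgrid.server.Container;
import com.ibm.websphere.objectgrid.server.Server;
import com.ibm.websphere.objectgrid.server.ServerFactory;
import com.ibm.websphere.objectgrid.server.ServerProperties;

public class ZoneContainerStarter {
    public static void main(String[] args) throws Exception {
        // Tag this JVM's server with a zone before the embedded server is started.
        // The zone applies to all containers that this JVM hosts.
        ServerProperties props = ServerFactory.getServerProperties();
        props.setServerName("server1"); // illustrative server name
        props.setZoneName("ALPHA");     // illustrative zone name

        // Start the embedded server and create a container from the deployment
        // policy (which defines the zone rules) and the ObjectGrid descriptor.
        Server server = ServerFactory.getInstance();
        URL deploymentXml = ZoneContainerStarter.class.getResource("/deployment.xml");
        URL objectGridXml = ZoneContainerStarter.class.getResource("/objectgrid.xml");
        DeploymentPolicy policy =
            DeploymentPolicyFactory.createDeploymentPolicy(deploymentXml, objectGridXml);
        Container container = server.createContainer(policy);
    }
}

After the container is created, the catalog service treats every shard placed in this JVM as residing in zone ALPHA when it applies the zone rules.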


Associate WebSphere Extended Deployment nodes with zones

If you are using eXtreme Scale with WebSphere Extended Deployment JEE applications, you can leverage WebSphere Extended Deployment node groups to place servers in specific zones.

In eXtreme Scale, a JVM can be a member of only a single zone. However, WebSphere allows a node to be a part of multiple node groups. You can use eXtreme Scale zones in this way as long as you ensure that each node is a member of only one zone node group.

Use the following syntax to name the node group in order to declare it a zone: ReplicationZone<UniqueSuffix>. Servers running on a node that is part of such a node group are included in the zone specified by the node group name. The following is a description of an example topology.

First, you configure 4 nodes: node1, node2, node3, and node4, each node having 2 servers. Then you create a node group named ReplicationZoneA and a node group named ReplicationZoneB. Next, you add node1 and node2 to ReplicationZoneA and add node3 and node4 to ReplicationZoneB.

When the servers on node1 and node2 are started, they will become part of ReplicationZoneA, and likewise the servers on node3 and node4 will become part of ReplicationZoneB.

A grid-member JVM checks for zone membership at startup only. Adding a new node group or changing the membership only has an impact on newly started or restarted JVMs.


Zone rules

An eXtreme Scale partition has one primary shard and zero or more replica shards. For these examples, consider the following naming convention for the shards: P is the primary shard, S is a synchronous replica, and A is an asynchronous replica. A zone rule has three components: a name, a list of zones, and an inclusive or exclusive placement flag.

The zone list in a zone rule specifies the set of zones in which a shard can be placed. The inclusive setting indicates that after a shard is placed in a zone from the list, all other shards are also placed in that zone. The exclusive setting indicates that each shard for a partition is placed in a different zone from the list. For example, using an exclusive setting means that if there are three shards (a primary and two synchronous replicas), the zone list must have three zones.

Each shard can be associated with one zone rule. A zone rule can be shared between two shards. When a rule is shared, the inclusive or exclusive flag extends across the shards of all types that share the rule.


Examples

A set of examples showing various scenarios and the deployment configuration to implement the scenarios follows.


Striping primaries and replicas across zones

You have three blade chassis and want primaries distributed across all three, with a single synchronous replica placed in a different chassis from its primary. Define each chassis as a zone, named ALPHA, BETA, and GAMMA. An example deployment XML follows:

<?xml version="1.0" encoding="UTF-8"?>
<deploymentPolicy xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://ibm.com/ws/objectgrid/deploymentPolicy ../deploymentPolicy.xsd"
        xmlns="http://ibm.com/ws/objectgrid/deploymentPolicy">

    <objectgridDeployment objectgridName="library">
        <mapSet name="ms1" numberOfPartitions="37" minSyncReplicas="1"
            maxSyncReplicas="1" maxAsyncReplicas="0">
            <map ref="book" />
            <zoneMetadata>
                <shardMapping shard="P" zoneRuleRef="stripeZone"/>
                <shardMapping shard="S" zoneRuleRef="stripeZone"/>
                <zoneRule name="stripeZone" exclusivePlacement="true">
                    <zone name="ALPHA" />
                    <zone name="BETA" />
                    <zone name="GAMMA" />
                </zoneRule>
            </zoneMetadata>
        </mapSet>
    </objectgridDeployment>
</deploymentPolicy>

This deployment XML contains a grid called library with a single map called book. It uses 37 partitions, each with a single synchronous replica. The zone metadata clause shows the definition of a single zone rule and the association of that zone rule with shards. The primary and synchronous shards are both associated with the zone rule "stripeZone". The zone rule contains all three zones and uses exclusive placement. This rule means that if the primary for partition 0 is placed in ALPHA, then the replica for partition 0 is placed in either BETA or GAMMA. Similarly, primaries for other partitions are distributed across the three zones, and each replica is placed in a different zone from its primary.


Asynchronous replica in a different zone than primary and synchronous replica

In this example, two buildings exist with a high-latency connection between them. You want high availability with no data loss for all scenarios. However, the performance impact of synchronous replication between buildings leads you to a trade-off: you place the primary and a synchronous replica in one building, and an asynchronous replica in the other building. Normally, the failures that occur are JVM crashes or computer failures rather than large-scale events. With this topology, you can survive these normal failures with no data loss. The loss of a building is rare enough that some data loss is acceptable in that case. You can create two zones, one for each building. The deployment XML file follows:

<?xml version="1.0" encoding="UTF-8"?>
<deploymentPolicy xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://ibm.com/ws/objectgrid/deploymentPolicy ../deploymentPolicy.xsd"
        xmlns="http://ibm.com/ws/objectgrid/deploymentPolicy">

    <objectgridDeployment objectgridName="library">
        <mapSet name="ms1" numberOfPartitions="13" minSyncReplicas="1"
            maxSyncReplicas="1" maxAsyncReplicas="1">
            <map ref="book" />
            <zoneMetadata>
                <shardMapping shard="P" zoneRuleRef="primarySync"/>
                <shardMapping shard="S" zoneRuleRef="primarySync"/>
                <shardMapping shard="A" zoneRuleRef="aysnc"/>
                <zoneRule name ="primarySync" exclusivePlacement="false" >
                        <zone name="BldA" />
                        <zone name="BldB" />
                </zoneRule>
                <zoneRule name="aysnc" exclusivePlacement="true">
                        <zone name="BldA" />
                        <zone name="BldB" />
                </zoneRule>
            </zoneMetadata>
        </mapSet>
    </objectgridDeployment>
</deploymentPolicy>

The primary and synchronous replica share the primarySync zone rule, which has an exclusivePlacement setting of false. So, after the primary or synchronous replica is placed in a zone, the other is placed in the same zone. The asynchronous replica uses a second zone rule with the same zones as the primarySync rule, but with the exclusivePlacement attribute set to true. This attribute indicates that a shard cannot be placed in a zone that already contains another shard from the same partition. As a result, the asynchronous replica is not placed in the same zone as the primary or synchronous replica.


Place all primaries in one zone and all replicas in another zone

Here, all primaries are placed in one specific zone and all replicas in a different zone. Each partition has a primary and a single asynchronous replica. All primaries are placed in zone A, and all replicas in zone B.

<?xml version="1.0" encoding="UTF-8"?>
<deploymentPolicy xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://ibm.com/ws/objectgrid/deploymentPolicy ../deploymentPolicy.xsd"
        xmlns="http://ibm.com/ws/objectgrid/deploymentPolicy">

    <objectgridDeployment objectgridName="library">
        <mapSet name="ms1" numberOfPartitions="13" minSyncReplicas="0"
            maxSyncReplicas="0" maxAsyncReplicas="1">
            <map ref="book" />
            <zoneMetadata>
                <shardMapping shard="P" zoneRuleRef="primaryRule"/>
                <shardMapping shard="A" zoneRuleRef="replicaRule"/>
                <zoneRule name="primaryRule">
                    <zone name="A" />
                </zoneRule>
                <zoneRule name="replicaRule">
                    <zone name="B" />
                </zoneRule>
            </zoneMetadata>
        </mapSet>
    </objectgridDeployment>
</deploymentPolicy>

Here, you can see two rules, one for the primaries (P) and another for the replica (A).


Zones over wide area networks (WAN)

You might want to deploy a single eXtreme Scale grid over multiple buildings or data centers with slower network interconnections. Slower network connections lead to lower bandwidth and higher latency. The possibility of network partitions also increases in this mode because of network congestion and other factors. eXtreme Scale approaches this harsh environment in the following ways.


Limited heart beating between zones

Java virtual machines that are grouped together into core groups send heartbeats to each other. When the catalog service organizes Java virtual machines into core groups, those groups do not span zones. A leader within each group pushes membership information to the catalog service. The catalog service verifies any reported failure before taking action, by attempting to connect to the suspect Java virtual machines. If the catalog service sees a false failure detection, it takes no action, because the core group partition heals in a short period of time.

The catalog service also sends heartbeats to the core group leaders periodically, at a slow rate, to handle the case of core group isolation.


Catalog service as the grid tie breaker

The catalog service is the tie breaker for an eXtreme Scale grid. It is essential that the catalog service acts with a single voice for the grid to operate. The catalog service runs on a fixed set of Java virtual machines, replicating data from an elected primary to all the other Java virtual machines in that set. The catalog service should be distributed among the physical zones or data centers to lower the probability that it becomes isolated from the grid, and so that it can survive anticipated failure scenarios.

The catalog service communicates with the container Java virtual machines in the grid using idempotent or recoverable operations. It does this communication using IIOP. All state changes in the catalog service are synchronously replicated amongst the current members hosting the catalog service. This replication only succeeds if the majority of the Java virtual machines accept the change. This means that if the catalog service partitions then only the majority partition can commit changes. The service primary will only send commands to the container Java virtual machines if the state change that creates that command commits. This means a minority partition cannot advance its state or issue commands to containers.

A partitioned catalog service does not stop the grid from functioning. The grid still accepts client requests and executes operations. However, if there is no majority catalog service partition, failures in the grid are not recovered until the catalog service regains a majority. If recovery is delayed, then over time, as failures occur, certain partitions go offline until the catalog service regains a majority.

The core group leader Java virtual machines report membership changes to the catalog service. If the catalog service partitions, the service pushes back an updated route table for the catalog service. A route table from a minority partition does not include the location of a primary. The leader must then iterate across all the possible catalog service Java virtual machines to try to locate the primary partition, and it must do so periodically while waiting for the partition to be resolved. After the leader receives a route table that includes a primary, any pending recovery actions are directed by the majority catalog service primary.

If a core group cannot connect to a catalog service primary for a period of time, then either the core group is physically disconnected from the rest of the grid (possibly along with a minority catalog service partition) or the catalog service is stuck in minority partitions; it is impossible to tell the difference. If a majority catalog service partition exists, it might be recovering from the apparent loss of the disconnected core group. This recovery can lead to two primaries for the same partition: the old primary in the disconnected core group and the new primary in the rest of the network. The majority catalog service partition has no way to stop the old primaries, because it is in a disconnected network state with respect to them. When connectivity is restored and the previously disconnected core group discovers the new primaries, the catalog service notices that there are two primaries. It instructs the previously disconnected core groups to delete all of their shards, and then rebalancing occurs.

If the catalog service partitions into two minority partitions, or only a single minority partition survives, then the customer must get involved to help with recovery. A Java Management Extensions (JMX) command is required to specify that a single minority partition is allowed to take action. Verify that the other minority partitions are stopped.



Parent topic

Replication for availability

