WebSphere Portal: Performance testing and analysis
- Overview
- The environment
- Portal infrastructure baseline
- Application of the Portal Tuning Guide recommendations
- Load generation
- User scenarios
- Think time
- Cookies and sessions
- Metrics
- Virtual user as opposed to Think Time
- Repeatability principle
- Driving to saturation
- Bottleneck analysis
- The process
- Note on ramp rates
- Priming the portal
- Java thread dumps
- JVM heap utilization
- Enable verboseGC
- Log
- Java class and variable synchronization
- Database contention
- LDAP responsiveness
- Excessive session sizes
- Exceptions being thrown
- Dynacache concerns DRS replication modes
- Dynacache eviction concerns
- Operating system concerns
- Capacity planning
- The process
- Extrapolating results
- Testing with the full cluster
- Failover testing
- Ongoing capacity planning
- Vertical clustering
- Costs of vertical clustering
- Indications for vertical clustering
- Reliability
- Memory utilization
- Java synchronized methods and class variables
Overview
Performance testing has three main objectives...
- Determine the load level at which a system under test fails.
- Find and remove bottlenecks in a system that throttle throughput.
- Capacity planning: Predict the amount of horsepower needed to sustain defined user loads within service level agreements.
The system is defined as the complete end-to-end set of components required to deliver the requested Web page to the requesting user's browser. The most visible and often the most troublesome components tend to be...
- WebSphere Portal
- WebSphere Portal database
- LDAP
- databases
- appservers
The back-end systems tend to present the most risk in WebSphere Portal deployments because they are frequently maintained by separate organizations. This separation dilutes the communication channel between the WebSphere Portal deployment team and the back-end teams with respect to performance objectives.
The environment
To meet the performance test objectives outlined previously, the performance test environment needs to be either the production environment itself or a mirror of production. That mirror has, as far as is practically possible, the same hardware, the same topology, and the same back-end systems. If any piece of this complex test topology differs from its production counterpart, you must extrapolate the results from the test environment to predict the effect in the production environment. These extrapolations generally require detailed implementation knowledge of the portal and the deployed applications, which generally is not available to the testing organization. Making the test environment equivalent to the production environment gives you acceptable confidence that the test results reflect what actually happens in production.
An important goal of the test environment is the repeatability of results. As slight changes are made in the system, repeatability ensures that you can accurately measure the effect of these changes. For that reason, it is optimal to have the system on an isolated network during the performance testing. Running the performance test on the production network introduces variability (for example, user traffic) that can skew such metrics as page render response time.
There is also a more pragmatic reason to isolate the performance test network. Putting WebSphere Portal under stress likely puts the corporate network under stress. This stress is often problematic during normal business hours.
If placing the performance test on an isolated network is not feasible, you should at least try to ensure that the components of the test are all collocated on the same subnet of a network router. WebSphere Portal best practice recommends using a gigabit Ethernet connection between the portal and its database. Optimally, this connection extends to...
- LDAP servers
- Web servers
- back-end services
Load generators should be on a LAN segment local to the Web server and/or the portal.
A common concern involves the load generators being on the same local LAN segment as the portal itself. The objection is that such a test does not give a true picture of the performance of the system because it excludes the network between the data center and the users.
The process described here is for tuning and resolving issues with the portal and its surrounding components. Trying to tune the network between the users (or the load generators) and the portals makes the analysis and problem resolution needlessly complex. We therefore remove it from the test. There are far better tools and processes for network tuning than the processes used here.
Portal infrastructure baseline
Conduct an incremental set of baseline tests that exercise the infrastructure. Subsequent tests should then gradually augment the portal with custom code. The test plan should thus move from a simple topology to the final production topology to make it easier to isolate problematic components.
The first test is the complete WebSphere Portal infrastructure using an out-of-the-box portal...
- Transfer the database
- Enable security
- Configure front-end Web servers, firewalls and load balancers
- Configure security managers (SiteMinder, WebSeal)
Create a simple home page with a couple of portlets that do not access any back-end systems (for example, the World Clock portlet). Create a simple load testing script that accesses the unauthenticated home page and then logs in (authenticates) and idles without logging out. From this point, you want to add simulated users (virtual users) until the system is saturated. Using the bottleneck analysis techniques described below, find and fix any bottlenecks in the infrastructure. Note the performance baseline of this system.
Now, add to the system any customized themes and skins, and repeat the previous test. Find and fix any important bottlenecks in the revised system. Finally, as described below, add the actual portlets to be used on the home page and perform bottleneck analysis.
This baseline environment can be very effective in finding bottlenecks in the infrastructure that are independent of the application. Further, it can provide a reference when analyzing the extent to which the applications place additional load above and beyond the basic WebSphere Portal infrastructure.
Your strategy is to conduct the same tests listed below for bottleneck analysis in this baseline environment, optimize the environment, and then perform bottleneck analysis with the actual applications.
Application of the Portal Tuning Guide recommendations
Apply the recommendations outlined in the WebSphere Portal Tuning Guide to all systems before you embark on any performance testing. The guide provides a good starting point because it fixes known performance inhibitors in a default WebSphere Portal installation. Although bottleneck analysis would likely find the same problems, it is better to remove them from the beginning.
The tuning guide is a starting point for your performance testing and not the final set of configuration changes needed to optimize your Portal. Your application(s) along with your unique themes and skins can greatly affect the correct setting needed to optimize performance.
Load generation
A performance test requires the use of a load generator to produce simulated user requests for Web pages. This tool should produce such metrics as...
- response time
- page views per second
...to determine when the system fails its Service Level Agreement (SLA) contract or is saturated to the point that injecting more page requests per unit time does not result in higher page production. The generator's ability to aggregate data such as...
- CPU utilization on the portal and HTTP servers
- mod_status data from the HTTP server
...aids in problem determination.
A number of load generators are commonly used to create Web traffic (drive load) in the test system. Some of the more commonly used tools include...
- Mercury Load Runner
- Borland SilkPerformer
- IBM Rational Performance Tester
The load generator should have sufficient virtual users to drive the system to saturation. Note that virtual users do not map directly to actual users. A virtual user represents an active channel on the load generator. A virtual user may simulate multiple actual users; however, only one actual user can be active for each virtual user.
If the system requires authenticated access to the WebSphere Portal applications under test, sufficient unique test user IDs must exist in the LDAP directory, and the scripts should ensure that only a reasonable number of duplicated logins occur during the test. (A reasonable number in this context accounts for the fact that some users might have a couple of instances of the browser open, each with the same WebSphere Portal login ID.) WebSphere Portal has a large caching infrastructure for portal artifacts. These artifacts are generally cached on a per-user basis. If the load simulation uses the same user ID for all tests, performance appears artificially high because the artifacts do not need to be loaded from the LDAP directory and the database.
User scenarios
To tune the WebSphere Portal system to handle large numbers of users and to accurately predict its ability to handle specific numbers of users correctly, determine the most probable scenarios for users of the system. The test must then accurately simulate those user scenarios using the load generator. One effective way to do this step is to list the most likely use cases. Write a script for each of these use cases or as many as are practical. Now, assign each scenario a probability: the percentage of the whole user population that is expected to execute it. As the test is run, assign use cases to virtual users in the same proportion as the expected general population. As the number of virtual users is ramped up (discussed later), try to maintain this proportion.
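The proportional assignment of use cases to virtual users can be sketched as follows. This is a minimal illustration, not part of any load generator's API: the function name, the scenario names, and the percentages are all hypothetical, and the largest-remainder rounding is just one reasonable way to keep the counts summing to the requested total.

```python
def allocate_virtual_users(total_users, scenario_weights):
    """Split total_users across scenarios in proportion to their weights.

    Uses the largest-remainder method so the counts always sum to total_users.
    """
    total_weight = sum(scenario_weights.values())
    exact = {name: total_users * weight / total_weight
             for name, weight in scenario_weights.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = total_users - sum(counts.values())
    # Hand the remaining users to the scenarios with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical scenario mix; names and percentages are illustrative only.
mix = {"login_and_idle": 50, "browse_pages": 30, "search": 20}
print(allocate_virtual_users(10, mix))
print(allocate_virtual_users(25, mix))
```

Calling the function again at each ramp step with the new total keeps the mix close to the intended proportions as virtual users are added.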
A virtual user represents one active channel over which requests are made and returned.
Think time
Think time is the average amount of time that a normal user pauses between individual mouse clicks or key presses during the course of using WebSphere Portal. In the load generation tools, this time is usually programmable, yielding a random time within a predefined range.
As think time is reduced, the number of requests per second increases, which in turn increases the load on the system. Reducing think time generally increases the average response time for WebSphere Portal login and page-to-page navigation. Therefore, accurately estimating real user think time is important for producing an accurate model of the system in production, particularly for capacity planning.
In most use cases, a think time of 10 seconds plus or minus 50 percent is reasonable for a portal having experienced users. A figure closer to 30 seconds is more reasonable for a portal with inexperienced users.
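As a rough sketch of what "10 seconds plus or minus 50 percent" means in a script, the following draws think times uniformly from that range and estimates the resulting request rate. The function names are hypothetical, and the rate estimate assumes a closed system in which each virtual user issues one request, waits for the response, and then pauses for the think time.

```python
import random


def think_time(mean_seconds=10.0, spread=0.5, rng=random):
    """Return a random think time in [mean*(1-spread), mean*(1+spread)] seconds."""
    low = mean_seconds * (1.0 - spread)
    high = mean_seconds * (1.0 + spread)
    return rng.uniform(low, high)


def requests_per_second(virtual_users, mean_think_seconds, mean_response_seconds):
    """Estimate request rate for a closed loop of request, response, think."""
    return virtual_users / (mean_think_seconds + mean_response_seconds)


samples = [think_time() for _ in range(1000)]
print(min(samples), max(samples))

# Halving the think time roughly doubles the offered load for the same users.
print(requests_per_second(500, 10.0, 1.0))
print(requests_per_second(500, 5.0, 1.0))
```

The second function shows why think time matters for capacity planning: the same 500 virtual users generate a very different request rate depending on the think time assumed.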
Cookies and sessions
Generally, most real users log into the portal and execute the task that needs to be done; however, they rarely log out by using the logout button. Rather, they let the browser sit idle until their session times out. As a result, many sessions typically sit in memory waiting for cleanup pending the WAS session timeout. This behavior increases the JVM heap working set, which increases the probability of heap exhaustion in the JVM. Heap exhaustion can be both a performance bottleneck and a cause of JVM failure.
Effective simulations must model this behavior of users who do not explicitly log out. As each individual simulation executes a particular use case, it should end the use case by going idle as opposed to logging out. As the script cycles back around to log in a new user on this particular virtual user, the cookies for the old session (JSESSIONID) and LTPA, along with any application-specific cookies, need to be cleaned up appropriately before logging in the next user using that script. This model also implies that sufficient test IDs need to exist so that a test ID can sit idle for the length of the WAS session timeout without risk of being reused until the previous session times out.
Metrics
The most important metric is page views per second.
Also important are request response times. Because login is very expensive in Portal, login response time, along with page-to-page response time, needs to be instrumented. Most load generators already provide an aggregate page views per second (page views) metric.
At the conclusion of each test, a graph of the virtual user ramp versus these three metrics (page views per second, login response time, and page-to-page response time) is required for analysis.
In addition to the metrics gathered by the load generation tool, use a system monitoring tool such as...
- IBM Tivoli Composite Application Manager (ITCAM) for WebSphere
- Computer Associates Wily IntroScope
These tools run on the WebSphere Portal instance and instrument the JVM directly. They are useful in both detection and resolution of system bottlenecks.
Virtual user as opposed to Think Time
A common misconception is that to accurately simulate a large population that generates requests at a certain rate, a smaller number of users that generate requests with a smaller think time will suffice.
It's important to note that running with a small number of users and a low think time results in unrealistically high cache hit rates. It also means that too few sessions are created. Because session size is often a serious problem for many portlet applications, this approach gives an unrealistically good view of the system performance and leads to surprises in production.
Another poor practice is running a small set of virtual users with no think time.
Repeatability principle
In a large population, it is easy to assume that most user actions appear to be random as users navigate through the portal. Experienced users, though, typically use the same patterns over and over. Furthermore, from a test engineering perspective, the user scenarios need to be reasonably static so that system changes can be effectively measured from run to run.
Therefore, the definition of the repeatability principle is that for all runs of a particular scenario, the metrics (average response time, page views, saturation point, and so on) produced by the runs all converge to the same results if the runs are sufficiently long. Note that with more variation (that is, unique scenarios) in the test scripts, longer times are required to converge, on average.
The simulation scripts written for the performance tests should adhere to the repeatability principle.
Driving to saturation
Saturation is defined as the number of active virtual users at which point adding more virtual users does not result in an increase in the number of page views. Note that this saturation point is for a given simulation; each different simulation likely has a different saturation point.
To effectively drive a system to saturation...
- Add virtual users a few at a time
- Let the system stabilize
- Observe whether page views increase
- Add more virtual users and repeat for as long as page views keep increasing
"Stabilize," in this context, means that the response times are steady within a window of several minutes. In Rational Performance Tester, if you plot virtual users against throughput (page views), page views initially rise linearly with the number of virtual users, then reach a maximum and actually decrease slightly from that point. The saturation point is the number of virtual users at which page views are at their maximum.
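Given one stabilized reading per ramp step, locating the saturation point amounts to finding the step with the highest page views. The sketch below assumes the readings have been exported from the load generator as (virtual users, page views per second) pairs; the function name and the sample numbers are hypothetical.

```python
def find_saturation_point(samples):
    """Return the (virtual_users, page_views) pair with the highest page views.

    samples: list of (virtual_users, page_views_per_second) tuples, one per
    ramp step, each taken after the system has stabilized.
    """
    if not samples:
        raise ValueError("need at least one sample")
    # max() keeps the first maximum, so ties resolve to the fewest users.
    return max(samples, key=lambda sample: sample[1])


# Hypothetical readings: throughput rises, peaks, then declines slightly.
readings = [(10, 8.0), (20, 16.1), (30, 23.9), (40, 29.5), (50, 31.0), (60, 30.4)]
print(find_saturation_point(readings))  # (50, 31.0)
```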
Bottleneck analysis
The goal of bottleneck analysis is to remove impediments which inhibit driving the system to a higher load. The metric defined for higher load is a higher number of page views at saturation. Therefore, bottleneck analysis removes impediments to improve the saturation point.
Bottlenecks in a WebSphere Portal environment under load are generally the result of two issues...
- Contention for shared resources...
  - synchronized Java classes
  - methods
  - data structures
  - contention for serial resources (for example, SystemOut.log)
- Excessive response times in...
  - back-end databases
  - remote WCM systems
  - Web servers
  - network
  - routers
  - firewalls
As load increases, contention for these resources increases, making contention locks easier to detect and correct. This detail is why effective load testing is a requirement for bottleneck analysis.
A common mistake is to focus only on page response times. Many performance testers prefer to optimize render response times because this delay is the most obvious user requirement. This type of performance analysis requires path length reduction in custom portlet applications. Response time optimization is generally more appropriately done in a non-loaded system and with tooling specific to the task (for example, JProbe).
The process
The process of performing bottleneck analysis is straightforward. For a particular load simulation (for example, one run with Rational Performance Tester)...
1. Ramp a single WebSphere Portal JVM to saturation.
2. Determine the bottlenecks that exist at saturation.
3. Resolve the bottlenecks.
4. Unless satisfied with system capacity, go to step 1 and find the next bottleneck.
Note that this process is iterative. The key concept is that you fix one bottleneck to find the next bottleneck. You stop the process either when you are satisfied with the system performance or when the cost to resolve the next bottleneck becomes unjustifiable.
Most shops generally do not allocate enough time for this work because they fail to realize the iterative nature of this process.
A single JVM is used for this process because detection of the bottleneck is much simpler. Finding and resolving cross-JVM contention can be quite complex. After a single JVM has been tuned as much as desired, you move on to the capacity planning analysis for multiple nodes as described later in this article.
Note on ramp rates
A common question in performance testing is the rate at which virtual users should be ramped into the system.
Do not ramp in several hundred users as quickly as possible until the system collapses. This approach is not representative of reality, and it does not provide repeatable results.
You should model reality. Predict or measure the actual highest ramp rate that you would expect the portal to endure. This rate might typically occur during the hours that your users most often log into the portal, such as first thing in the morning when they arrive at the office. We recommend that you ramp a small fixed number (for example, two virtual users per minute) for a set period of time (for example, five minutes). Then wait for a time to let the system stabilize (for example, five minutes) at which time you then loop back and add another batch of virtual users in the same fashion.
This technique gives the portal time to fill the various caches in an orderly fashion and provides for the ability to more accurately detect saturation points.
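The stepped ramp described above (add a few users per minute for a while, then hold) can be written down as a minute-by-minute schedule. This is a sketch only; the function name and parameters are hypothetical, and a real load generator would express the same plan in its own scheduling syntax.

```python
def ramp_schedule(users_per_minute, ramp_minutes, settle_minutes, batches):
    """Return a list of (minute, active_virtual_users) for a stepped ramp.

    Each batch adds users_per_minute users for ramp_minutes, then holds the
    load steady for settle_minutes so the system can stabilize.
    """
    schedule = []
    minute = 0
    active = 0
    for _ in range(batches):
        for _ in range(ramp_minutes):
            minute += 1
            active += users_per_minute
            schedule.append((minute, active))
        for _ in range(settle_minutes):
            minute += 1
            schedule.append((minute, active))
    return schedule


# Two users per minute for five minutes, then five minutes to settle, three times.
plan = ramp_schedule(2, 5, 5, 3)
print(plan[4])   # (5, 10)
print(plan[9])   # (10, 10)
print(plan[-1])  # (30, 30)
```

Readings for saturation analysis are best taken at the end of each settle window, when response times have steadied.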
Priming the portal
After a portal restart, a short script should be executed prior to the main test to preload certain caches (for example, WebSphere Portal access control and the anonymous page cache). Failure to do so can skew the initial response times inordinately.
Java thread dumps
After you have a portal at saturation, take a Java thread dump...
kill -3 pid
...against the portal Java process under test. Look for threads that are all...
- blocked by the same condition
- waiting in the same method of the same class
In general, search for threads that are blocked or in a wait state. By ascertaining why certain classes statistically show up blocked, you can then proceed to remove that reason and thus remove the bottleneck. The next section discusses some common bottleneck problems.
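Once the stacks have been extracted from the dump, spotting "threads all waiting in the same method" is a counting exercise. The sketch below assumes the dump has already been parsed into a mapping of thread name to stack frames, top frame first; the parsing step itself depends on the JVM vendor's dump format and is not shown. The thread names and class names are made up for illustration.

```python
from collections import Counter


def rank_top_frames(thread_stacks):
    """Rank the methods that appear most often as a thread's top stack frame.

    thread_stacks: dict mapping thread name to a list of frames, top first.
    Returns a list of (frame, thread_count), most common first.
    """
    counts = Counter(frames[0] for frames in thread_stacks.values() if frames)
    return counts.most_common()


# Hypothetical, already-parsed stacks from a portal at saturation.
stacks = {
    "WebContainer : 0": ["com.example.Catalog.lookup", "com.example.CatalogPortlet.doView"],
    "WebContainer : 1": ["com.example.Catalog.lookup", "com.example.CatalogPortlet.doView"],
    "WebContainer : 2": ["java.net.SocketInputStream.read", "com.example.Dao.query"],
    "WebContainer : 3": ["com.example.Catalog.lookup", "com.example.SearchPortlet.doView"],
}
for frame, count in rank_top_frames(stacks):
    print(count, frame)
```

A method that dominates this ranking across several dumps taken a few seconds apart is a strong candidate for the next bottleneck to investigate.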
JVM heap utilization
Apply the JVM tuning recommendations as outlined in the Portal Tuning Guide.
Enable verboseGC
Leave verboseGC enabled, even during production. The amount of log data is not large; however, it is invaluable in terms of the visibility that this log brings to heap utilization problems.
If the size of the native_stderr.log file becomes a concern, configure log rolling...
-Xverbosegclog:${SERVER_LOG_ROOT}/verboseGC#.log,5,10000
To have Java object allocations greater than 1 MB recorded in native_stderr.log, go to...
Servers | Application Servers | WebSphere_Portal | Java and process management | Process definition | Java Virtual Machine | Custom properties
...and set...
ALLOCATION_THRESHOLD = 1000000
Set -Xloratio0.1 to reserve a larger area of the heap for large objects than the default provides. If you are experiencing Out of Memory errors that coincide with large object allocations while the verboseGC log shows plenty of heap available, heap fragmentation due to large object allocations is the likely culprit.
If the verboseGC log indicates a large number of mark stack overflows (MSOs), performance under load likely suffers. The use of -Xgcthreads to override the default provides additional mark stack space, which provides relief from MSOs.
Log
Logging through direct writes to SystemOut.log or through a logging class such as log4j causes serialization between running threads and significantly degrades portal performance. In production portal systems, log only what is absolutely needed. When using log4j, log only errors; do not log warnings or informational messages. If logging is required for audit purposes, consider using a portal service or a different service running in a separate JVM.
Turn off all logging and remove all debug code that writes to files before doing performance testing.
Java class and variable synchronization
Method-level synchronization can be problematic. The symptom in a thread dump is a set of threads in a monitor wait (MW) state on a method while one thread holds the lock. In this case, you have Java code that is synchronized and is causing serialization in the system.
Use of synchronized class variables or synchronized HashMaps can also cause this problem.
In both cases (method or variable synchronization), the problem can be exacerbated by arbitrarily increasing the number of WAS transport threads in which the portal runs. By increasing the number of threads, you increase the probability of hitting portal code that is synchronized in this fashion, which ultimately serializes all the threads.
Database contention
If the thread dump indicates numerous threads waiting in JDBC classes in Socket.read() methods, then there are likely response time issues in the database itself.
At initial database transfer time, portal sets up the databases with indexes that should be good initial starting points. It is imperative, though, that an excellent DBA monitors the database to ensure efficient operations. As a result of this monitoring, the DBA might need to effect changes on the DB to remove bottlenecks in the system.
Some common problems and resolutions that have been seen include the following:
- Queries taking excessive time due to table scans
- Insufficient processor and memory resources on the DB server itself
- Insufficient allowed connections relative to the configured JDBC pool sizes on Portal and Lotus WCM
DBAs should, especially when thread dumps indicate excessive JDBC wait times, take snapshots for long queries. Generally, Portal and Lotus WCM queries all execute in subseconds, if not in milliseconds. Look at the execution plans for long-running queries, and see if additional indexes might be required to improve response times on problematic queries.
When threads are waiting on JDBC pool resources in WAS, you see the threads in a condition wait (CW) state in the WAS connection pool (J2C) classes. In this case, you might need to increase the pool size for this data source. Note that in doing so, you might need to increase the number of connections that the database server can handle concurrently.
LDAP responsiveness
If several threads are in the Socket.read() method of the Java Naming and Directory Interface (JNDI) classes, they are likely waiting on results from the LDAP directory.
Excessive session sizes
If custom portlets are storing too much data in the session, that condition invariably leads to memory and performance issues.
Exceptions being thrown
Even though this problem might seem obvious, in many situations performance analysis and bottleneck reduction are attempted in systems that are repeatedly throwing exceptions in the logs. When the JVM is handling unchecked exceptions, it slows the JVM down and causes serial I/O (printing) to the SystemOut.log print stream, which serializes the WAS transport threads.
A more general issue involves trying to characterize and tune a system that is inherently flawed. All results that are generated in such an environment must be labeled as non-repeatable and subject to change (potentially in a significant way) as the flaws are eliminated.
Finally, it should be your policy that the WebSphere Portal system is not allowed to enter a high-load production environment with any errors in the logs.
Dynacache concerns DRS replication modes
WebSphere Portal requires that the Dynamic Cache Service be enabled. The dynamic cache (dynacache) is a data structure that is used to provide caching of data from back-end services (for example, database results) in WebSphere Portal. Dynacaches can ensure cache synchronization across a cluster of WebSphere Portal members. For proper operation in a cluster, WebSphere Portal requires that cache replication be enabled. The default mode of replication, PUSH, can cause performance problems, though, in the WebSphere Portal environment.
For WebSphere Portal V6.0.1.5 and V6.1, change the default for all Portal and Lotus WCM dynacaches to NOT SHARED instead of PUSH.
For WebSphere Portal V6.0.1.4 and earlier...
- Set the replication mode to NOT SHARED using the WAS console for each cluster member
- Install Portal PK64925
- Install WMM PK62457 and add the parameter cachesSharingPolicy with a value of NOT_SHARED to the LDAP section of the wmm.xml files on each node.
Lotus WCM dynacaches also should be set to NOT SHARED. To complete this task, in the Deployment Manager console, navigate to...
Resources | Cache Instances | Object Cache Instances
...and change each of the individual cache instances to a mode of NOT SHARED. As of the time of this writing, there are 11 instances for Lotus WCM.
Finally, there are WAS changes that can further, although marginally, reduce the amount of network traffic between cluster members due to replication events. For each cluster member (either Lotus WCM or WebSphere Portal), navigate to...
Servers | Application Servers | WebSphere_Portal | Java and process management | Process definition | Java Virtual Machine | Custom properties
...then click New to define the following properties...
com.ibm.ws.cache.CacheConfig.filterLRUInvalidation=true
com.ibm.ws.cache.CacheConfig.filterTimeOutInvalidation=true
com.ibm.ws.cache.CacheConfig.cacheInvalidateEntryWindow=2
com.ibm.ws.cache.CacheConfig.cacheEntryWindow=2
Dynacache eviction concerns
Since WebSphere Portal version 5.1.0.2, the size of the WebSphere Portal dynacaches has been increased to a default that is appropriate for most WebSphere Portal applications. There are situations, though, in which these defaults are inadequate and can cause significant performance problems.
For example, if a portal has a large number of derived pages with a common parent, the portal access control (PAC) caches can be small enough to cause cache thrashing. Similarly, if the portal objectID cache is too small, thrashing occurs.
To monitor dynacaches, and other caches, install...
If one or more of the caches seem to have large amounts of least recently used (LRU) evictions, the size of that cache might need to be increased. The sizes of the WebSphere Portal caches are mostly located in the Resource Environment Provider named WP_CacheManagerService. The size of Lotus WCM dynacaches is controlled from the Deployment Manager console in the Object Caches section.
After installing, if you get authorization errors, enable a user in...
Enterprise Applications > Dynamic Cache Monitor > Security role to user/group mapping
Operating system concerns
Although tools like techline can size the host environment required for enterprise deployments of Portal and Lotus WCM, problems can arise even on adequately sized hosts.
Under no circumstances should memory paging occur on an operating system hosting Portal or Lotus WCM. If it does, you must take action to alleviate the situation. Performance degrades immediately and dramatically in the presence of paging.
Enable large page support on AIX and set the JVM property -Xlp to dramatically improve memory utilization.
On AIX, consider setting the memory management option lru_file_repage to 0 to ensure that computational memory is prioritized over file I/O buffers. This setting ensures that in situations where physical memory becomes limited, AIX will not swap out the Java processes in favor of file I/O buffers.
Synchronized class variables
Excessive database calls. Consider using DB caching layers or dynacache to reduce the load on application databases or back-end services.
Unsynchronized use of HashMaps. There are timing scenarios in which these classes get into infinite loops if separate threads hit the same HashMap without being synchronized.
Capacity planning
The goal of capacity planning is to estimate, prior to entering production, the total number of WebSphere Portal JVMs required to satisfy a certain user population within predetermined SLA metrics.
Typical metrics include:
- Portal login response time (typically around four seconds)
- Page-to-page response times after being already logged in (typically around two seconds)
The process
The process for running the load test looks very much like the one for running the test for bottleneck analysis except that there is now a second criterion for stopping the test. One criterion is saturation, as previously defined. The second criterion is failure of any of the SLA metrics.
If the test reaches saturation before any of the SLA metrics are exceeded and if it has already been determined that there are no bottlenecks that can or will be excised, then you can immediately calculate the number of nodes required.
If the SLA metrics are exceeded before reaching saturation, then analyze the failure to determine the next course of action. If you determine that you do not need to resolve the response time issues, then proceed directly to calculating the number of nodes, as discussed in the next section of this article.
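The two stopping criteria can be checked mechanically against the per-step readings. The sketch below is illustrative only: the function name and the tuple layout are hypothetical, the default thresholds reuse the typical figures of four seconds for login and two seconds for page-to-page response, and saturation is approximated as the first step at which page views fail to increase over the previous step.

```python
def stop_reason(steps, login_sla=4.0, page_sla=2.0):
    """Return (reason, virtual_users) for the first step that ends the test.

    steps: list of (virtual_users, page_views_per_second, login_seconds,
    page_seconds) tuples in ramp order, one stabilized reading per step.
    """
    previous_page_views = None
    for users, page_views, login_seconds, page_seconds in steps:
        if login_seconds > login_sla or page_seconds > page_sla:
            return ("sla_exceeded", users)
        if previous_page_views is not None and page_views <= previous_page_views:
            return ("saturated", users)
        previous_page_views = page_views
    return ("not_reached", None)


# Hypothetical run in which login response time breaks the SLA first.
run = [
    (100, 12.0, 1.8, 0.6),
    (200, 23.5, 2.9, 0.9),
    (300, 33.0, 4.3, 1.4),
    (400, 34.0, 6.0, 2.5),
]
print(stop_reason(run))  # ('sla_exceeded', 300)
```

The last step that passed both checks (200 virtual users in this made-up run) is the per-node figure to carry into the node calculation.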
Extrapolating results
In general, if a single WebSphere Portal node can sustain n users within given SLA metrics, then 2 nodes can sustain 1.95 * n users. The accepted horizontal scaling factor for a portal is .95. Thus, if a single WebSphere Portal node can sustain n users within given SLA metrics, then m nodes can sustain:
n * (1 + .95 + .95^2 + .95^3 + ... + .95^(m-1))
Thus, horizontal scaling is slightly less than linear.
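As a worked sketch of this extrapolation, the functions below sum the geometric series so that one node sustains n users and two nodes sustain 1.95 * n, matching the statement above. The function names, the search limit, and the example numbers are hypothetical. Note that with a factor of .95 the series can never exceed 20 * n, so targets beyond that cannot be met by adding nodes under this model.

```python
def cluster_capacity(single_node_users, nodes, scaling_factor=0.95):
    """Users sustained by `nodes` nodes: n * (1 + f + f^2 + ... + f^(nodes-1))."""
    return single_node_users * sum(scaling_factor ** k for k in range(nodes))


def nodes_required(target_users, single_node_users, scaling_factor=0.95, max_nodes=200):
    """Smallest node count whose extrapolated capacity meets target_users."""
    for nodes in range(1, max_nodes + 1):
        if cluster_capacity(single_node_users, nodes, scaling_factor) >= target_users:
            return nodes
    raise ValueError("target cannot be reached within max_nodes under this model")


# Hypothetical figures: one node sustains 1000 users, the target is 5000 users.
for m in range(1, 7):
    print(m, round(cluster_capacity(1000, m), 1))
print(nodes_required(5000, 1000))  # 6
```

In this example, linear scaling would suggest five nodes, but the scaling factor pushes the estimate to six.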
This scaling factor assumes that the database capacity does not bottleneck the system. In fact, this scaling factor is primarily a metric of the degeneration of the WebSphere Portal database for logging in users.
Vertical cloning (scaling) is somewhat different. Vertical cloning is indicated when a single JVM saturates a node at a processor utilization around 80 percent or less. Note that in most cases, bottleneck analysis usually provides relief. In the absence of Java heap issues, a single JVM can usually be tuned to saturate a node at 85 to 90 percent processor utilization.
Vertical scaling is discussed more fully later in this article.
Testing with the full cluster
If sufficient load generation capacity exists (including test IDs), it is wise to do a final series of tests in which the whole user community is simulated against the full cluster to ensure viability of the entire system.
Failover testing
If there is a system requirement for full performance during a failover, this scenario should also be scripted and tested.
Before running this scenario, review the plugin-cfg.xml file at the HTTP server to ensure that the cluster definitions are correct. Consider adding the ServerIOTimeout parameter to the cluster members. This parameter augments the ConnectTimeout parameter. ConnectTimeout is the amount of time the plug-in waits for the remote server to open a socket connection before marking the cluster member as down. The parameter is normally present in the plugin-cfg.xml file and defaults to 0, which means that the plug-in relies on the operating system to return the timeout status instead of explicitly timing the connection itself.
The ServerIOTimeout parameter is not included in plugin-cfg.xml by default. It sets a timeout on the actual HTTP requests: if the portal does not answer in the allotted time, the server is marked down. This is useful because in certain classes of failure the WebSphere Portal cluster member accepts a socket open request, but the JVM has hung and does not respond to HTTP requests. Without ServerIOTimeout, the plug-in does not mark such a cluster member as down even though it cannot handle requests, so requests continue to be routed to a hung server.
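For illustration, a cluster definition with both timeouts set might look like the following fragment. The cluster name, server names, host names, ports, and timeout values (in seconds) are all hypothetical. The connection timeout is written here as ConnectTimeout, the attribute name that generated plugin-cfg.xml files typically use; check the file and the documentation for your plug-in version before relying on any of these values.

```xml
<!-- Hypothetical fragment; names, hosts, ports, and values are examples only -->
<ServerCluster Name="PortalCluster" LoadBalance="Round Robin" RetryInterval="60">
    <!-- ConnectTimeout: seconds to wait for the socket to open (0 = leave it to the OS) -->
    <!-- ServerIOTimeout: seconds to wait for a response to an HTTP request -->
    <Server Name="node1_WebSphere_Portal" ConnectTimeout="10" ServerIOTimeout="60">
        <Transport Hostname="node1.example.com" Port="9081" Protocol="http"/>
    </Server>
    <Server Name="node2_WebSphere_Portal" ConnectTimeout="10" ServerIOTimeout="60">
        <Transport Hostname="node2.example.com" Port="9081" Protocol="http"/>
    </Server>
</ServerCluster>
```

Choose a ServerIOTimeout value longer than the slowest legitimate portal response. Otherwise, healthy cluster members can be marked down under load.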
During this test, start with the cluster fully operational. Ramp up the virtual users in your simulation to the maximum number that your SLA mandates. Then, stop one or more cluster members. You can do this gracefully, by stopping the cluster members from the deployment manager, or abruptly, by simulating a network failure (for example, by removing the Ethernet cable from a cluster node). Many other failure modes might be worth investigating (for example, database failures, Web service failures, and so on). After the simulated cluster member outage, ensure that the surviving cluster members handle the remaining load according to your system requirements. Then, restart the offline cluster members to ensure that the load returns to a balanced state over time.
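Before running the failover scenario, it can help to estimate whether the surviving cluster members are even expected to carry the SLA load. The following sketch reuses the 0.95 horizontal scaling factor from the extrapolation section. The function names are assumptions made for this example, and the estimate does not replace the test itself.

```python
def cluster_capacity(single_node_users: float, nodes: int,
                     factor: float = 0.95) -> float:
    """Extrapolated users for `nodes` nodes: n * (1 + f + ... + f^(nodes-1))."""
    return single_node_users * sum(factor ** k for k in range(nodes))


def survives_failover(sla_users: float, single_node_users: float,
                      nodes: int, failed: int = 1,
                      factor: float = 0.95) -> bool:
    """True if the members left after `failed` outages can carry sla_users."""
    surviving = nodes - failed
    if surviving < 1:
        return False  # nothing left to serve requests
    return cluster_capacity(single_node_users, surviving, factor) >= sla_users


if __name__ == "__main__":
    # Example: each node sustains 1,000 users and the SLA load is 1,900 users.
    print(survives_failover(1900, 1000, nodes=3))  # two survivors
    print(survives_failover(1900, 1000, nodes=2))  # one survivor
```

If full performance during a failover is a requirement, size the cluster so that this check passes with at least one member removed, and then confirm the result with the scripted test.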
Ongoing capacity planning
If a system is already in production and is meeting its current SLA goals, you also want to plan for future growth in the number of users of the system. Assuming that the applications on the WebSphere Portal do not significantly change, you can derive the necessary measurements and calculations from a running production system. You need proper tooling, though, to take the measurements.
In short, if n JVMs can support x users, then each JVM can support (x/n)^(1/.95) users. Using the formula explained previously, you can then plan for future growth.
Vertical clustering
A common technique for improving performance is to vertically clone the WebSphere Portal JVM on the same physical system. Engineers initially assume that if one JVM is good, two must be better.
The ultimate goal of vertical cloning is to increase the net aggregate throughput in transactions per second of the sum of the cluster members (clones) on a single node. This goal is usually possible only if, when running under the load, a single, well-tuned cluster member does not consume most of the CPU available in that node. In fact, in a well-tuned WebSphere Portal, vertical cloning always carries a cost. Vertical cloning is indicated when the benefits outweigh the costs.
WebSphere Application Server clustering comes in two flavors. The first is the horizontal type. In this arrangement, a functionally equivalent duplicate of an application server is created on another node. This duplication is done with a WebSphere component known as the deployment manager. The resulting set of equivalent nodes is known as a cluster. A front-end HTTP server can then forward a request from a client to any of the cluster members (clones) and get an identical result.
Similarly, you can also create cluster members vertically, which means that multiple JVMs are created on the same node. Each cluster member can serve the same content just as in the horizontal cluster member case.
In the WebSphere Portal case, each cluster member shares the one (and only one) WebSphere Portal database. This statement changes slightly in WebSphere Portal V6, but it is true for V5.x. Therefore, as the number of cluster members increases, the WebSphere Portal database has a higher likelihood of becoming a bottleneck due to the dilution of its capacity.
Costs of vertical clustering
Running additional cluster members on the same physical node has costs. First, there is process context switching. The operating system must now manage additional processes (JVMs).
Second, there is more contention for processor resources. As a rule, vertical clustering is a bad choice if the number of active cluster members exceeds the number of processors in the node less one. You should never have three cluster members on a three-processor node, for example. Two cluster members on a three-processor node might be acceptable under certain conditions.
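The processor rule of thumb can be written down as a small check. The function names are assumptions made for this example, and the floor of one member on a single-processor node is an interpretation, because a node must run at least one JVM.

```python
def max_vertical_members(processors: int) -> int:
    """Upper bound on active cluster members per node: processors less one."""
    if processors < 1:
        raise ValueError("a node needs at least one processor")
    # Assumption: a node always runs at least one cluster member.
    return max(1, processors - 1)


def member_count_acceptable(members: int, processors: int) -> bool:
    """True if the member count does not exceed the rule-of-thumb bound."""
    return 1 <= members <= max_vertical_members(processors)


if __name__ == "__main__":
    print(member_count_acceptable(3, 3))  # three members, three processors
    print(member_count_acceptable(2, 3))  # two members, three processors
```

Passing this check means only that the configuration is not ruled out. Whether it actually helps still depends on the indications described in the next section.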
Indications for vertical clustering
This section describes some of the situations in which vertical cluster members provide value.
Reliability
Apart from performance concerns, having additional cluster members might make sense strictly for reliability reasons. If a WebSphere Portal installation is on a single node, then in the event of a software failure that crashes one JVM (without crashing the operating system), you can mitigate the effect of the crash by adding vertical cluster members. The assumption is that most software failures are localized to a single JVM and do not affect the others on the same node. Therefore, the cluster continues serving requests while the failing JVM is restarted.
Memory utilization
In a 32-bit operating system, process address spaces are limited to 4 gigabytes of memory. Most operating systems split this space as 2 gigabytes of user space and 2 gigabytes of kernel space. There are exceptions whereby the user space can be increased to ~3 gigabytes and the kernel reduced to 1 gigabyte (Solaris, AIX®, and Microsoft® Windows® 2003 Enterprise, for example).
If the address space available to the JVM is 2 gigabytes, then the JVM can allocate approximately a 1.5-gigabyte heap space.
There are cases when the combination of the WebSphere Portal base memory working set, along with the total memory required for all the portlets running during stress, could approach and exhaust the 1.5-gigabyte heap. When this happens, and if there is still a significant amount of processor resource available (20 to 30 percent or more), then vertical cloning could increase the total throughput of the box by effectively creating 3 gigabytes of JVM heap and dividing the workload evenly between the two 1.5-gigabyte heap JVMs.
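The memory indication combines two measurements: how close the heap is to its limit under stress, and how much processor headroom remains. A minimal sketch follows. The 90 percent heap threshold is an assumption chosen for this example. The 20 percent processor headroom comes from the range given above.

```python
def heap_indicates_vertical_cloning(heap_used_gb: float,
                                    cpu_utilization: float,
                                    heap_max_gb: float = 1.5,
                                    heap_threshold: float = 0.90,
                                    min_cpu_headroom: float = 0.20) -> bool:
    """True if the heap is nearly exhausted while processor capacity remains.

    cpu_utilization is a fraction between 0.0 and 1.0 measured under stress.
    heap_threshold (fraction of heap_max_gb) is an assumed cutoff.
    """
    heap_nearly_full = heap_used_gb / heap_max_gb >= heap_threshold
    cpu_headroom = 1.0 - cpu_utilization
    return heap_nearly_full and cpu_headroom >= min_cpu_headroom


if __name__ == "__main__":
    # Example: 1.45 GB of a 1.5 GB heap in use at 70 percent processor utilization.
    print(heap_indicates_vertical_cloning(1.45, 0.70))
```

The heap figure should be taken from verboseGC output after garbage collection, not from a momentary peak, so that a nearly full heap reflects the real working set.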
Java synchronized methods and class variables
If your WebSphere Portal application (and the portal itself) makes heavy use of synchronized methods or class variables, you can end up with a large number of frequently blocked threads in the application server under load. You can identify this situation by taking thread dumps under load and noticing that many Web container threads are sitting in MW state, waiting for these synchronized artifacts.
In this case, reducing the maximum number of Web container threads on a per-cluster-member basis reduces these stalls. If, after that change, the processor is not consumed as described previously, then vertical cloning can increase the aggregate throughput for the whole node.
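Counting the Web container threads in MW state by hand is tedious when several thread dumps are taken under load. The following sketch tallies thread states from the text of a dump. It assumes the IBM javacore layout, in which each thread is introduced by a 3XMTHREADINFO line that contains the thread name in quotation marks and a state: field. It also assumes that Web container threads have names that begin with WebContainer. Both assumptions depend on the JVM and application server version, so adjust the pattern and the prefix to match your own dumps.

```python
import re
from collections import Counter

# Assumed javacore layout:
# 3XMTHREADINFO  "WebContainer : 0" (TID:0x..., ..., state:MW, ...) prio=5
THREAD_LINE = re.compile(
    r'^3XMTHREADINFO\s+"(?P<name>[^"]+)".*?\bstate:(?P<state>[A-Z]+)'
)


def web_container_states(javacore_text: str,
                         name_prefix: str = "WebContainer") -> Counter:
    """Count thread states for threads whose name starts with name_prefix."""
    counts = Counter()
    for line in javacore_text.splitlines():
        match = THREAD_LINE.match(line)
        if match and match.group("name").startswith(name_prefix):
            counts[match.group("state")] += 1
    return counts


def blocked_ratio(counts: Counter) -> float:
    """Fraction of the counted threads that are in MW (monitor wait) state."""
    total = sum(counts.values())
    return counts["MW"] / total if total else 0.0
```

A ratio that stays high across several dumps taken a few seconds apart suggests the synchronization stalls described above. A single dump is only a snapshot and can mislead.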