In a previous post we talked about setting up a WebLogic cluster environment. In this post we look at how to size the virtual machine. The environment uses VMware vSphere 5.0. Why VMware, you ask? The same quote we used to describe Coherence answers that question: How beautiful can software get? “Beautiful programs work better, cost less, match user needs, have fewer bugs, run faster, are easier to fix, and have a longer life span. Beautiful software is as small as it can be, by using existing computing resources where possible. Beautiful software is simple. Beautiful software is achieved by creating a ‘wonderful whole’…”. vSphere is beautiful software!

That brings us to the other question you probably have, since a previous post discussed cost-effectiveness. When running WebLogic Server in an environment that is virtualized using VMware vSphere, the obvious question to ask is: “What about the licenses?” Two blog entries provide clarification on this matter (support and licensing). The latter states the following: “When running products that are licensed by physical processor on vSphere, customers should ensure the following:

  • Virtual machines are running on hosts fully licensed for Oracle.
  • Virtual machine movement within a cluster is restricted to hosts that are fully licensed for Oracle.
  • Virtual machine movements are tracked so that customers are able to demonstrate compliance with Oracle licensing policies.

Many Oracle products are licensed by physical core or socket, and for these products Oracle does not have a virtual CPU-based licensing mechanism. In a vSphere environment, the consequence of Oracle’s licensing policy is that customers must license all physical cores or sockets in the vSphere host (fully licensed host).”

The number of processor licenses needed is calculated as follows: “Processor: shall be defined as all processors where the Oracle programs are installed and/or running. Programs licensed on a processor basis may be accessed by your internal users (including agents and contractors) and by your third party users. The number of required licenses shall be determined by multiplying the total number of cores of the processor by a core processor licensing factor specified on the Oracle Processor Core Factor Table which can be accessed here. All cores on all multi-core chips for each licensed program are to be aggregated before multiplying by the appropriate core processor licensing factor and all fractions of a number are to be rounded up to the next whole number.” For example, when running on Linux we can use cat /proc/cpuinfo to retrieve the CPU information we need to calculate the number of processor licenses. Say, for example, that we have a total of 16 physical processors (sockets) with model name: Intel(R) Xeon(R) CPU X5687 @ 3.60GHz, and cpu cores: 4. Note that each processor entry in /proc/cpuinfo is a logical CPU, so count the distinct physical id values rather than the processor entries. This means we have 16 * 4 = 64 cores, and looking in the Oracle Processor Core Factor Table we find that the Intel Xeon X5687 has a factor of 0.5, which means we have to get 16 * 4 * 0.5 = 32 processor licenses in this particular example.
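
The licensing arithmetic above can be sketched in a few lines of Python. This is only an illustration of the quoted rule (aggregate all cores, multiply by the core factor, round up); the function name and its parameters are made up for this post, and the core factor must still be looked up in the Oracle Processor Core Factor Table by hand.

```python
import math
from fractions import Fraction


def processor_licenses(processors, cores_per_processor, core_factor):
    """Return the number of processor licenses for one fully licensed host.

    processors          -- physical processors (sockets) in the host
    cores_per_processor -- the "cpu cores" value reported by /proc/cpuinfo
    core_factor         -- factor from the Oracle Processor Core Factor Table,
                           passed as a string such as "0.5" to avoid float rounding
    """
    total_cores = processors * cores_per_processor
    # Aggregate all cores first, multiply by the factor, then round up once.
    return math.ceil(total_cores * Fraction(core_factor))


# The example from the text: 16 processors with 4 cores each and a factor of 0.5.
print(processor_licenses(16, 4, "0.5"))  # 32
# A hypothetical 3-core host, only to show that fractions round up (1.5 becomes 2).
print(processor_licenses(1, 3, "0.5"))   # 2
```

Rounding happens once, after all cores are aggregated, which is how the quoted definition reads.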

Once the host is fully licensed, we are allowed to run an unlimited number of virtual machines and application instances on that host. We can take advantage of advanced features, such as vSphere Distributed Resource Scheduler (DRS) and vSphere HA, to get a high infrastructure utilization. By using these features we can try to consolidate workloads onto fewer physical processors.

To get the right sizing for the virtual machine we need a controlled and repeatable load to be driven against the application in order to collect meaningful performance results. Useful metrics to watch are response-time and throughput. For a user of an interactive application, observed response-time is the primary measure of performance, so we measure the change in response-time as the load is increased. Throughput is the number of operations completed per second. One thing to note is that the configuration with the highest throughput does not necessarily provide the best response-times at a more reasonable load. Some things to keep in mind are:

  • When selecting the total VM resources for a Java deployment, including number of vCPUs and memory size, it is important to provide sufficient resources to keep the CPU utilization of the VM at reasonable levels even during periods of peak load.
  • It is important to understand the scaling of demands placed by the application on the VM infrastructure when choosing between a scale-up (adding more vCPUs) or scale-out (deploy the application on multiple smaller VMs) approach to Java application deployment. In particular, scaling-up beyond a certain point may cause the load to exceed the bandwidth or throughput limits of a VM’s NICs or storage adapters.
  • When virtualizing a Java application onto an ESXi host or cluster supporting other applications, the shared resource effects (resources approaching saturation as load is increased) can impact the performance of a newly virtualized application. For example, the response-times of the support services, such as the database and filestore, are key components of the overall operation response-times. There can also be limits imposed by the effect of the increased load on the hardware configuration. The most basic components impacted by the increase in load are the processor resources (for example, caches, TLBs, and memory controllers) shared by the individual processor cores. A high network load can also make shared NICs a potential bottleneck. This has the following implications:
    • Whenever possible, initial performance testing of a virtualized application should be done on an otherwise unloaded ESXi host. This will eliminate the impact of shared resource effects.
    • When investigating performance issues, it is important to understand the loads on all shared resources, and not only on the VM under investigation.
  • When comparing the performance of native and virtualized Java deployments:
    • Compare performance at multiple loads, including performance at loads that represent expected operating conditions for the application when deployed in production. A comparison of peak-throughput serves to uncover the saturation point of the application/infrastructure combination, but does not provide information about the user experience at more reasonable loads.
    • Always ensure that the underlying infrastructure, including server hardware, provides comparable performance. Comparing an application running natively on a server with a certain number of CPU cores to a VM with a different number of vCPUs will give erroneous results.
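
To make the difference between peak throughput and response-time at a reasonable load concrete, here is a minimal Python sketch of how load test results could be summarized per load level. Everything in it is an assumption for illustration: the function names, the choice of the 95th percentile, the 0.5 second target and the sample numbers are not taken from any VMware or Oracle tool.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an integer 1..100)."""
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceiling of pct * n / 100
    return sorted_values[max(rank, 1) - 1]


def summarize(response_times_s, duration_s):
    """Summarize one load level: throughput (ops/s), mean and p95 response-time (s)."""
    ordered = sorted(response_times_s)
    return {
        "throughput": len(ordered) / duration_s,
        "mean": sum(ordered) / len(ordered),
        "p95": percentile(ordered, 95),
    }


def highest_load_within_target(results_by_load, p95_target_s):
    """Return the highest load level whose p95 response-time meets the target, or None."""
    ok = [load for load, r in results_by_load.items() if r["p95"] <= p95_target_s]
    return max(ok) if ok else None


# Made-up samples: load level (concurrent users) -> (response-times, test duration).
samples = {
    10: ([0.10] * 100, 10.0),
    50: ([0.20] * 380 + [0.45] * 20, 10.0),
    100: ([0.40] * 400 + [1.50] * 100, 10.0),
}
results = {load: summarize(times, dur) for load, (times, dur) in samples.items()}

peak = max(results, key=lambda load: results[load]["throughput"])
print("peak throughput at load", peak)
print("highest load within target", highest_load_within_target(results, 0.5))
```

In this made-up data the highest throughput is reached at 100 users, but the 95th percentile response-time there is far above the target, so the load level to size for is 50 users.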

A best practice is to establish the size of the virtual machine in terms of vCPU, memory and the number of JVMs by conducting (as already mentioned) a performance test that mimics the production workload profile. The resulting virtual machine is our building block, which can then be used for horizontal scaling (scale-out). To create the building block we can use the following:

  • VM sizing and VM-to-JVM ratio through a performance load test – Establish a workload profile and conduct a load test to measure how many JVMs can be stacked on a particularly sized virtual machine. In this test, establish a best case scenario of how many concurrent transactions can be pushed through a configuration before it can be safely deemed a good candidate for scaling horizontally in an application cluster.
  • VM vCPU over-commit – For virtual machines running performance-critical enterprise Java applications in production, make sure that the total number of vCPUs assigned to all of the virtual machines does not cause greater than 80% CPU utilization on the ESXi host.
  • VM vCPU (do not oversubscribe to CPU cycles that are not needed) – For example, if the performance load test determines that 2 vCPUs are adequate up to 70% CPU utilization, but 4 vCPUs are allocated to the virtual machine instead, then potentially 2 vCPUs sit idle, which is not optimal. If the exact workload is not known, size the virtual machine with a smaller number of vCPUs initially and increase the number later if necessary.
  • VM memory sizing – Whether using Windows or Linux as the guest OS, refer to its technical specification for memory requirements. It is common to see the guest OS allocated about 0.5GB to 1GB in addition to the JVM memory size. However, each installation may have additional processes running on it, for example, monitoring agents, which must also be accommodated. Virtual machine memory can be summarized as: VM Memory (needed) = guest OS memory + JVM Memory. Here, the JVM Memory = JVM Max Heap (-Xmx value) + Perm Gen (-XX:MaxPermSize in the case of HotSpot) + NumberOfConcurrentThreads * (-Xss). It is recommended that the memory is not over-committed, because the JVM memory is an active space where objects are constantly being created and garbage collected. Such an active memory space requires its memory to be available all the time. If memory is over-committed, memory ballooning or swapping may occur and impede performance. An ESXi host employs two distinct techniques for dynamically expanding or contracting the amount of memory allocated to virtual machines. The first technique uses a memory balloon driver, which is loaded from the VMware Tools package into the guest operating system running in a virtual machine. The second technique involves paging from a virtual machine to a server swap file without any involvement by the guest operating system. For this, a corresponding swap file is created and placed in the same location as the virtual machine configuration file (VMX file). The virtual machine can power on only when the swap file is available. ESXi hosts use swapping to forcibly reclaim memory from a virtual machine when no balloon driver is available. The balloon driver might be unavailable either because VMware Tools is not installed or because the driver is disabled or not running. For optimal performance, ESXi uses the balloon approach whenever possible. However, swapping is used when the driver is temporarily unable to reclaim memory quickly enough to satisfy current system demands. Because the memory is being swapped out to disk, there is a significant performance penalty when the swapping technique is used. Therefore, it is recommended that the balloon driver is always enabled, but monitored to verify that it is not being invoked, because invocation means that memory is over-committed. Both ballooning and swapping should be prevented for Java applications.
  • Set memory reservation for virtual machine memory needs – JVMs running on virtual machines have an active heap that must always be present in physical memory. Set the reservation equal to the needed virtual machine memory: Reservation Memory = VM Memory = guest OS Memory + JVM Memory. If, for example, we have a 4GB heap, then it is likely that the JVM Memory is approximately 4.5GB, with another 0.5GB needed for the guest OS. Therefore, the total virtual machine memory needed is 5GB, so a memory reservation of 5GB must be configured for the virtual machine.
  • Use large memory pages – Large memory pages help performance by optimizing the use of the translation look-aside buffer (TLB), where virtual to physical address translations are performed. When sizing memory for large pages to be consumed by the JVM, leave a certain amount of small memory pages for other processes that cannot use large pages.
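
The memory formulas from the list above can be written down as a small Python sketch. The function names are made up for this post, and the Perm Gen size, thread count and stack size in the example are assumptions chosen so that the result matches the 4GB heap example (roughly 4.5GB of JVM memory, 5GB for the virtual machine); measure your own values in a load test.

```python
def jvm_memory_mb(max_heap_mb, perm_gen_mb, concurrent_threads, thread_stack_kb):
    """JVM Memory = max heap (-Xmx) + Perm Gen (-XX:MaxPermSize) + threads * stack (-Xss)."""
    thread_stacks_mb = concurrent_threads * thread_stack_kb / 1024
    return max_heap_mb + perm_gen_mb + thread_stacks_mb


def vm_memory_mb(guest_os_mb, jvm_mb):
    """VM Memory (needed) = guest OS memory + JVM Memory."""
    return guest_os_mb + jvm_mb


def memory_reservation_mb(guest_os_mb, jvm_mb):
    """The reservation is set equal to the needed virtual machine memory."""
    return vm_memory_mb(guest_os_mb, jvm_mb)


# Assumed values: -Xmx4096m, -XX:MaxPermSize=256m, 256 threads with -Xss1024k.
jvm_mb = jvm_memory_mb(max_heap_mb=4096, perm_gen_mb=256,
                       concurrent_threads=256, thread_stack_kb=1024)
reservation_mb = memory_reservation_mb(guest_os_mb=512, jvm_mb=jvm_mb)

print(jvm_mb / 1024)          # 4.5 (GB of JVM memory)
print(reservation_mb / 1024)  # 5.0 (GB to reserve for the virtual machine)
```

Extra processes in the guest, such as monitoring agents, are not modeled here; add their memory to the guest OS figure.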

Information on tuning the JVM can be found here. The posts Tune the JVM that runs Coherence and Building a Coherence Cluster with Multiple Application Servers contain detailed steps on how to set up large pages. One important thing to note is that when the JVM heap is increased, we most likely have to increase the number of vCPUs as well, in order to maintain good garbage collection performance.
