In this post we are going to tell you everything you ever wanted to know about the JRockit JVM, well quite a lot anyway. First we dive into the JRockit fundamentals, like code generation, memory management and threading. Next we use these fundamentals in the tuning section. We end this post by discussing JRockit Mission Control and its possibilities in tracking down problems.

Fundamentals

Code Generation

As Java is based on a write once/run everywhere concept, which uses bytecode (.class files) as a platform-independent format, we need an efficient way the generate code which can be run by the hardware. In other words the JVM is required to turn bytecode into native code which can be executed by the CPU. Two possible ways to accomplish this are:

  • Bytecode interpretation – Emulate the execution state of the Java program, which includes the emulation of each bytecode instruction as a function of the JVM state.
  • Just-In-Time (JIT) compilation – The bytecode is compiled into native code for a particular platform. In this case methods are compiled just before they need to be executed. Optimizing every method is very expensive and time consuming. Thus, JIT compiler implementations, implement different levels of compiler quality, starting with a quick and dirty version of every method. When a certain method is hot (the number of invocations reaches a certain threshold), the method can be queued for recompilation with more optimization, which of course takes longer.

Note that a native code version runs significantly faster than an interpreted one. The trade-off here is the overhead resulting from bookkeeping and compilation time.

Java is a dynamic language and certain code generation strategies fit better than others, i.e.,

  • Code generation should be done at run-time.
  • All methods must not be treated equal, i.e, there must be distinguished between hot and cold methods, otherwise unnecessary optimizations are spent on cold methods.
  • Bookkeeping must keep up with the adaptive run-time, i.e., old generated native code must be thrown away.

Achieving efficiency in an adaptive run-time environment boils down to minimizing the total execution time, which is equated as code generation time + garbage collect time + execution time. We do not want to spend too much time in code optimizations, we want the JVM to execute Java code in every clock cycle and not use cycles to collect garbage or generate code. Note that a code optimizer can help to reduce garbage collection overhead by generating efficient code with a small memory footprint.

To determine which method is hot and which is not there has to be some kind of profiling involved, i.e., a JVM has to measure in order to determine where code spends its execution time. The more data can be collected the better the decision made is justified. When collecting too much data the overhead becomes noticeable, so the data collected must be just enough to make a justified decision in order to optimize a certain method or not. There are a number of ways to collect data:

  • Invocation counting – is associated with each method and is increased as the method is called. This is done by the byte code interpreter or by adding an add instruction into the native code. Invocation counting may result into a noticeable run-time overhead, because a memory location has to be written to the add method at the start of each method.
  • Thread sampling – periodically examine where the threads are executing and register their instruction pointers. Getting a large amount of data without disrupting anything, requires an internal thread implementation or a custom operating system such as JRockit Virtual Edition.
  • Hardware-based sampling – certain hardware platforms provide hardware instrumentation mechanisms. In this case the hardware architecture allows for collecting large amounts of data without much effort.

In object-oriented languages, method dispatches are compiled as indirect calls to addresses in a dispatch table. A dispatch table exists for every class and contains the possible receivers which depends on the class hierarchy. Static methods are turned into direct calls with a fixed destination, which is faster to execute. As Java allows for methods to be overridden (if the method is not declared final) at any time during execution, the static optimization can not be justly made. When a JIT compiler wants to generate a call to a non-static method it prefers the method to remain unchanged. This is exactly what the JVM gambles on, i.e.,

  • Methods will probably not be overwritten, such that it can be called with a fixed address.
  • A float will probably never be NaN, such that hardware instructions can be used instead of expensive calls to the native floating point libraries.
  • Exceptions will probably not be thrown in a try block, so that the catch block can be marked as cold code.
  • A lock will probably be not too saturated and can start out as a fast spinlock.
  • A lock will probably be taken and released by the same thread, so that the unlock and acquire operations can be treated as no-ops (no operations, i.e., lazy unlocking).

An adaptive environment can correct wrong assumptions (decisions) on run-time, i.e., it can make any assumption as long as this assumption can be corrected at a small cost. JRockit is based on run-time information feedback in relevant areas in order to make the best decisions. Note that an adaptive environment can outperform a static one.

Compiled bytecode is well-defined, i.e, it keeps the operations separated from the operands. Most bytecode operations pop operands from the stack and push results. Native platforms, on which the code must run, on the other hand are not stack machines but rely on registers for storing values. This puts some complexity in the compilation of bytecode into native code. Maybe surprising is how much faster a program runs after an adaptive optimization. The reason for this is that the JVM is very good at detecting patterns which can be optimized. Thus explicit bytecode optimizations are best avoided to say the least. Unfortunately, the JVM is not a magician; it can not turn badly written algorithms into good ones. So we still have to write (very) good code.

The code generator has to perform the following necessary bookkeeping tasks:

  • Object information for garbage collection – A garbage collector needs to keep track which registers and stack frame locations contain Java objects and which were generated by the JIT compiler.
  • Source code and variable information – A JVM must always be able to trace execution points back to a particular line in the source code; this is necessary to support stack traces for debugging purposes.
  • Generated code assumptions – The JVM must remember what assumptions were made when the code was generated. As soon as an assumption is violated an asynchronous code generation request must be send for the affected method.

JRockit is designed to work well out-of-the-box with no additional alterations by using command-line options. This especially applies to code generation and optimization. To gain insight in the code generation and optimization, we can use the -Xverbose:codegen,opt option, which can lead to following output:

[INFO ][codegen][00004] #1 (Normal) jrockit/vm/RNI.transitToJava(J)V
[INFO ][codegen][00004] #1 0.011-0.011 0x2aaab137f000-0x2aaab137f032 0.10 ms (0.00 ms)
[INFO ][codegen][00004] #2 (Normal) jrockit/vm/RNI.transitToJavaFromDbgEvent(J)V
[INFO ][codegen][00004] #2 0.011-0.011 0x2aaab137f040-0x2aaab137f072 0.03 ms (0.00 ms)
[INFO ][codegen][00004] #3 (Normal) jrockit/vm/RNI.debuggerEvent()V
[INFO ][codegen][00004] #3 0.011-0.012 0x2aaab137f080-0x2aaab137f178 0.33 ms 320KB 0 bc/s (0.33 ms 0 bc/s)
[INFO ][codegen][00004] #4 (Normal) jrockit/vm/ExceptionHandler.enterExceptionHandler()Ljava/lang/Throwable;
[INFO ][codegen][00004] #4 0.012-0.012 0x2aaab137f180-0x2aaab137f287 0.33 ms 256KB 0 bc/s (0.65 ms 0 bc/s)
[INFO ][codegen][00004] #5 (Normal) jrockit/vm/ExceptionHandler.gotoHandler()V
[INFO ][codegen][00004] #5 0.012-0.012 0x2aaab137f2a0-0x2aaab137f310 0.03 ms (0.65 ms)
[INFO ][codegen][00004] #6 (Normal) jrockit/vm/RNI.generateVirtualCode(Ljava/lang/Object;JJ)J
[INFO ][codegen][00004] #6 0.012-0.013 0x2aaab137f320-0x2aaab137f434 0.44 ms 256KB 0 bc/s (1.09 ms 0 bc/s)
[INFO ][codegen][00004] #7 (Normal) jrockit/vm/RNI.generateFixedCode(J)J
[INFO ][codegen][00004] #7 0.013-0.013 0x2aaab137f440-0x2aaab137f53d 0.28 ms 256KB 0 bc/s (1.38 ms 0 bc/s)
[INFO ][codegen][00004] #8 (Normal) jrockit/vm/RNI.virtualTrampoline()V
[INFO ][codegen][00004] #8 0.013-0.013 0x2aaab137f560-0x2aaab137f60c 0.02 ms (1.38 ms)
[INFO ][codegen][00004] #9 (Normal) jrockit/vm/RNI.virtualStub()V
[INFO ][codegen][00004] #9 0.013-0.013 0x2aaab137f60c-0x2aaab138000c 0.17 ms 256KB 0 bc/s (1.55 ms 0 bc/s)
[INFO ][codegen][00004] #10 (Normal) jrockit/vm/RNI.fixedTrampoline()V
[INFO ][codegen][00004] #10 0.013-0.013 0x2aaab1380020-0x2aaab13800c4 0.02 ms (1.55 ms)
[INFO ][codegen][00004] #11 (Normal) jrockit/vm/RNI.bridgeCall()V
[INFO ][codegen][00004] #11 0.013-0.013 0x2aaab13800e0-0x2aaab138011c 0.01 ms (1.55 ms)
[INFO ][codegen][00004] #12 (Normal) jrockit/vm/RNI.contextMarker()V
[INFO ][codegen][00004] #12 0.013-0.013 0x2aaab1380120-0x2aaab1380143 0.02 ms (1.55 ms)
[INFO ][codegen][00004] #13 (Normal) jrockit/vm/RNI.callMarker()V
[INFO ][codegen][00004] #13 0.013-0.013 0x2aaab1380160-0x2aaab1380172 0.01 ms (1.55 ms)
[INFO ][codegen][00004] #14 (Normal) jrockit/vm/ExceptionHandler.throwArrayIndexOutOfBoundsException(I)V
[INFO ][codegen][00004] #14 0.013-0.013 0x2aaab1380180-0x2aaab13801e7 0.11 ms 256KB 110092 bc/s (1.66 ms 7246 bc/s)
...
[INFO ][codegen][00004] #1355 (Normal) test/Test.main([Ljava/lang/String;)V
[INFO ][codegen][00004] #1355 0.385-0.385 0x2aaab13dac60-0x2aaab13dac89 0.07 ms 256KB 123288 bc/s (306.57 ms 232932 bc/s)
...
[INFO ][codegen][00056] #1435 (Quick) java/lang/Shutdown.<clinit>()V
[INFO ][codegen][00056] #1435 0.394-0.394 0x2aaab13ae160-0x2aaab13ae256 0.12 ms 256KB 314516 bc/s (323.45 ms 230939 bc/s)
[INFO ][codegen][00056] #1436 (Normal) java/lang/Shutdown$Lock.<init>(Ljava/lang/Shutdown$1;)V
[INFO ][codegen][00056] #1436 0.394-0.394 0x2aaab13dd940-0x2aaab13dd952 0.05 ms 256KB 100000 bc/s (323.50 ms 230918 bc/s)
[INFO ][codegen][00056] #1437 (Normal) java/lang/Shutdown$Lock.<init>()V
[INFO ][codegen][00056] #1437 0.394-0.394 0x2aaab13dd960-0x2aaab13dd96d 0.04 ms 256KB 128205 bc/s (323.54 ms 230906 bc/s)
[INFO ][codegen][00056] #1438 (Normal) java/lang/Shutdown.shutdown()V
[INFO ][codegen][00052] #1434 0.394-0.395 0x2aaab13dd980-0x2aaab13dd99a 0.62 ms 256KB 20900 bc/s (324.17 ms 230503 bc/s)
[INFO ][codegen][00052] #1439 (Normal) java/net/PlainSocketImpl.listen(I)V
[INFO ][codegen][00052] #1439 0.395-0.395 0x2aaab13dd9a0-0x2aaab13dd9f5 0.10 ms 256KB 58252 bc/s (324.27 ms 230448 bc/s)
[INFO ][codegen][00052] #1440 (Normal) java/net/PlainSocketImpl.socketListen(I)V
[INFO ][codegen][00056] #1438 0.394-0.395 0x2aaab13dda00-0x2aaab13ddaab 0.40 ms 256KB 184810 bc/s (324.66 ms 230393 bc/s)
[INFO ][codegen][00056] #1441 (Normal) java/lang/Shutdown.sequence()V
[INFO ][codegen][00056] #1441 0.395-0.395 0x2aaab13ddac0-0x2aaab13ddb8e 0.21 ms 256KB 287736 bc/s (324.88 ms 230430 bc/s)
[INFO ][codegen][00056] #1442 (Normal) java/lang/Shutdown.runHooks()V
[INFO ][codegen][00052] #1440 0.395-0.395 0x2aaab13ddba0-0x2aaab13ddcb3 0.38 ms 256KB 0 bc/s (325.25 ms 230161 bc/s)
[INFO ][codegen][00052] #1443 (Normal) java/net/ServerSocket.accept()Ljava/net/Socket;
[INFO ][codegen][00056] #1442 0.395-0.395 0x2aaab13ddcc0-0x2aaab13ddd97 0.24 ms 256KB 237705 bc/s (325.50 ms 230167 bc/s)

The log lines consist of a start event and an end event. The start event contains the following information:

  • Info tag.
  • Log module identifier – in this case code generator (codegen).
  • Thread ID of the thread generating the code – depending on the system configuration there can be one or more code generator and optimizer threads.
  • Index of the generated method – note that the order of start and corresponding end events of different methods can be arbitrary when multiple threads are used for code generation and optimization.
  • Code generation strategy – in the example above it is too early to obtain run-time information, thus all methods are generated by using the normal or the quick code generator strategy. The quick strategy is applied to methods which are of little importance to the run-time performance.
  • Generated method – identified by the class name, method name and descriptor.

The end event contains the following information:

  • Info tag.
  • Log module identifier – in this case code generator (codegen).
  • Thread ID of the thread generating code.
  • The index of the generated method.
  • Start and end time for the code generation event measured in seconds since JVM start.
  • Address range – this is where the generated code is placed in memory.
  • Code generation time – the number of milliseconds it took for the code generator to turn byte code into machine code for this particular method.
  • Maximum amount of thread local memory used – memory needed to generate the method.
  • Average number of bytecodes per second.
  • Total code generation time.

To turn off optimizations we can use the option -XnoOpt or -XX:DisableOptsAfter=<time>. This makes the programs compile faster but run slower. It can be used when the compile time turns out to be a performance bottleneck (hardly ever the case). It might make sense to change the number of code generation and optimization threads, by using the options -XX:JITThreads=<n> and -XX:OptThreads=<n>. Optimizations are typically memory and CPU intensive, even if the machine has plenty of cores.

Memory Management

The memory management system uses a garbage collector to get rid of stale references, so the user does not have to explicitly remove objects. As with code generation, run-time feedback can be used in a memory management system as well, which leads to a so-called adaptive memory management. For an adaptive memory management system to work, it must correctly utilize run-time feedback, i.e., change the garbage collector strategy when appropriate, re-size the heap when necessary, defragment the heap when it becomes too fragmented and stop the world (halting the executing user threads) when possible. Note that stopping the world is necessary in some stages of the garbage collection process.

Some of the common bottlenecks in software engineering are due to memory leaks, which in a garbage collected world can be detected by the JVM. A garbage collector collects lots of information that can be used for multiple purposes, such as memory leak detection. On the other hand, an automatic memory management system can introduce a degree of non-determinism into a system that requires a constant quality of service in the form of short response times. Also a garbage collector algorithm can break down when a large amount of live data is present on the heap. Of course memory leaks can still appear in a automatic memory system when a program holds on to references that should be reclaimed. So we have to still use some common sense when programming in a language which uses an automatic memory system such as Java.

Allocating objects is about finding an appropriate space on the heap. In order to perform this efficiently the memory management system keeps track of the free space. Note that when the heap becomes too fragmented, after many garbage collection cycles, a too large an object can no longer be allocated and will lead to an out of memory error even though the total amount of free space could easily allocate the object. So next to keeping track of the free space the memory management system needs efficient logic to move objects in order to create larger free spaces on the heap, such that large objects can still be allocated. This process is called compaction (defragmenting the heap). By using object reference information the compaction algorithm moves objects that reference one another next to each other. This assumes that object live times are equal so that whole chunks of referring objects can be reclaimed by the garbage collector, and hereby creating a large enough space for objects to be allocated.

Techniques for automatic memory management have to keep track of live objects. The post Optimizing WebLogic Server Performance – JVM Tuning discusses the techniques used in garbage collection algorithms. The mark and sweep algorithm forms the basis of the JRockit garbage collector. A naive mark and sweep algorithm looks as follows:

Mark:
	add each object in the root set to a queue
		for each object x in the queue
			mark x reachable
			add all object referenced from x to the queue
Sweep:
	for each object x on the heap
		if the object x is not marked, garbage collect it

Note that the computational complexity is a function of both the amount of live data on the heap (mark) and the heap size (sweep). A variant of the naive mark and sweep algorithm is the tri-coloring mark and sweep algorithm. The color associated with an object can be white, gray or black. White objects are eligible for garbage collection. Black objects do not have references to white objects. Gray objects are live objects. The tri-coloring algorithm looks as follows:

Mark:
	all object are white
	color all objects in the root set gray
	while there exist gray objects
		for all gray objects x
			for all white objects y that x references
				color y gray
			if all edges from x lead to another gray object
				color x black
Sweep:
	garbage collect all white objects

So long as black objects never refer to white objects, the garbage collector can continue the marking even though changes to the object graph occur, for example by field assignments. The virtual machine needs to provide information to the memory system in order to identify which positions on the stack contain objects, such that the root set that form the first nodes of the live object graph can be build. Every thread that is executing handles objects, and if a thread is stopped the garbage collector needs to know where the handled objects are. In Java a so-called exact garbage collector can be used that precisely knows where object pointers are. Getting this information to the garbage collector is achieved by generating meta-data along with the code for a particular method.

Most objects can be garbage collected right after they have been allocated, i.e., they are short lived (temporary) objects. A performance improvement can be obtained if the heap is split into multiple parts (generations). New objects are allocated in the young generation of the heap, which is typically much smaller than the old generation. Garbage collection is also performed on a particular generation of the heap, i.e., young collections collect the young generation and promote surviving objects to an older generation. In general, generational garbage collection defaults to using only one young generation, as there are usually objects that are intermediate objects it might be beneficial to have several young generations in the heap that hold gradually aged objects.

A technique that significantly speeds up object allocation is thread local allocation (allocating an object directly on the heap is a synchronized process, i.e., a global heap lock is needed for each allocation – say hello to contention). In JRockit thread local buffers are referred to as Thread Local Areas (TLA). When we need objects in our program an important aspect is from where the object can be retrieved. The underlying CPUs contain data caches. When data is retrieved from memory, it is placed into the caches in order to provide very fast data access. Accessing data from the cache is significantly faster than retrieving it from memory. Usually, a CPU has multiple caches. The first level cache (L1) is the closest to the CPU and thus the fastest, but also the smallest (usually about a few kilobytes). The second level cache (L2) is about a few megabytes and is somewhat slower in accessing data than the first level cache, but still much faster than retrieving it from memory. Intelligently prefetching data into a cache can reduce cache misses and avoid retrieving it from memory. For example, JRockit optimizes thread local areas allocation by heuristically prefetching small chunks of data, which means that following allocations will already have the next chunk of data of the TLA in the cache.

The complexity of a garbage collection algorithm is a function of the live data and not the heap size. Larger heap sizes can thus be utilized as long as the live data is kept in control. An added benefit of using larger heap sizes is the fragmentation which becomes less of an issue. JRockit is the only JVM that supports non-contiguous heaps and is able to utilize memory above and below the operating system kernel. On a 64-bit system the available address space can be tweaked such that there is no need to work around occupied address spaces. Note that pointers on a 64-bit machine are 8-bytes instead of the 4-bytes on a 32-bit machine, such that more bandwidth and more CPU cache space is consumed. Also note that unreferencing a 32-bit pointer is much faster than a 64-bit one. JRockit optimizes the references when running on a 64-bit machine and a heap of less than 4GB is used. In this case it does not make sense to represent objects as 64-bit pointers, as 32-bits are enough (32-bit systems can address up to 4GB of memory). To manually configure compressed references -XXcompressedRefs can be used, for example, -XXcompressedRefs:enable=true,size=32GB, which enables compressed references supporting heaps up to 32GB. Note that it is hardly ever necessary to control this manually as JRockit will enable different flavors of compressed references depending on the maximum heap size.

There are two primary measures of garbage collection performance:

  1. Throughput is the percentage of total time not spent in garbage collection.
  2. Pauses are the times when an application appears unresponsive because garbage collection is occurring.

Thus the trade-off in memory management is between maximizing throughput and maintaining a responsive application (small pause-times). Throughput can be maximized when pause-times up to several seconds are not an issue. In a client/server application on the other hand in which response times are the main critic we can not have pause-times of several seconds. Unfortunately in an imperfect world we can not have both (maximum throughput and low pause-times).

Most systems have as a quality of service a maximum allowed pause-time. JRockit real time gets deterministic pause-times without altering the application, the only thing needed is a pause time target. The price paid for low pause-times is a longer garbage collection time. It is more difficult to garbage collect when the application is running and interrupt the garbage collection. As the quality of service here is pause-time and not throughput, this is not an issue. The issue would be if there was a sudden increase in pause-time.

JRockit provides an easy way to set-up the type of optimization, by using -Xgcprio:<option>. Instead of a fixed garbage collection strategy, it will adapt to the application’s needs. Three options are available:

  • throughput – optimizes the throughput (pause-times are not an issue).
  • pausetime – optimizes the application response time, i.e., pause-times are minimized.
  • deterministic – activates JRockit real time.

Threading

As every object in Java can be a monitor, every object needs to carry synchronization information. This information is stored in a lock word that is stored into the header of each object. Fundamental garbage collection information, such as what garbage collection state a particular object is in must also be available. JRockit stores the garbage collection information in the lock word.

Problems related to threading are often hard to debug and the main denominator here is non-determinism, i.e., concurrency problems depend on timing. JRockit supports dumping stack traces, by using kill -3 <PID>, for all threads in a running application along with lock holder information. By analyzing these thread-dumps it is possible to determine which mutually dependent threads are stuck waiting for a particular resource. Performance issues related to an over use of synchronization can arise. In general, every lock is a potential bottleneck, in that it introduces a section that can only be accessed by one thread at a time. The more threads trying to access the section the more contention will arise. Note that JRockit Mission Control can be used to analyze concurrency problems.

Some threading basics in Java are the following:

  • synchronized keyword – used to define a section that can only be accessed by one thread at a time.
  • java.lang.Thread class – the build-in Java thread abstraction. It contains methods for starting threads and methods for inserting thread code. The most obvious implementation of a java.lang.Thread is for the JVM to use an underlying operating system thread.
  • java.lang.Runnable interface – any class implementing this interface can become a thread. The run method is where the thread code is placed.
  • java.util.concurrent package – contains classes that implement data structures that are useful in concurrent programming. It provides the fundamental building blocks for synchronization, which are optimized for scalability and performance.
  • Semaphores – when a thread tries to acquire a resource and fails it goes to sleep and is awakened when the resource is released. This is what semaphores are for. Each Java object contains the methods wait, notify and notifyAll that enable the implementation of semaphores.
  • volatile keyword – in a multi-threaded environment a write to a memory location is not guaranteed to be seen by all threads simultaneously. By declaring a field volatile ensures that writes to the fields go directly to memory and not to the caches to be written later.

When programming concurrency it is important to think about memory semantics, especially when dealing with an optimizing JIT compiler. When for example fields are not marked as volatile, the JIT compiler may choose as an optimization to load the field’s value from memory only once. In this case the JIT compiler creates its own thread local copy. On CPU architectures, data caches exist. This provides speed-up in data-access and reduces contention on the processor bus. The problem with caching is invalidation especially when two or more systems want to access the same piece of data at the same time. A memory model defines the circumstances under which different CPUs will see the same data or not. Strong memory models automatically make sure that multiple CPUs see the same new data. In strong memory models multiple writes occur almost always in same order as they were placed. In weak memory models there is no guarantee as to when memory should be visible.

Two fundamental lock type implementations are:

  • Thin locks – are used for uncontended locks that are held for a short time. The simplest thin lock implementation is the spinlock. A spinlock waits for its monitor object to be released while burning CPU cycles (while loop). Spinlocks are only optimal when locks are held for a very short time.
  • Fat locks – are used for anything more complex. Fat locks are slower than thin locks to release or acquire, but they have a much better performance in a contended environment. The reason for this is that threads while waiting, sleep instead of spin, and thus burn-up less CPU cycles.

An important optimization in an adaptive environment is the ability to convert thin locks into fat locks and vice verse, depending on load and contention. JRockit uses a spinlock to implement a fat lock when it has just been converted from a thin lock. In an environment with highly contented locks and long waiting periods this can become too slow. In this case it is beneficial to turn this behavior off by using -XX:UseFatSpin=false or make this behavior adaptive by using -XX:UseAdaptiveFatSpin=true.

To examine information related to synchronization behavior we can use -Xverbose:locks. This produces mostly information about lazy unlocking. To control lazy unlocking we can use -XX:UseLazyUnlocking; lazy unlocking can contribute to increased performance as releasing and acquiring the lock is more expensive than doing nothing (no-ops). JRockit can when needed produce profiling information about each lock, which adds typically about 3% overhead. To enable locking instrumentation use -XX:UseLockProfiling=true. This stores information about the lock under which condition it was taken and the number of times it has been in use. To obtain additional information about garbage collector locks we can use -XX:UseNativeLockProfiling=true.

Tuning

Before we can do some tuning we need to have some kind of benchmark. In order to create a benchmark for performance testing we need to know the application’s behavior. In order to accomplish this the application needs to be profiled. Profiling an application will reveal things, such as where the most time is spend, common garbage collection patterns and which locks are contended and which are not. Once enough data has been collected, we are able to isolate a suitable subset of the application for the benchmark. The ideal benchmark is a small self-contained program that emulates the relevant parts of the application.

Deciding on what a benchmark should measure depends on the kind of application. For example, an application that is optimized for throughput needs to perform as many operations in a given time interval. If a fixed response-time is the goal, we need to verify if the service level agreements are met under different kind of loads. Which also brings us to scalability that tells us if the service levels can be maintained if the workload is increased. In some cases the tests tell us that certain parts of the application need to be rewritten, sometimes it is enough to tune the run-time environment by adjusting the JVM parameters.

All the feedback that a run-time collects can be put to use by the JVM for optimization, which JRockit does. In a perfect world no tuning would be necessary in an adaptive environment. But in some cases it is necessary to give the JVM some hints in how it should behave. For example, if the JVM must re-size the heap or not, or if fragmentation is not an issue. Before we can tune anything we have to know the bottlenecks. For example, when network latency or a particular database query is the issue at hand, it does not make sense to tune the JVM. When we indeed need to tune the JVM we can tune for throughput or pause-times.

A good place to start is to adjust the heap size when too many garbage collection cycles are run. When tuning for low pause-times it is good practice to set -Xms and -Xmx to equal values. To pick an appropriate garbage collection algorithm depends on the type of optimization we want to achieve. When throughput is the optimization that is aimed for we can use -Xgcprio:throughput. If on the other hand we have to optimize pause-time we can use -Xgcprio:pausetime or -Xgcprio:deterministic. By setting an additional -XpauseTarget we specify a certain service level agreement. Note that values less than 200ms are only supported by JRockit real time, i.e., the deterministic option.

One of the worst garbage collection bottleneck is compaction, since compaction is not a fully concurrent operation. If any additional information is known about fragmentation and object sizes it is beneficial to tune compaction, by using -XX:compaction. The compaction algorithm divides the heap into equally large parts, by default there are 4096 heap parts. When compaction is an issue it might help to increase this number or if the heap gets too fragmented over time decrease it. To alter the number of heap parts we can use -XXcompaction:heapParts=1024. Note that in the case of pause-time optimization the compaction is dynamically sized in order to meet the provided service level agreement, by setting a -XpauseTarget. When optimizing for pause-times, compaction can be aborted if the world has been stopped for too long. This can be accomplished by setting -XXcompaction:abortable. By setting -XXcompaction:full we are optimizing the throughput, i.e., a full compaction is forced at every garbage collection thus keeping fragmentation at a minimum.

It is always wise to turn off System.gc calls, as a System.gc may or may not do a full garbage collection when invoked. This can be accomplished by using -XX:AllowSystemGC=false.

To set the size of the young generation we can use -Xns. The young generation is normally adaptively re-sized by the JVM. By using -Xns to appropriately set the size of the young generation may be beneficial when the application produces large amounts of temporary objects. When tuning for high throughput it is generally a good idea not use a generational garbage collector.

Each thread allocates objects in a thread local area that is promoted to the heap when full. The size of the thread local area can be controlled by using -XXtlaSize. The JVM allocates larger objects direct on the heap that tend to fill up the thread local area too quickly. Large objects can be a problem and when something is known about common object sizes it may be beneficial to set the thread local area appropriately, for example, -XXtlaSize,min=16k,preferred=256k,wasteLimit=8k a size of 256k is preferred but a size of 16k is acceptable and never waste more than 8k on the thread local area, but allocate the object directly on the heap.

JRockit, just as any other JVM, assumes that it has the machine all to itself and uses a number of parallel garbage collection threads limited by the operating system and physical hardware (typically the garbage collection threads are equal to the number of available cores). To adjust the number of garbage collection threads we can use, for example, -XgcThreads:8, which uses eight garbage collection threads. Setting it too low can raise the issue that the garbage collector cannot keep up with the number of dead objects and needs to do emergency full garbage collections, which leads to increased pause-times.

Examples

To optimize for throughput we can use:

-Xms1024m -Xmx1024m -Xgcprio:throughput

In the case of pause-time optimization we can use:

-Xms1024m -Xmx1024m -Xns256m -Xgcprio:pausetime -XpauseTarget:200ms

When a deterministic pause-time is required we can use:

-Xms1024m -Xmx1024m -Xns256m -Xgcprio:deterministic -XpauseTarget:20ms

When these initial settings are not sufficient, we need to add additional tuning such as object allocation (-XXtlaSize) and/or compaction (-XX:compaction).

Common Bottlenecks

When tuning the application for performance it is all about understanding bottlenecks and how they can be avoided. Here are some of them:

  • Finalizers are unsafe in that they can resurrect objects and interfere with garbage collectors. Avoid finalizers.
  • Too many live java.lang.Reference objects will make the garbage collector run slower as these type of objects need to be treated in a special way. In general, java.lang.Reference objects are more expensive than normal objects to book keep. To obtain information about java.lang.Reference objects we can use -Xverbose:refobj.
  • Pooled objects will cause objects to live longer and thus eventually be promoted to the old generation, which introduces overhead and fragmentation. Apart from this they also interfere with garbage collection workloads and heuristics (large amounts of live data is a garbage collection bottleneck).
  • Choose the right algorithms and data structures, for example, a hash table is much faster when looking up elements than a linked list.
  • Do not use System.gc. There is no guarantee that it will do anything and if it does it will probably do more than we wanted.
  • In the case of thread implementations some kind of context switch has to take place, during which no useful program execution can be performed. The number of switches grows proportionally as the number of scheduled threads grows.
  • Contended locks are an obvious bottleneck, as this indicates that multiple threads want to access the same resource.
  • Handling exceptions take time, to get an insight we can use -Xverbose:exceptions or -Xverbose:exceptions=debug.
  • Large objects have to be allocated directly on the heap and are not placed in the thread local area. Large objects on the heap contribute to fragmentation because they might not fit into a free area, as a result the allocation time increases on a fragmented heap.
  • Too small a heap triggers frequent garbage collections. Too large a heap can lead to longer garbage collection times. It makes sense to do some profiling and determine an optimum maximum heap size.
  • The main contributor to run-time complexity in a memory management system is not the heap size, but the number of live objects.

Command-line utility JRCMD

By using the command-line utility jrcmd, we can send commands to locally running JRockit instances. By using the utility without any additional arguments, we obtain a list of all running JRockit JVMs, for example,

[oracle@edu-wls-rh bin]$ ./jrcmd
9073 weblogic.NodeManager -v
13450 jrockit.tools.jrcmd.JrCmd
9264 weblogic.Server
9140 weblogic.Server

We see that one nodemanager and two WebLogic servers (osb_server1 with PID 9264 and adminserver with PID 9140) are running. The JRockit instance on which we want to execute diagnostic commands is selected by passing the PID with the command that needs to be executed, for example,

[oracle@edu-wls-rh bin]$ ./jrcmd 9264 help
9264:
The following commands are available:
        kill_management_server
        start_management_server
        print_object_summary
        memleakserver
        print_class_summary
        print_codeblocks
        dump_codelayout
        dump_codelist
        dump_codemap
        print_codegenlist
        print_vm_state
        print_utf8pool
        check_flightrecording
        dump_flightrecording
        stop_flightrecording
        start_flightrecording
        print_properties
        hprofdump
        print_threads
        datadump_request
        runsystemgc
        runfinalization
        heap_diagnostics
        oom_diagnostics
        print_exceptions
        version
        timestamp
        command_line
        sanity
        verbosity
        set_filename
        help
        print_memusage
        set_vmflag
        list_vmflags
For more information about a specific command use 'help command'.
Parameters to commands are optional unless otherwise stated.

The heap_diagnostics command provides detailed information about the heap usage. When the command is executed a full garbage collection is triggered during which the information is collected. The output contains three sections:

  • General information – available memory and heap usage.
  • Detailed heap statistics – lists the amount of heap that instances of a particular class occupy, the amount of memory in kilobytes that these instances consume, the total number of instances alive, deviation since the last heap_diagnostics invocation, and the name of the class.
  • Reference object statistics – lists the amount of total instances, how many instances are reachable, how many instances are unreachable and thus eligible for garbage collection, how many reference objects were found unreachable during this garbage collection, how many objects were activated before this garbage collection, how many instances are pointing to null and the instance types.
[oracle@edu-wls-rh bin]$ ./jrcmd 9264 heap_diagnostics > /home/oracle/temp/jrcmdout.txt

9264:
Invoked from diagnosticcommand
======== BEGIN OF HEAPDIAGNOSTIC =========================

Total memory in system: 4141600768 bytes
Available physical memory in system: 405938176 bytes
-Xmx (maximal heap size) is 1073741824 bytes
Heapsize: 1073741824 bytes
Free heap-memory: 952059592 bytes

Memory layout:
00400000-00413000 r-xp 00000000 08:03 1048620                            /home/oracle/aqualogic/jrrt-4.0.1-1.6.0/bin/java
00612000-00613000 rw-p 00012000 08:03 1048620                            /home/oracle/aqualogic/jrrt-4.0.1-1.6.0/bin/java
07a51000-10cc0000 rw-p 07a51000 00:00 0                                  [heap]
40040000-40042000 rwxp 40040000 00:00 0
...
2b19b3d84000-2b19b3d85000 rw-p 0000c000 08:03 1080578                    /home/oracle/aqualogic/jrrt-4.0.1-1.6.0/jre/lib/amd64/libutil.so
7fffaeb8c000-7fffaeba0000 rwxp 7ffffffe6000 00:00 0                      [stack]
7fffaeba0000-7fffaeba5000 rw-p 7fffffffa000 00:00 0
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]

--------- Detailed Heap Statistics: ---------
26.8% 27311k   265153    -56k [C
 8.3% 8498k   362607    -14k java/lang/String
 5.8% 5873k    68152     +5k [Ljava/util/HashMap$Entry;
 4.9% 4989k    54887     +5k [Ljava/lang/Object;
 4.2% 4286k   182902    -13k java/util/HashMap$Entry
 4.1% 4144k    14657    -70k [B
 3.4% 3472k    29633     +1k java/lang/Class
 2.3% 2355k    50242     +1k java/util/HashMap
 1.9% 1934k    61914    +17k java/util/LinkedHashMap$Entry
 1.6% 1617k    51760     +0k java/util/concurrent/ConcurrentHashMap$Segment
 1.5% 1496k    63833     +0k java/util/Hashtable$Entry
 1.3% 1313k    56044     +0k java/util/concurrent/locks/ReentrantLock$NonfairSync
 1.3% 1289k     8656     +0k [I
 1.1% 1074k    51760     +0k [Ljava/util/concurrent/ConcurrentHashMap$HashEntry;
 1.0% 1053k    22470     +0k java/lang/ref/SoftReference
 1.0% 980k    12454     +0k [Ljava/util/Hashtable$Entry;
 0.9% 955k    24463     +0k java/lang/ref/WeakReference
 0.9% 933k    39810     +0k javax/management/ObjectName$Property
 0.9% 923k    16893     +0k java/util/LinkedHashMap
 0.9% 890k    11402    +20k java/lang/reflect/Method
 0.8% 831k    35491     +0k javax/xml/namespace/QName
 0.8% 805k    15288     +6k [Ljava/lang/String;
 0.6% 653k     4425     +3k [J
 0.5% 532k    14167     +0k [Ljavax/management/ObjectName$Property;
 0.5% 506k    21604     +0k java/util/ArrayList
 0.5% 464k    11888     +0k java/util/Hashtable
 0.4% 454k    19409     +0k java/util/LinkedList$Entry
 0.4% 400k    25646     +0k oracle/dms/aggregator/NameTable
 ...
 0.0% 0k        1     +0k sun/reflect/GeneratedConstructorAccessor325
 0.0% 0k        1     +0k sun/reflect/GeneratedConstructorAccessor25
     101949kB total ---

--------- End of Detailed Heap Statistics ---

----- Reference Objects statistics separated per class -----
    Total Reach Act PrevAct Null
    ----- ----- --- ------- ----
----- End of Reference Objects statistics -----
----- Reference Objects statistics separated per class -----
    Total Reach Act PrevAct Null
    ----- ----- --- ------- ----
Soft References:
     1056  1056   0       0    0 Total for all Soft References

 java/lang/ref/SoftReference =>
      694   694   0       0    0 Total
      273   273   0       0    0 => java/lang/reflect/Method
 ...

Weak References:
    30541 30541   0       0    0 Total for all Weak References

 java/lang/ref/WeakReference =>
    23758 23758   0       0    0 Total
    18484 18484   0       0    0 => java/lang/Class
 ...

Phantom References:
     2550    45   0    2505    0 Total for all Phantom References

 com/bea/xbean/store/Locale$Ref =>
     1381     0   0    1381    0 Total
     1381     0   0    1381    0 => com/bea/xbean/store/Cursor
 ...

Cleared Phantom:
       50    49   1       0    0 Total for all Cleared Phantom

 jrockit/vm/ObjectMonitor =>
       50    49   1       0    0 Total
        7     7   0       0    0 => weblogic/work/ExecuteThread
  ...

Finalizers:
     2618  2616   2       0    0 Total for all Finalizers
     1039  1039   0       0    0 => java/util/zip/Inflater
      858   858   0       0    0 => java/util/jar/JarFile
	 ...

Weak Handles:
    62760 62760   0       0    0 Total for all Weak Handles
    28262 28262   0       0    0 => java/lang/String
    21633 21633   0       0    0 => sun/misc/Launcher$AppClassLoader
     6709  6709   0       0    0 => weblogic/utils/classloaders/GenericClassLoader
     2671  2671   0       0    0 => weblogic/utils/classloaders/ChangeAwareClassLoader
     2094  2094   0       0    0 => sun/reflect/DelegatingClassLoader
      608   608   0       0    0 => oracle/jrockit/jfr/events/EventHandlerCreator$EventInfoClassLoader
      592   592   0       0    0 => java/net/URLClassLoader
       97    97   0       0    0 => sun/misc/Launcher$ExtClassLoader
       56    56   0       0    0 => org/apache/openjpa/lib/util/MultiClassLoader
       21    21   0       0    0 => com/bea/common/security/storeservice/util/StoreServiceDelegate$StoreServiceClassLoader
        5     5   0       0    0 => javax/management/remote/rmi/NoCallStackClassLoader
        4     4   0       0    0 => java/lang/ThreadGroup
        4     4   0       0    0 => java/lang/Class
        1     1   0       0    0 => com/octetstring/vde/LDAPServer
        1     1   0       0    0 => bea/jmapi/JVMImpl
        1     1   0       0    0 => java/lang/Thread
        1     1   0       0    0 => com/bea/common/security/utils/SAML2ClassLoader
Soft reachable referents not used for at least 182.657 s cleared.
123 SoftReferences was soft alive but not reachable (when found by the GC),
49 was both soft alive and reachable, and 884 was not soft alive.
----- End of Reference Objects statistics -----

Dark matter: 7567384 bytes

Heap size is not locked

======== END OF HEAPDIAGNOSTIC ===========================

To start the memory leak server we can use

[oracle@edu-wls-rh bin]$ ./jrcmd 9264 memleakserver port=8888
9264:
Memleak started at port 8888.

To stop the memory leak server we can use

[oracle@edu-wls-rh bin]$ ./jrcmd 9264 memleakserver port=8888
9264:

Stopping the server does not produce any output. To generate a heap dump for offline analysis we can use

[oracle@edu-wls-rh bin]$ ./jrcmd 9264 hprofdump filename=/home/oracle/temp/dump.hprof
9264:
Wrote dump to /home/oracle/temp/dump.hprof

The output in hprof format can be analyzed by an appropriate memory analysis tool, for example by using YourKit:

which shows how objects are reachable and

which gives some insight in memory leaks.

By using the command check_flightrecording we can get information on running flight recordings, for example,

[oracle@edu-wls-rh bin]$ ./jrcmd 9264 check_flightrecording
9264:
Recording : id=1 name="WLDFDiagnosticImageRecording_base_domain_osb_server1_" duration=0s dest="/home/oracle/aqualogic/user_projects/domains/base_domain/servers/osb_server1/logs/diagnostic_images/__tmp2162488933745138504.jfr" compress=false (running)

We can also start a flight recording, by using start_flightrecording, for example,

[oracle@edu-wls-rh bin]$ ./jrcmd 9264 start_flightrecording name=test settings=/home/oracle/aqualogic/jrrt-4.0.1-1.6.0/jre/lib/jfr/profile.jfs duration=60s filename=/home/oracle/temp/test.jfr compress=false
9264:
Started recording 5

[oracle@edu-wls-rh bin]$ ./jrcmd 9264 check_flightrecording
9264:
Recording : id=1 name="WLDFDiagnosticImageRecording_base_domain_osb_server1_" duration=0s dest="/home/oracle/aqualogic/user_projects/domains/base_domain/servers/osb_server1/logs/diagnostic_images/__tmp2162488933745138504.jfr" compress=false (running)
Recording : id=5 name="test" duration=60s dest="/home/oracle/temp/test.jfr" compress=false (running)

The resulting file (test.jfr) can be analyzed by using JRockit Mission Control:

Which visualizes garbage collection and corresponding pause-times. Note apart from a peak of 80ms we have a nice clean average of 20ms.

Which gives us insight in which methods the most time is spend.

Which gives us insight in how threads are communicating. By selecting different events we can analyze different aspects. Note also that the range can be adjusted, so that we can zoom in on the interesting parts.

Is this not just amazing, let us quickly turn our attention to this incredible tool and see its possibilities.

JRockit Mission Control

As can noted from the discussion above, JRockit is required to constantly monitor the Java application in order to make the right decisions in the adaptive process. The information that JRockit measures are things, such as, find which methods are spending time executing code, keep track of memory usage and memory allocation etcetera. The data that JRockit collects is a very good source of information for profiling the application or when performing diagnostics.

JRockit Mission Control is a set of tools that enable monitoring, managing and profiling applications that run on JRockit. These tools retrieve profiles that are a good approximation of the actual run-time behavior with very little overhead. The data captured by JRockit Mission Control aims to provide a statistically accurate representation of what is going on in the JRockit JVM. Some use cases of using JRockit Mission Control are:

  • Finding hotspots – which method is being executed the most and would thus be an important candidate to optimize performance for?
  • Tracking latencies – find out why application throughput is worse than expected or why the response time is so long?
  • Memory profiling – what is causing the garbage collections?
  • Heap analysis – what type of data is on the heap and choose the right garbage collector strategy and compaction strategy?
  • Tuning memory – reveal why the garbage collector fails to keep up and act accordingly?

JRockit Mission Control consists of tools such as:

  • Management console – used for monitoring the JVM and the running applications.
  • Flight recorder – a low overhead, always on, recorder that creates a profile of what JRockit has been doing. Note that this is the follow-up to the run-time analyzer.
  • Memory leak detector – visualizes memory usage trends, relationships between class instances, etcetera.

In the tests we are going to use the Oracle Service Bus set-up presented in the post Setting-up a High Available Tuned SOA Environment. To start JRockit Mission Control we can use the jrmc script, located in the <jrockit-home>/bin directory. To enable remote management we have to start the external management agent by using -Xmanagement, for example,

if [ "${SERVER_NAME}" = "AdminServer" ] ; then
# For measurement purposes enable an external agent for JRockit Mission Control
    MEASUREMENT_ARGS="-Djava.rmi.server.hostname=172.31.0.108 -Xmanagement:ssl=false,authenticate=false,port=7091,autodiscovery=true"
    export MEASUREMENT_ARGS
# Extra garbage collection debugging parameters
    # MEASUREMENT_ARGS="${MEASUREMENT_ARGS} -Xverbose:gc,gcpause,memdbg"

# The USER_MEM_ARGS variable is used to override the standard memory arguments passed to java.
    USER_MEM_ARGS="${MEASUREMENT_ARGS} -Xms1024m -Xmx1024m -Xns256m -XXkeepAreaRatio:25 -XXgcThreads:4 -Xgcprio:pausetime -XpauseTarget:200ms"
    export USER_MEM_ARGS
# If we want to use debugging we have to make sure the debug port is different for every server
    DEBUG_PORT="8453"
    export DEBUG_PORT
fi

if [ "${SERVER_NAME}" = "osb_server1" ] ; then
# For measurement purposes enable an external agent for JRockit Mission Control
    MEASUREMENT_ARGS="-Djava.rmi.server.hostname=172.31.0.108 -Xmanagement:ssl=false,authenticate=false,port=7092,autodiscovery=true"
    export MEASUREMENT_ARGS
# Extra garbage collection debugging parameters
    # MEASUREMENT_ARGS="${MEASUREMENT_ARGS} -Xverbose:gc,gcpause,memdbg"

# The USER_MEM_ARGS variable is used to override the standard memory arguments passed to java.
    USER_MEM_ARGS="${MEASUREMENT_ARGS} -Xms1024m -Xmx1024m -Xns256m -XXkeepAreaRatio:25 -XXgcThreads:4 -Xgcprio:pausetime -XpauseTarget:200ms"
    export USER_MEM_ARGS
# If we want to use debugging we have to make sure the debug port is different for every server
    DEBUG_PORT="8454"
    export DEBUG_PORT
fi

Note that the above configuration is added to the setDomainEnv file, located in the <domain-home>/bin directory. By using autodiscovery=true we enable the network auto discovery feature JRockit Discovery Protocol (JDP), which makes it easier for JRockit Mission Control to discover remote JVMs automatically. An overview of the -Xmanagement options can be found here.

To get some real-life data for JRockit Mission Control, we set-up the following test environment:

  • Create a web service
  • Create an OSB configuration, by specifying a message flow for the proxy service and specify some mediation.
  • Create a client for the proxy service, for example, a servlet.
  • Use the Grinder to create some load.

The web service looks as follows:

package model.logic;

import java.util.ArrayList;
import java.util.List;
import javax.jws.WebMethod;
import javax.jws.WebParam;
import javax.jws.WebResult;
import javax.jws.WebService;
import model.entities.Department;
import model.entities.Employee;

@WebService(name = "Company", serviceName = "CompanyService", portName = "CompanyPort", targetNamespace = "http://www.alsb.com")
public class CompanyBean {

	private volatile List<Department> departments;

	@WebMethod(operationName = "addDepartment")
    public void addDepartment(@WebParam(name = "department", targetNamespace = "http://www.alsb.com") Department department) {
        if (departments != null) {
            departments.add(department);
        } else {
            departments = new ArrayList<Department>();
            departments.add(department);
        }
    }

    @WebMethod(operationName = "addEmployeeToDepartment")
    public void addEmployeeToDepartment(@WebParam(name = "employee", targetNamespace = "http://www.alsb.com") Employee employee,
    		@WebParam(name = "department", targetNamespace = "http://www.alsb.com") Department department) {
        Department departmentFromList = departments.get(departments.indexOf(department));
        if (departmentFromList != null) {
            if (departmentFromList.getEmployees() == null) {
                departmentFromList.setEmployees(new ArrayList<Employee>());
            }
            departmentFromList.getEmployees().add(employee);
        }
    }

    @WebMethod(operationName = "removeEmployeeToDepartment")
    public void removeEmployeeFromDepartment(@WebParam(name = "employee", targetNamespace = "http://www.alsb.com") Employee employee,
    		@WebParam(name = "department", targetNamespace = "http://www.alsb.com") Department department) {
        Department departmentFromList = departments.get(departments.indexOf(department));
        if (departmentFromList != null) {
            departmentFromList.getEmployees().remove(employee);
        }
    }

    @WebMethod(operationName = "getDepartments")
    @WebResult(name = "departments", targetNamespace = "http://www.alsb.com")
    public List<Department> getDepartments() {
        return departments;
    }

    @WebMethod(operationName = "getEmployeesFromDepartment")
    @WebResult(name = "employees", targetNamespace = "http://www.alsb.com")
    public List<Employee> getEmployeesFromDepartment(@WebParam(name = "department", targetNamespace = "http://www.alsb.com") Department department) {
        Department departmentFromList = departments.get(departments.indexOf(department));
        return departments.get(departments.indexOf(department)).getEmployees();
    }
}

Note that the web service is stateful, also note that we cannot remove departments; so we have a possible memory leak. To make things thread-safe we have used volatile, so probably when the load is high enough there will be some thread contention.

The OSB configuration is as follows:

  • Message flow:
    • Branch node that defines the web service operations.
    • Each branch contains a route node with corresponding request/response paths.
  • Mediation:
    • The response for the addDepartment operation is adjusted such that the message “Department $departmentnaam toegevoegd” is added to the response.
    • The request for the addEmployeeToDepartment operation is adjusted such that employees for which the name does not start with a capital are not added. When this happens the response will contain an explanatory message.
    • For the getDepartments operation an error will be raised when no departments are present.
    • The request for the getEmployeesFromDepartment is adjusted such that the employee name is converted to uppercase characters.
    • The removeEmployeeFromDepartment logs the fact that an employee has been removed.

To create a client for the proxy service, we first have to create web service artifacts by using the wsimport command-line utility, for example,

[oracle@edu-wls-rh bin]$ ./wsimport -keep -d /home/oracle/temp/LoadTest/src http://172.31.0.108:8011/CompanyOSBProject/CompanyProxyService/?wsdl
parsing WSDL...
generating code...
compiling code...

The test servlet looks as follows:

package test;

import com.alsb.Company;
import com.alsb.CompanyService;
import com.alsb.Department;
import com.alsb.Employee;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.xml.namespace.QName;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

public class TestServlet extends HttpServlet {

    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        try {
			URL url = new URL("http://172.31.0.108:8011/CompanyOSBProject/CompanyProxyService/?wsdl");
			QName name = new QName("http://www.alsb.com", "CompanyService");

			CompanyService companyService = new CompanyService(url, name);

			Department department = createDepartment();
			Employee employee = createEmployee();

			Company port = companyService.getCompanyPort();
			port.addDepartment(department);

            port.addEmployeeToDepartment(employee, department);

			port.getDepartments();

			port.getEmployeesFromDepartment(department);

			port.removeEmployeeToDepartment(employee, department);
		} catch (MalformedURLException e) {
			e.printStackTrace();
		}
    }

	private static Department createDepartment() {
		Department department = new Department();
		department.setCode(80);
		department.setNaam("AquaLogic");
		return department;
	}

	private static Employee createEmployee() {
		Employee employee = new Employee();
		employee.setSofinummer(123456789);
		employee.setNaam("Bea");
		return employee;
	}
}

Deploy the servlet to a different server than the server on which the service bus is running.

To generate some load we can use the Grinder. We are going to set-up the Grinder on a separate machine, for example a Windows client. In order to do this we first create the following scripts:

@REM setGrinderEnv.cmd
set GRINDERPATH=C:\documents\Frameworks\grinder-3.4
set GRINDERPROPERTIES=%GRINDERPATH%\examples\grinder.properties
set CLASSPATH=%GRINDERPATH%\lib\grinder.jar;%CLASSPATH%
set JAVA_HOME=C:\bea\jrockit_160_17_R28.0.0-679
PATH=%JAVA_HOME%\bin;%PATH%
@REM startConsole.cmd
call C:\documents\Frameworks\grinder-3.4\setGrinderEnv.cmd
java -cp %CLASSPATH% net.grinder.Console
@REM startAgent.cmd
call C:\documents\Frameworks\grinder-3.4\setGrinderEnv.cmd
echo %CLASSPATH%
java -cp %CLASSPATH% net.grinder.Grinder %GRINDERPROPERTIES%

The following script (test.py) calls the servlet

from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
from net.grinder.plugin.http import HTTPRequest

test1 = Test(1, "Request resource")
request1 = test1.wrap(HTTPRequest())

class TestRunner:
    def __call__(self):
        result = request1.GET("http://172.31.0.108:7001/LoadTest/test")

The grinder.properties file contains the following entries:

# The file name of the script to run.
grinder.script = test.py

# The number of worker processes each agent should start.
grinder.processes = 1

# The number of worker threads each worker process should start.
grinder.threads = 1

# The number of runs each worker process will perform. When using the console this is set to 0, meaning: run until the console sends a stop signal.
grinder.runs = 0

To start the test first run startConsole and then run startAgent. Use the Grinder console to start and stop the test. By using JRockit Mission Control we are going to diagnose, profile and hopefully draw some useful conclusions on various bottlenecks. The tools we are going to use are the management console, the flight recorder and the memory leak detector.

Management Console

The management console can be accessed by right clicking on one of the detected JVM processes and choose start console from the context menu.

The dashboard contains three dials: used Java heap, JVM CPU usage and live set plus fragmentation. If the CPU usage is high it is possible we need to invest in better hardware or look over the data structures and algorithms used in the application. A good thing to do in this case is to run a flight recording and find out where all the CPU cycles are spend. When a large percentage of the heap is filled with live objects, the garbage collection overhead will increase (garbage collections have to run more frequently). If the live set and fragmentation dial stays on a high level and we have a garbage collection issue, increasing the heap will resolve this issue. If the live set and fragmentation dial increases over time, there is a possibility for a memory leak in the application to be present, which can be analyzed by using the memory leak detector. Note that if no value is shown in the live set and fragmentation dial it means no garbage collection has been run yet.

The two charts below the dashboard respectively show CPU and memory usage. The CPU usage is displayed as a percentage of full CPU saturation and is an average over all available cores. The memory chart displays the heap usage as a percentage of the total heap size. A way to get an insight in the live set and fragmentation is to check how the heap usage changes over time. We can also add the live set and fragmentation to the memory chart by clicking add and open the tree jrockit.synthetic, LiveSetSubscription and select LiveSet.

In the memory chart the occupied heap is steadily growing during the test. This growth is probably due to the increased load. When we stop the test the memory gets constant. At this point we also stop and start the application so that the XML message and processing in the Oracle Service Bus is brought back to normal proportions. After we start the test again the load is increased and as a result we see the sawtooth again until a full garbage collection occurs. Note that a growing occupied heap is not necessarily a memory leak it could also be due to increased load. When we are dealing with a memory leak and nothing is undertaken a very likely out of memory error will be insurmountable.

The run-time section contains tabs that visualize information about the JRockit run-time. The system tab contains some useful information about system statistics. The memory tab contains among others tables containing memory statistics and garbage collector related attributes. Some of the garbage collector attributes can be changed at run-time, such as, the allocated heap size, the garbage collector strategy and the garbage collector heuristic (throughput or pausetime). Note that it is not possible to change the heuristic to deterministic, as a lot of data structures and configurations are created at start-up.

When changing the strategy an implicit heuristic change can occur as well, for example when concurrent is used the heuristic will be pausetime and when parallel is used the heuristic will be throughput. Some remarks are in order:

  • When changing the heuristic to pausetime, necessary data structures for abortable compaction will not be set-up and will thus not be available to the garbage collector.
  • TLA sizes are calculated based upon start-up parameters and will not be recalculated when the strategy or the heuristic is changed.

In the threads environment all available threads are listed in a table along with information related to thread state etcetera. When we check the CPU profiling, the total CPU usage per thread will be shown. By enabling allocation profiling we get information about how much memory has been allocated for each thread. Note that this value shows the total amount of memory allocated by this thread, not the amount currently alive.

The advanced section contains the following:

  • Method profiler – performs exact method profiling, which means that the profiler will report exact invocation counts and total execution times for the selected methods. Note that the overhead for exact method profiling is hard to estimate, but when all hot methods in a system are being profiled the overhead can be large. It is probably better to use the flight recorder, which uses sampling based method profiling. Also because it is hard to determine which method to profile until we have actually done some profiling.
  • Exception counter – counts the number of exceptions that are thrown.
  • Diagnostic commands – are the set of commands that can send to JRockit by using, for example, the command-line utility JRCMD.

Flight Recorder

What we want to know if things go down the drain are usually questions such as:

  • Where are the hot spots within the applications?
  • How is the memory management?
  • Are there any latency related problems?
  • Are there memory leaks present.

A tool such as the flight recorder gathers a lot of information on how JRockit and the running applications behave. So naturally other questions, related to the questions above, arise as well, such as:

  • How to find hot spots in the application?
  • How to interpret memory related information?
  • How to pin-point latency related problems?
  • How to detect memory leaks (or indications thereof)?

The flight recorder can be started and stopped by using either JRCMD (as we already saw in an example presented in the JRCMD section) or by using the JRockit Mission Control GUI. When using the GUI, right-click on the appropriate JVM, for which we want to obtain detailed information, and select start flight recording from the context menu, choose the template that best meet your needs (the profiling normal template provides a general-purpose profiling template that provides a good starting point) and set the time fixed recording to, for example, 5 minutes.

Just like in the management console there are several sections providing different kinds of information.

The general section contains useful information for checking the system health. What to look for depends on the system. A system should be well utilized and not be saturated. A good goal would be to keep the live set and fragmentation to half or less of the maximum heap, which keeps the garbage collection ratio down. For an application with low allocation rates, the occupied heap can be allowed to be higher. An application optimized for throughput, such as batches, would want the CPU to be saturated. In this case the garbage collection pause-time is of a less concern.

The memory section contains memory-related information, such as heap usage and garbage collections. A good thing to keep an eye on is the time spent in garbage collection in relation to the duration of the entire recording, if this is higher than 15-20% it usually means that a significant allocation pressure is on the JVM. In the garbage collection graph tab all the garbage collections that occurred during the recording are presented graphically. It is a good idea to leave out the first and last garbage collections from the analysis, as these usually are full collections with a significant compaction in order to gather extra data. This could break the service level agreement for the pause-time when using the deterministic garbage collector. In an application with a very high pause-to-run ratio the occupied heap would be close the maximum heap size. A good start when this happens is to increase the heap to obtain a performance gain.

The garbage collection advanced tab gives detailed information about the particular collection that occurred during the recording. Such as information about the different pauses that occurred during the collection (stopped phases), the stack trace for the garbage collection (GC request trace) etcetera. Note that quite a large portion of the garbage collection pause is spent handling finalizers, one way to optimize the application is to reduce finalizers.

If we want to reduce the number of garbage collections we need to find what is causing the garbage collection to occur. Thus we have to pin-point where the allocation takes place. Allocation can be viewed on a per thread basis, which can help to determine where to start tuning in order to reduce the stress on the memory system. The heap contents tab contains information about the memory distribution. If the heap is heavily fragmented we can either tune the compaction or change the object allocation. The object statistics tab shows what was on the heap at the beginning and at the end of the recording. If there is a large difference between the two it can either mean that we are dealing with a possible memory leak or that some large operation was performed that required a lot of memory.

The code section contains information about the code generator and method sampler. The overview tab is a good place to start when dealing with applications for which the CPU is the limiting factor (faster CPU means a higher throughput). The tab shows an overview of where the JVM spends most of its time executing code. Two tables are presented one contains the hot packages the other the hot classes.

The hot methods tab contains information provided by the code optimizer. If a lot of samples from a particular method are present and a lot of the traces share the same origin, an optimization can be obtained by either manually optimizing the method or by reducing the amount of calls along the call chain. In the hot methods tab most time (2.19%) is spend in the method weblogic.utils.http.HttpChunkInputStream.read(byte[], int, int) the second one being jrockit.vm.Allocator.allocObjectOrArray(long, long, int, Boolean) (1.48%) which are not alarmingly high. Note that these methods are part of the Java system (application server and JVM), thus we are facing a dilemma in that we cannot optimize these methods. Sometimes a solution is beyond our immediate control.

The exception tab contains detailed information about the exceptions being thrown. Note that we used the profiling normal template and in this case only the total number of exceptions and the number of exceptions per second is being measured. If we want to obtain fine grained exception information we have to use the profiling with exceptions template. The amount of exceptions thrown per second depends both on the hardware and the application, i.e., faster hardware may execute the application faster with the result that more exceptions are thrown. A higher value is always worse than a lower one on identical set-ups. The JVM gambles that exceptions do not occur frequently and needs to adapt if they do. When too many exceptions are thrown we have to investigate why, because performance will suffer.

The code generation tab show information from the JIT compiler and the code optimizer.

The CPU/Threads section contains measured data related to threads, such as locking, contention and latencies. The overview tab shows fundamental thread and CPU usage information. A saturated CPU is usually a good thing, that means we are fully utilizing the hardware.

The hot threads tab shows a table that contains the hot threads. By clicking on a particular thread the hot traces corresponding to the thread are revealed. It also shows the context switch rate. A high amount of context switches per seconds may not be a problem, but a better synchronization behavior may lead to a better throughput.

In the contention tab, blocking information can be viewed. It is subdivided into top blocking locks, top blocked thread (thread blocked, waiting on acquiring monitors) and top blocking threads. It shows per category the count, average duration and the total duration.

The latency tab shows an aggregation per event type (such as Java Blocked) and the aggregated stack traces for the selected event type. The latency tool helps in hunting down latencies introduced by the application. In general, it is not enough to have a guaranteed garbage collection pause-time of a few milliseconds, when the application itself blocks for a significant amount of milliseconds when, for example, waiting for I/O.

The lock profiling tab contains detailed information about locking when enabled. To enable lock profiling we have to run the flight recording by using the profiling with lock template and the JRockit JVM has to be started with the -XX:+UseLockProfiling option.

The events section contains general purpose tabs for visualizing events. Different types of events can be selected from different types of resources. The event types chart on the overview tab, for example, can be adjusted as one sees fit. Note that the chart shows the number of events logged. In the example above we have selected events related to garbage collection and compaction.

The logs tab shows all the selected events in a table.

In the graph tab events are aggregated per thread. In the example above we have enabled the events Java blocked and garbage collection. This can be useful for seeing if a latency event was caused by the garbage collection. Note that the color green relates to threads executing Java code. The graph view is good for providing an overview of the latency and thread behavior of the application.

The histogram tab can be used to build histograms based on selected events a few useful ones are: Java application / object allocation outside TLA – Object class: for finding what object types are allocated directly on the heap, Java application / Java wait – monitor class: for finding where the application spends most time in wait, and Java application / Java blocked – monitor class: for finding which locks are mostly blocking in the application.

One thing worth mentioning is the operative set. The operative set is a collection of attributes in which we are interested. For example, in the histogram presented above we can add java.lang.String to the operative set, by right clicking java.lang.String in the histogram table and choose from the context menu: operative set, add selection. On other event tabs we can now check the show only operative set checkbox to analyze events related to java.lang.String. A very handy usecase for the histogram is to investigate which events are of a particular interest and add those to the operative set so that these can be analyzed in-depth on the other event tabs.

Memory Leak Detector

Whenever allocated memory is no longer in use, it should be returned to the system. If we fail to do so we get a memory leak, which eventually lead to an out of memory error. In Java the garbage collector is responsible for reclaiming allocated memory that is no longer in use. Unfortunately, with automatic memory management we can still run into memory leaks, when references to objects no longer in use are kept alive. In application servers where multiple classloaders are used, this can get quite complex, when for example classes are dependency injected by some framework. This will be noticed when we start redeploying the application over and over again.

Before an application is released, it should be tested for memory leaks. It is an all too common case to restart the JVM because we are running into an out of memory error. To test for memory leaks some tests should be run to measure the live set. Note that if the live increases over a short period of time it does not necessarily mean that we are dealing with a memory leak. It could also be due to an increased load, i.e., the application has to process more requests.

The JRockit memory leak detector collects trend analysis data by piggybacking on the garbage collector (the mark phase of a garbage collector already needs to traverse all live objects on the heap). The resulting graph is the data we need to perform a trend analysis for object allocation. The memory leak detector can be used to track changes over time for object types in the system. It can also track which object types are pointing to other object types and which instances refer to other instances. As already mentioned data is collected during the mark phase of a garbage collection, if for 10 seconds no garbage collection has occurred the tool will trigger one. Finding the cause of a memory leak is not an easy task and usually several tools have to be used as no off-line analysis is possible. Note that hprof dumps can be made and be analyzed off-line by other tools.

To start the memory leak detector, right click on a particular JVM and choose start memleak from the context menu.

The trend table can be used to find candidates for memory leaks. Classes with a high growth rate are colored red; a darker red is used for classes with a higher growth rate. The snapshot depicted above was taken after a considerable amount of time has been elapsed and note that it reveals the model.entities.Department class with a high growth rate. To find out what is pointing to the class at the top (in this case char[]) we add this type to the type graph. Right click on the type and select add to type graph.

The type graph shows how instances of classes point to other classes. By clicking on the + sign we can expand the class node and reveal other types. Note that model.entities.Department is also part of the type graph. The type graph snapshot was taken when the test was just running.

To find the instance that is being misused, right click on a relation of interest and select list referring instances. An instance can be added to the instance graph by right clicking on the instance and select add to instance graph. The snapshot shows an instance of java.util.ArrayList that holds elements of org.apache.xmlbeans.impl.schema.SchemaContainer, which are used by the Oracle Service Bus to process XML.

In order to pin-point what is keeping certain instances alive we can follow the following recipe:

  1. Find one of the leaking instances, by using the trend table.
  2. Find a path to the root referrer from the leaking instance.
  3. Eliminate what is causing the reference to be kept alive.
  4. If there is still a leak, proceed with step one again.

Finding an instance that is unnecessarily being kept alive can be very tricky. One thing to keep in mind is that it is common for collections to be misused and thereby causing memory leaks. If not dealt with, the memory leak will cause the collections to grow larger and larger, sometimes very slowly as in the example above, and eventually lead to an out of memory error.

How beautiful can it get, a wonderful JVM with superb analysis tools.

References

[1] Hirt and Lagergren, “Oracle JRockit: The Definitive Guide”, Packt Publishing, 2010. Read it and Learn!
[2] JRockit Documentation.