Sunday, December 2, 2012

JVM Tuning: UseParallelOldGC

The preceding posts prepared us to look at the JVM option -XX:+UseParallelOldGC. This option was introduced in Java 5 update 6 and became the default for the throughput collector in Java 7 update 4. We do not specify a garbage collector, so we get the default for Java 6 update 14, which is throughput. (The main alternative is CMS.) The default for our version also includes UseParallelGC, which means the young generation is collected by multiple parallel threads, but the old generation is collected by a single thread unless UseParallelOldGC is specified on the command line at startup. Marcus immediately recommended we specify -XX:+UseParallelOldGC, and all the available literature we could find, in print and online, agreed. In our own cautious way, we still tested this in our performance lab before putting it into production. Lab tests showed healthy improvement in the key metrics - 24% improvement in Full GC overhead, 29% improvement in average Full GC pause time, and 16% improvement in maximum Full GC pause time. Then, proceeding in our still cautious way, we tried it on one of our ten nodes in production.
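
For context, turning the collector on is just a matter of adding the flag to the JVM startup options. A minimal illustration (the heap size, GC-logging flags, and application jar here are placeholders, not our actual production command line):

java -Xms12288m -Xmx12288m -XX:+UseParallelOldGC \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log \
     -jar myapp.jar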

Surprise! In production, UseParallelOldGC caused our overhead, our average Full GC pause time, and our maximum Full GC pause time to approximately double! All that caution seems to be worth something. We weren't able to correctly predict what the impact would be in production, but we limited the impact by trying it first on one node, and we had given ourselves the background to learn from the experience.

First of all, why did production behave differently from the performance lab? The lab differs from production in some pretty big ways: it runs a 4GB heap, and our LoadRunner license (already too expensive) does not generate a production-sized load. The lab may be useful for gauging the relative effect of changes in our own code, but it failed to deliver value here.

Second, why did universally recommended practice backfire in our environment? I did find that we are not the only ones ever to experience this sort of surprise. Tony Chiu looked for answers at https://community.jboss.org/message/221725. He didn't get any, but the comments tentatively suggest too many parallel threads contending over too small an old generation. So we tried specifying -XX:ParallelGCThreads=2 (the default on our server is 4) to reduce contention, but the results were even worse. We wound up reverting to the default, which is serial (single-threaded) collection of the old gen.
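
For the record, that experiment amounted to adding just these two flags to the startup options (illustrative; everything else was left unchanged):

-XX:+UseParallelOldGC -XX:ParallelGCThreads=2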

This is purely hypothetical at this point, but I suspect UseParallelOldGC needs more headroom to operate, and our heap is too lean and hungry (recall the discussion of live data size vs. heap size in the previous post). Organizational realities (politics) prevent us from increasing the heap size at this time, but I hope that upcoming changes to our StandardQueryCache will significantly reduce our live data size, and then it may be worth trying UseParallelOldGC again.

JVM Tuning - Live Data Size

Another critical concept to understand is live data size. Charlie Hunt and Binu John, in Java Performance, define this as the occupancy of the old generation and the permanent generation after a full collection, when the application is running in steady state. Our application doesn't have a steady state, but it does have predictable peak usage hours every day. Here's a snippet from one of our production GC logs during peak hours:

[Full GC
[PSYoungGen: 32045K->0K(4118912K)]
[PSOldGen: 6684267K->4483995K(8388608K)] 6716313K->4483995K(12507520K)
[PSPermGen: 716799K->716799K(716800K)], 10.6217130 secs
]


We can see from the figures above that our old gen occupancy after a full collection is about 4.3GB (4483995K) and our perm gen occupancy is about 700MB (716799K of a 716800K capacity - in other words, the perm gen is full). It's also interesting to note that we see about the same live data size in a 12GB heap and in an 8GB heap. Our third on-site Red Hat consultant (we'll call him Boris), who has seen a lot of different operations and seems to know his stuff well, was surprised that our live data size could be so big.

Why so big? Well, it turns out our Hibernate second-level cache is 3.9GB - 90% of our live data size - and our StandardQueryCache alone is 3.4GB. So we have a big investment there. Is it beneficial? The best way to know is to test - preferably in a performance lab, not directly in production. However, a meaningful performance lab is not cheap and requires real organizational commitment (money and time). We have, in any case, figured out that some of our StandardQueryCache usage is counterproductive. SQC invalidation is very blunt: if any change is made to any row of a table, then every cached query result for that table is invalidated. So SQC should only be used for read-only or read-mostly tables. We'll be cleaning that up.
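
To illustrate the direction of that cleanup: the query cache is enabled globally and then opted into per query, so it is straightforward to restrict it to read-only or read-mostly tables. Here is a minimal sketch using the Hibernate annotations; the Country entity and the query are hypothetical examples, not our actual code.

// Global Hibernate settings (hibernate.cfg.xml or equivalent), e.g.:
//   hibernate.cache.use_second_level_cache = true
//   hibernate.cache.use_query_cache = true

import java.util.List;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.Session;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// A read-mostly reference table is a reasonable candidate for second-level caching.
// (Country is a hypothetical entity, purely for illustration.)
@Entity
@Cache(usage = CacheConcurrencyStrategy.READ_ONLY)
class Country {
    @Id Long id;
    String name;
}

class QueryCacheExample {
    // Each query must opt in with setCacheable(true); cached results land in the
    // StandardQueryCache region. Only do this for queries against read-only or
    // read-mostly tables, because any write to a table invalidates every cached
    // query result that touches that table.
    static List<?> loadCountries(Session session) {
        return session.createQuery("from Country")
                      .setCacheable(true)
                      .list();
    }
}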

Hunt and John recommend, as a general guideline, setting the heap size to be 3 to 4 times the live data size of the old generation. For us, that would be 12.8GB to 17.1GB. So, even after bumping up our heap from 8GB to 12GB, we are still running a lean and hungry heap, by this measure. They also recommend setting perm gen size to 1.2 to 1.5 times the live data size of the perm gen. Here, that would be 840MB to 1050MB.
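
Just to show the arithmetic, here is a trivial sketch using the occupancies from the Full GC entry above (the two constants are the 4483995K and 716799K figures from our log):

// Heap and perm gen sizing per Hunt and John's guideline, from our measured live data size.
class HeapSizingGuideline {
    public static void main(String[] args) {
        double oldGenLiveKB  = 4483995;  // old gen occupancy after the Full GC above
        double permGenLiveKB = 716799;   // perm gen occupancy after the Full GC above
        double kbPerGB = 1024 * 1024;
        // Total heap: 3x to 4x the old gen live data size.
        System.out.printf("heap: %.1fGB to %.1fGB%n",
                3 * oldGenLiveKB / kbPerGB, 4 * oldGenLiveKB / kbPerGB);
        // Perm gen: 1.2x to 1.5x the perm gen live data size.
        System.out.printf("perm: %.0fMB to %.0fMB%n",
                1.2 * permGenLiveKB / 1024, 1.5 * permGenLiveKB / 1024);
    }
}

// Output: heap: 12.8GB to 17.1GB
//         perm: 840MB to 1050MB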

Saturday, November 24, 2012

JVM Tuning - Heap Structure

Now for an overview of the structure of the JVM heap.

The heap is organized into three areas - the young generation, the old (or tenured) generation, and the permanent generation. "Permanent generation" is a bit of a misnomer; it holds metadata such as class definitions. Newly created objects go into the Eden space in the young generation. The young generation also has the survivor spaces, "from" and "to", which we will explain in a moment. When the young generation can no longer accommodate new objects, a minor or partial collection occurs.
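
If you want to see these spaces on a live JVM rather than in a diagram, the standard management API exposes each one as a memory pool. A small sketch (the pool names vary by collector; with the throughput collector they appear as PS Eden Space, PS Survivor Space, PS Old Gen, and PS Perm Gen):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Prints every memory pool the JVM reports, with its current usage.
class MemoryPoolDump {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            System.out.printf("%-22s used=%dK committed=%dK max=%dK%n",
                    pool.getName(),
                    usage.getUsed() / 1024,
                    usage.getCommitted() / 1024,
                    usage.getMax() / 1024);
        }
    }
}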


The left side of the diagram shows what happens in a healthy partial GC:

  1. Dead objects are marked (shaded black in the diagram).
  2. Live objects in Eden are copied to the To space.
  3. Objects in the From space that are still young enough (below the tenuring threshold) are copied to the To space.
  4. Objects in the From space that are old enough (at or above the tenuring threshold) are promoted to the old generation.
The right side shows the heap after the partial GC is complete. Now Eden is empty, and the From and To spaces have traded roles.
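
The "tenuring threshold" mentioned in steps 3 and 4 is the usual knob for deciding how long objects age in the survivor spaces before promotion. For illustration only (we have not tuned these ourselves), the relevant HotSpot flags are:

-XX:MaxTenuringThreshold=<n>     upper bound on how many minor GCs an object survives before promotion
-XX:SurvivorRatio=<n>            ratio of the Eden size to the size of each survivor space
-XX:+PrintTenuringDistribution   logs the age distribution of surviving objects at each minor GC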

Later, we will discuss what happens if there isn't enough room in the To space or in the old generation to accommodate the live objects that belong to them.

Friday, November 23, 2012

JVM Tuning - Heap Size

We're having a series of consultants from Red Hat come in to help us ensure that our production environment is in good shape. One of the things we asked the first one (we'll call him Marcus) to do was to evaluate our JVM options. In retrospect, we should have made a point of better understanding what his areas of expertise were, and limiting our requests accordingly; and he should have deflected requests that did not match his strong points. But live and learn, eh? And I've learned some things about JVM tuning through this process.

Prior to August this year, we ran twelve 8GB (specified by -Xms8192m -Xmx8192m) nodes in our main production cluster. While we were in the midst of a series of painful production outages marked by sudden episodes of back-to-back full garbage collections, Red Hat support recommended we allocate 16GB per node. There was some internal resistance to doing this (it seems an architect no longer with the company - we'll call him Eddie - had said that larger heaps would cause longer full GC pauses), but we went so far as to increase four nodes to 12GB (-Xms12288m -Xmx12288m). This was done a few weeks before Marcus came on site. GC logs since then showed that the 12GB nodes performed better than the 8GB nodes. Not only did they have more throughput, but the GC overhead and the average and maximum pause times were also significantly improved - contrary to what Eddie had said.


These graphs summarize GCHisto analysis of production GC logs over one week. (GCHisto, by the way, is a superb free open source application that is simple and fast, and gives a quick, easy overview of key statistics from GC logs.) The first four nodes are the 12GB heaps. Overhead % is the amount of time paused for GC divided by the amount of time the application ran. Max pause is the length, in seconds, of the longest full GC pause. We see that the 12GB nodes are better and more consistent, and the 8GB nodes seem to react more sharply to stress.
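
If you don't have GC logs handy, a rough version of the same overhead number can be read from a running JVM through the standard management beans. A sketch, assuming the cumulative collection time reported by the collector beans is an acceptable stand-in for total pause time (close enough for a sanity check, not a substitute for log analysis):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcOverhead {
    // Approximate GC overhead since JVM start, as a percentage of uptime.
    // Intended to be called from inside the running application (or exposed over JMX),
    // not run as a standalone program.
    static double overheadPercent() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();  // cumulative milliseconds; -1 if unsupported
            if (t > 0) {
                gcMillis += t;
            }
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return 100.0 * gcMillis / uptimeMillis;
    }
}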

So it was a no-brainer to go ahead and move our production cluster to ten 12GB nodes instead of four 12GB and eight 8GB. So far, so good. We didn't need a consultant to see this. But, the way decisions are made in organizations, it can help get things done if an outside "expert" has been paid to tell us what to do.

But we'll be revisiting this question of heap size in a future post. Don't forget that in August, Red Hat support recommended we take this further, and go to 16GB heaps.