Sunday, December 2, 2012

JVM Tuning: UseParallelOldGC

The preceding posts prepared us to look at the JVM option -XX:+UseParallelOldGC. This option was introduced in Java 5 update 6 and became the default with the throughput collector in Java 7 update 4. We do not specify a garbage collector, so we use the default for Java 6 update 14, which is the throughput collector. (The alternative is CMS.) The default for our version also includes UseParallelGC, which means the young generation is collected by multiple parallel threads, but the old generation is still collected by a single thread unless UseParallelOldGC is specified on the command line at startup. Marcus immediately recommended we specify -XX:+UseParallelOldGC, and all the available literature we could find, in print and online, agreed. In our own cautious way, we still tested this in our performance lab before putting it into production. Lab tests showed healthy improvement in the key metrics: a 24% improvement in Full GC overhead, a 29% improvement in average Full GC pause time, and a 16% improvement in maximum Full GC pause time. Then, proceeding in our still cautious way, we tried it on one of our ten production nodes.
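For reference, turning the collector on is just a matter of adding the flag to the JVM startup arguments. The command line below is only a sketch, not our actual startup script (the heap size and jar name are placeholders); adding -XX:+PrintCommandLineFlags is a convenient way to confirm which collector flags are actually in effect:

java -Xms12g -Xmx12g -XX:+UseParallelGC -XX:+UseParallelOldGC \
     -XX:+PrintCommandLineFlags -jar our-app.jar

Without -XX:+UseParallelOldGC, the same command line gives the default behavior described above: parallel young generation collections, single-threaded full collections of the old generation.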

Surprise! In production, UseParallelOldGC caused our Full GC overhead, our average Full GC pause time, and our maximum Full GC pause time to approximately double! All that caution seems to be worth something. We weren't able to correctly predict what the impact would be in production, but we minimized the damage by trying the change on one node first, and we gave ourselves the background to learn from the experience.

First of all, why did production behave differently from the performance lab trials? The lab differs from production in some pretty big ways: it runs a 4GB heap against production's 12GB, and our LoadRunner license, which is already too expensive, does not generate a production-sized load. Our lab may be useful for gauging the relative effect of changes in our own code, but it failed to deliver value here.

Second, why did a universally recommended practice backfire in our environment? I did find that we are not the only ones ever to experience this sort of surprise. Tony Chiu looked for answers at https://community.jboss.org/message/221725. He didn't get any, but the comments tentatively suggest too many parallel threads contending over too small an old generation. So we tried specifying -XX:ParallelGCThreads=2 (the default on our server is 4) to reduce contention, but the results were even worse. We wound up reverting to the default, which is serial (single-threaded) collection of the old gen.
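For anyone who wants to reproduce the experiment, the thread override we tried looked something like this (again a sketch with placeholder heap size and jar name, not our real startup line):

java -Xms12g -Xmx12g -XX:+UseParallelGC -XX:+UseParallelOldGC \
     -XX:ParallelGCThreads=2 -jar our-app.jar

One caveat worth noting: ParallelGCThreads also governs the number of threads used for young generation collections, so this setting changes more than just old gen behavior. Reverting simply meant removing -XX:+UseParallelOldGC and the thread override.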

This is purely hypothetical for now, but I suspect UseParallelOldGC needs more headroom to operate, and our heap is too lean and hungry (recall the discussion of live data size vs. heap size in the previous post). Organizational realities (politics) prevent us from increasing the heap size at this time, but I hope that upcoming changes in our StandardQueryCache will significantly reduce our live data size, and then it may be worth trying UseParallelOldGC again.

JVM Tuning - Live Data Size

Another critical concept to understand is live data size. Charlie Hunt and Binu John define this as the occupancy of the old generation and the permanent generation after a full collection, when the application is running in steady state.  Our application doesn't have a steady state, but it does have predictable peak usage hours every day. Here's a snippet from one of our GC logs in production during peak hours:

[Full GC
[PSYoungGen: 32045K->0K(4118912K)]
[PSOldGen: 6684267K->4483995K(8388608K)] 6716313K->4483995K(12507520K)
[PSPermGen: 716799K->716799K(716800K)], 10.6217130 secs
]
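For anyone who wants to gather the same data, a log in this format is what we would expect from HotSpot's standard GC logging flags; the command below is illustrative (the jar name is a placeholder, and the exact flags on our servers may differ). Adding -Xloggc with a file path sends the output to a dedicated log file instead of stdout:

java -Xmx12g -XX:+PrintGCDetails -jar our-app.jar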


We can see from the log that our old gen occupancy is 4.3GB and our perm gen occupancy is 700MB (and the perm gen is essentially full). It's also interesting to note that we see about the same live data size in a 12GB heap as in an 8GB heap.

Our third on-site Red Hat consultant (we'll call him Boris), who has seen a lot of different operations and seems to know his stuff well, was surprised that our live data size could be so big. Why so big? Well, it turns out our Hibernate second-level cache is 3.9GB, or 90% of our live data size, and our StandardQueryCache alone is 3.4GB. So we have a big investment there. Is it beneficial? The best way to know is to test, preferably in a performance lab rather than directly in production. However, a meaningful performance lab is not cheap and requires real organizational commitment (money and time). We have, in any case, figured out that some of our StandardQueryCache usage is counterproductive. SQC invalidation is very blunt: if any row of a table changes, every cached query result involving that table is invalidated. So SQC should only be used for read-only or read-mostly tables. We'll be cleaning that up.
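To make that invalidation behavior concrete, here is a minimal sketch of how an entity and a query typically opt into Hibernate's second-level and query caches. The class, field, and query names are hypothetical rather than taken from our code, and the sketch assumes hibernate.cache.use_second_level_cache and hibernate.cache.use_query_cache are turned on in the Hibernate configuration:

import java.util.List;

import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.Session;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Hypothetical read-mostly reference table: a reasonable query-cache candidate.
@Entity
@Cache(usage = CacheConcurrencyStrategy.READ_ONLY)
class Country {
    @Id Long id;
    String name;
}

class QueryCacheExample {
    // Results of this query are stored in the StandardQueryCache. Any write to
    // the Country table updates its entry in the UpdateTimestampsCache, which
    // invalidates every cached result that touches Country. That is why
    // frequently written tables make poor query-cache candidates.
    static List<?> loadCountries(Session session) {
        return session.createQuery("from Country")
                      .setCacheable(true)  // opt this query into the query cache
                      .list();
    }
}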

Hunt and John recommend, as a general guideline, setting the heap size to be 3 to 4 times the live data size of the old generation. For us, that would be 12.8GB to 17.1GB. So, even after bumping up our heap from 8GB to 12GB, we are still running a lean and hungry heap, by this measure. They also recommend setting perm gen size to 1.2 to 1.5 times the live data size of the perm gen. Here, that would be 840MB to 1050MB.
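Translated mechanically into startup flags, the guideline would look something like this. It is purely illustrative (we have not been able to make this change, and the jar name is a placeholder); it just picks a heap size inside the 12.8GB-17.1GB range and a perm gen inside the 840MB-1050MB range:

java -Xms16g -Xmx16g -XX:PermSize=1g -XX:MaxPermSize=1g -jar our-app.jar

Setting -Xms equal to -Xmx keeps the heap from resizing, and a fixed perm gen comfortably above its current 700MB occupancy would stop it from running full.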