We're having a series of consultants from Red Hat come in to help us ensure that our production environment is in good shape. One of the things we asked the first one (we'll call him Marcus) to do was to evaluate our JVM options. In retrospect, we should have made a point of better understanding what his areas of expertise were, and limited our requests accordingly; and he should have deflected requests that did not match his strong points. But live and learn, eh? And I've learned some things about JVM tuning through this process.
Prior to August this year, we ran twelve 8GB nodes (specified by -Xms8192m -Xmx8192m) in our main production cluster. While we were in the midst of a series of painful production outages marked by sudden episodes of back-to-back full garbage collections, Red Hat support recommended we allocate 16GB per node. There was some internal resistance to doing this (it seems an architect no longer with the company, whom we'll call Eddie, had said that larger heaps would cause longer full GC pauses), but we went so far as to increase four nodes to 12GB (-Xms12288m -Xmx12288m). This was done a few weeks before Marcus came on site. GC logs since then showed that the 12GB nodes performed better than the 8GB nodes. Not only did they have more throughput, but their GC overhead and their average and maximum pause times were also significantly better, contrary to what Eddie had said.
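For anyone who wants to double-check that a node really picked up the heap size it was meant to get, here is a minimal sketch. The class and method names (HeapCheck, heapFlags, maxHeapMb) are made up for illustration and are not part of our actual setup. It relies on two things I'm confident of: HotSpot takes the size directly after -Xms and -Xmx with no equals sign, and Runtime.getRuntime().maxMemory() reports roughly the -Xmx value (it can come out slightly lower with some collectors).

```java
public class HeapCheck {
    private static final long BYTES_PER_MB = 1024L * 1024L;

    /** Builds matching -Xms/-Xmx flags for a fixed-size heap (hypothetical helper). */
    public static String heapFlags(int heapMb) {
        if (heapMb <= 0) {
            throw new IllegalArgumentException("heapMb must be positive");
        }
        // The size follows the flag directly, with no '=' in between.
        return "-Xms" + heapMb + "m -Xmx" + heapMb + "m";
    }

    /** Maximum heap this JVM will attempt to use, in megabytes. */
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / BYTES_PER_MB;
    }

    public static void main(String[] args) {
        System.out.println("Flags for a 12GB node: " + heapFlags(12 * 1024));
        System.out.println("Max heap seen by this JVM: " + maxHeapMb() + " MB");
    }
}
```

Setting -Xms and -Xmx to the same value, as we did, fixes the heap size up front so the JVM never has to grow or shrink it under load.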
These graphs summarize GCHisto analysis of production GC logs over one week. (GCHisto, by the way, is a superb free open source application that is simple and fast, and gives a quick, easy overview of key statistics from GC logs.) The first four nodes are the ones with 12GB heaps. Overhead % is the amount of time paused for GC divided by the amount of time the application ran. Max pause is the length, in seconds, of the longest full GC pause. We see that the 12GB nodes are better and more consistent. The 8GB nodes also seem to be more reactive to stress.
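To make those two statistics concrete, here is a rough sketch of how they could be computed from a list of pause durations. This is not GCHisto's code. The class name GcOverhead and the sample numbers are invented for illustration, and I'm assuming "time the application ran" means wall-clock elapsed time, pauses included.

```java
import java.util.Arrays;

public class GcOverhead {
    /**
     * Overhead %: total time paused for GC divided by elapsed time, times 100.
     * Assumes elapsedSeconds is wall-clock time, including the pauses.
     */
    public static double overheadPercent(double[] pauseSeconds, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        double totalPaused = Arrays.stream(pauseSeconds).sum();
        return 100.0 * totalPaused / elapsedSeconds;
    }

    /** Longest single pause in seconds, or 0 if there were no pauses. */
    public static double maxPause(double[] pauseSeconds) {
        return Arrays.stream(pauseSeconds).max().orElse(0.0);
    }

    public static void main(String[] args) {
        // Made-up sample: three pauses over a ten-minute window.
        double[] pauses = {0.5, 1.5, 4.0};
        double elapsed = 600.0;
        System.out.printf("Overhead: %.2f%%%n", overheadPercent(pauses, elapsed));
        System.out.printf("Max pause: %.1f s%n", maxPause(pauses));
    }
}
```

Note that the two numbers can move independently: a node can have low overhead overall and still suffer one very long pause, which is why it helps to look at both.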
So it was a no-brainer to go ahead and move our production cluster to ten 12GB nodes instead of four 12GB and eight 8GB. So far, so good. We didn't need a consultant to see this. But, given the way decisions are made in organizations, it can help get things done if an outside "expert" has been paid to tell us what to do.
But we'll be revisiting this question of heap size in a future post. Don't forget that in August, Red Hat support recommended we take this further, and go to 16GB heaps.