Companies might be afraid of jumping on the JVM because it suffers from a bad reputation of being atrociously slow at times, and the reason usually given is that the GC -garbage collector- is terribly slow and blocks application threads from running for long periods of time. This is of course not acceptable in many applications, and we will see next how to understand and tune the garbage collector to avoid long pauses.
Furthermore leveraging modern reactive distributed platforms like Akka clustering and sharding help us scale horizontally and thus mitigate potential issues which might arise with a very large heap in a monolithic application.
Choosing a type of GC comes with trade-offs: you have to choose between maximum throughput and low pauses. A web server is a typical application where you would prefer low pauses, as you do not want a customer to wait multiple seconds to get a page. On the other hand, a batch job or a crawler is a good candidate for high throughput and can tolerate longer pauses. What choices do we have?
As multicore machines are now everywhere, you should almost always use the parallel collector: -XX:+UseParallelOldGC
This runs the garbage collection in parallel for both the young and old generation -see below for the meaning of young/old generations-.
Note that this is the default on a 64 bits multicore machine, so you don't have to specify it explicitly.
CMS is the older low-pause GC option, while G1 has matured considerably since its introduction in JDK 7 and is the default in JDK 9. We will talk more about the two in the next sections; just keep in mind that CMS usually requires advanced tuning, that G1 is intended to replace it, and that Oracle's work mostly goes towards G1. Therefore G1 should probably be your first option.
To activate CMS, use: -XX:+UseConcMarkSweepGC
To activate G1, use: -XX:+UseG1GC
Note: There are also commercial JVMs available to reduce pauses like Azul Zing or JRockit Real Time; we are not going to describe those.
This model of contiguous memory regions is true in all GCs but G1.
G1 divides the whole heap into many smaller regions -it targets 2048 regions-. A region can be Eden, Survivor or Tenured. The benefit of splitting the heap into regions is that G1 can decide to analyze only a part of the heap, selecting regions containing plenty of garbage in order to meet a given pause time and deferring the collection of other regions.
We won’t talk about permanent generation.
In JDK 8, the permanent generation has been replaced by native memory off the heap called Metaspace. Prior to JDK 8, there were not a lot of easy ways to tune the perm size, so just make sure it is big enough for your needs by adjusting -XX:MaxPermSize -it contains class metadata like hierarchy information or method bytecode-.
Once you figure out the heap size of your warmed-up application, it is preferable to set the initial size equal or close to the maximum heap size. This avoids losing performance, as the JVM would otherwise need to resize the heap after each GC. To set it to 4GB, this is done with: -Xms4048m and -Xmx4048m
Without setting this, defaults are:
-XX:MinHeapFreeRatio=40 and -XX:MaxHeapFreeRatio=70
All objects are created in the Eden space. Whenever the Eden fills up, a young generation collection is triggered, objects not referenced anymore are collected and the rest are survivors.
Survivor spaces S0 and S1 are identical in size and one is always empty. Initially both are empty, so survivors are moved to one of them on the first young GC, let's assume S0. Survivors are marked with a counter specifying how many young GCs they have survived. On the second young GC, survivors from the Eden space are moved to S1, and the survivors in S0 have their counter incremented and are moved to S1 as well; S0 is now empty. This goes on until the survivor space fills up or a certain threshold -the number of collections an object has survived- is met, and the objects are then moved to the tenured space. Once they are in the tenured space, there is no coming back.
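The copy-and-age cycle described above can be sketched as a toy model. This is an illustration of the bookkeeping only, not actual JVM code; the threshold here mirrors the role of -XX:MaxTenuringThreshold:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the survivor-space copy/age cycle.
// Objects are represented just by their age (number of young GCs survived).
public class SurvivorModel {
    final int tenuringThreshold;            // mirrors -XX:MaxTenuringThreshold
    List<Integer> from = new ArrayList<>(); // survivor space currently holding objects
    List<Integer> to = new ArrayList<>();   // the other survivor space, always empty between GCs
    List<Integer> tenured = new ArrayList<>();

    SurvivorModel(int tenuringThreshold) { this.tenuringThreshold = tenuringThreshold; }

    // One young GC: 'edenSurvivors' objects survive Eden and enter 'to' with age 1;
    // objects already in 'from' age by one and either move to 'to' or get promoted.
    void youngGc(int edenSurvivors) {
        for (int i = 0; i < edenSurvivors; i++) to.add(1);
        for (int age : from) {
            int newAge = age + 1;
            if (newAge >= tenuringThreshold) tenured.add(newAge); // promoted, no coming back
            else to.add(newAge);
        }
        from.clear();
        List<Integer> tmp = from; from = to; to = tmp; // swap: one space is always empty
    }

    public static void main(String[] args) {
        SurvivorModel m = new SurvivorModel(3);
        m.youngGc(2); // 2 survivors enter with age 1
        m.youngGc(1); // the 2 reach age 2, one new survivor enters
        m.youngGc(0); // the age-2 objects reach the threshold and are promoted
        System.out.println(m.tenured.size()); // 2
    }
}
```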
Collection of the old generation depends on the chosen GC. It is triggered when the heap fills up to a certain amount -configurable via options; 92% of the old generation in JDK 8 for CMS- and the collector usually uses heuristics to schedule the next collections at the best time according to your application's runtime behaviour. Most collectors like CMS collect the old generation separately from the young generation, though both can interleave. G1 uses mixed collections instead of regular old collections and they are slightly different: a marking cycle followed by mixed GCs is triggered by default when 45% of the total heap is full. Mixed collections piggyback on the young collection, and each one collects only a part of the old generation. This is tunable; by default G1 tries to use at most 8 successive mixed collections to collect all the old regions where garbage is above a certain tunable threshold. By avoiding collecting the whole old generation at once like the other GCs, G1 usually comes with shorter pauses and thus tends to perform better than CMS with larger heaps (> 10GB).
You should almost never manually trigger the GC in the app via System.gc() as the JVM is likely to be smarter than you.
Some people tend to believe that only an old collection can be a Stop The World -STW- operation and therefore don't pay much attention to the young GC. Actually the young GC is a STW operation too. It is parallelized -unless you explicitly choose the serial collector- and is usually not much of a concern because it happens really fast. This is the same for all GCs, so regarding the young GC it does not make much of a difference whether you choose CMS, G1 or ParallelOldGC.
Marking garbage in the Eden is fast, yet two other aspects can cause the young GC to take more time:
Most objects die young. You want to avoid promoting them to the old space if they are supposed to die not long after and you want to avoid moving them over and over between the survivors spaces if they are very likely to end up in the old generation anyway because they are long-lived objects.
Let’s assume we have a web server and a typical request/response flow creates many objects that are garbage after the response is sent. If this response is usually sent in 500ms, then it would be nice if we tune the young generation so that objects don’t get promoted within 500ms. The same may be true for a small cache with short TTL.
Ideally we want all these objects to die in the Eden space, so the overhead is minimal. Looking at the GC logs with tools described in a later section, we can identify the period between young collections; in this case we would like it to be > 500ms. This could be achieved by increasing the young generation size globally or by increasing the Eden space relative to the survivor spaces. Tuning the survivor spaces is not straightforward: you do not want them too large, or you may waste a lot of space if many objects die in the Eden space; you do not want them too small either, as that could lead to premature promotion to the old space, also potentially increasing future young GC pauses because of scanning old-to-young generation references. You do not want excessive moving from one space to the other either. By default, the JVM will adjust the threshold for promoting survivors to the old generation to keep the survivor space half full.
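The relative sizing discussed above is controlled by -XX:NewRatio -old/young ratio- and -XX:SurvivorRatio -Eden/survivor ratio-. A quick back-of-the-envelope helper, using the commonly documented HotSpot formulas (young = heap / (NewRatio + 1), each survivor = young / (SurvivorRatio + 2)):

```java
public class GenSizing {
    // young = heap / (NewRatio + 1)
    static long youngSize(long heapBytes, int newRatio) {
        return heapBytes / (newRatio + 1);
    }

    // each survivor = young / (SurvivorRatio + 2): Eden gets 'survivorRatio'
    // parts, S0 and S1 get one part each
    static long survivorSize(long heapBytes, int newRatio, int survivorRatio) {
        return youngSize(heapBytes, newRatio) / (survivorRatio + 2);
    }

    public static void main(String[] args) {
        long heap = 4048L * 1024 * 1024;               // -Xmx4048m as in the examples
        System.out.println(youngSize(heap, 1));        // NewRatio=1: half the heap is young
        System.out.println(survivorSize(heap, 1, 12)); // SurvivorRatio=12: young/14 each
    }
}
```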
This is what people are usually afraid of regarding long pauses, and it leads to a confusion between an old GC cleaning the old generation and a full GC cleaning both the young and the old generation. A full GC is a STW operation and can be really slow; with CMS and G1 on JDK 8 it is even single threaded, so this is the one to avoid.
Choosing the right GC so that the tenured space is cleaned via old GCs -CMS- or mixed collections -G1- should lead to short pauses, as the most expensive steps run concurrently with the application threads and do not pause them. Therefore what you really want to avoid is the full GC, not the old GC, and you should not fear having objects in your tenured space.
So what may cause a full GC to happen if the old GC should already clean the tenured space?
Logging the activity is crucial to be able to analyze and tune it. We do not want to do premature tuning as it could be more detrimental than beneficial.
Set these options when starting the JVM: -Xloggc:<file>, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps and -XX:+PrintGCTimeStamps -the same flags used in the examples below-.
Depending on the selected GC, the format will be different but you can take a look at the log to see stop the world pauses:
The following denotes a young generation collection; you should be looking at the real time, as this is the stop the world pause time.
2016-05-25T16:42:11.034+0500: 1.243: [GC (Allocation Failure) 2016-05-25T16:42:11.034+0500: 1.243: [ParNew: 545344K->42122K(613440K), 0.0452812 secs] 545344K->42122K(12269056K), 0.0453581 secs] [Times: user=0.27 sys=0.03, real=0.05 secs]
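A quick way to extract the stop the world time from such lines is to grab the real value. A sketch; the regex only covers the `Times: … real=… secs` format shown above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogRealTime {
    private static final Pattern REAL = Pattern.compile("real=([0-9.]+) secs");

    // Returns the STW pause in seconds, or -1 if the line has no Times section.
    static double realSeconds(String logLine) {
        Matcher m = REAL.matcher(logLine);
        return m.find() ? Double.parseDouble(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        String line = "2016-05-25T16:42:11.034+0500: 1.243: [GC (Allocation Failure) "
                + "[ParNew: 545344K->42122K(613440K), 0.0452812 secs] "
                + "545344K->42122K(12269056K), 0.0453581 secs] "
                + "[Times: user=0.27 sys=0.03, real=0.05 secs]";
        System.out.println(realSeconds(line)); // 0.05
    }
}
```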
The following denotes a CMS cycle. The phases marked as concurrent do not pause your application threads; the phases that do pause the application are CMS-initial-mark and CMS Final Remark. Note also that CMS is interleaved with young generation -ParNew- collections.
2016-05-25T16:44:58.942+0500: 169.152: [GC (CMS Initial Mark) [1 CMS-initial-mark: 5863216K(11655616K)] 5935974K(12269056K), 0.0093617 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
2016-05-25T16:44:58.952+0500: 169.161: [CMS-concurrent-mark-start]
2016-05-25T16:44:59.146+0500: 169.356: [CMS-concurrent-mark: 0.195/0.195 secs] [Times: user=1.12 sys=0.07, real=0.19 secs]
2016-05-25T16:44:59.146+0500: 169.356: [CMS-concurrent-preclean-start]
2016-05-25T16:44:59.166+0500: 169.375: [CMS-concurrent-preclean: 0.019/0.019 secs] [Times: user=0.10 sys=0.00, real=0.02 secs]
2016-05-25T16:44:59.166+0500: 169.376: [CMS-concurrent-abortable-preclean-start]
2016-05-25T16:45:00.057+0500: 170.267: [GC (Allocation Failure) 2016-05-25T16:45:00.057+0500: 170.267: [ParNew: 613440K->68096K(613440K), 0.1139297 secs] 6476656K->5983896K(12269056K), 0.1139964 secs] [Times: user=0.69 sys=0.03, real=0.12 secs]
2016-05-25T16:45:01.029+0500: 171.239: [CMS-concurrent-abortable-preclean: 1.750/1.864 secs] [Times: user=9.22 sys=0.17, real=1.87 secs]
2016-05-25T16:45:01.030+0500: 171.240: [GC (CMS Final Remark) [YG occupancy: 486094 K (613440 K)]2016-05-25T16:45:01.030+0500: 171.240: [Rescan (parallel) , 0.1157711 secs]2016-05-25T16:45:01.145+0500: 171.355: [weak refs processing, 0.0000302 secs]2016-05-25T16:45:01.146+0500: 171.355: [class unloading, 0.0014006 secs]2016-05-25T16:45:01.147+0500: 171.357: [scrub symbol table, 0.0010041 secs]2016-05-25T16:45:01.148+0500: 171.358: [scrub string table, 0.0001544 secs][1 CMS-remark: 5915800K(11655616K)] 6401895K(12269056K), 0.1186299 secs] [Times: user=0.90 sys=0.01, real=0.11 secs]
2016-05-25T16:45:01.148+0500: 171.358: [CMS-concurrent-sweep-start]
2016-05-25T16:45:13.840+0500: 184.050: [CMS-concurrent-sweep: 11.559/12.691 secs] [Times: user=65.02 sys=1.26, real=12.70 secs]
2016-05-25T16:45:13.840+0500: 184.050: [CMS-concurrent-reset-start]
2016-05-25T16:45:13.902+0500: 184.112: [CMS-concurrent-reset: 0.062/0.062 secs] [Times: user=0.27 sys=0.03, real=0.06 secs]
Look for the term evacuation pause in young or mixed collections:
2016-05-25T13:53:24.046+0500: 25.058: [GC pause (G1 Evacuation Pause) (young), 0.1689750 secs]
[Parallel Time: 167.7 ms, GC Workers: 8]
[GC Worker Start (ms): Min: 25058.6, Avg: 25059.9, Max: 25062.1, Diff: 3.5]
[Ext Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 0.7]
[Update RS (ms): Min: 11.5, Avg: 13.3, Max: 14.4, Diff: 2.9, Sum: 106.1]
[Processed Buffers: Min: 8, Avg: 12.4, Max: 18, Diff: 10, Sum: 99]
[Scan RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 0.5]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Object Copy (ms): Min: 152.2, Avg: 152.7, Max: 152.9, Diff: 0.7, Sum: 1221.8]
[Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.6]
[Termination Attempts: Min: 1, Avg: 2.1, Max: 3, Diff: 2, Sum: 17]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[GC Worker Total (ms): Min: 164.1, Avg: 166.2, Max: 167.6, Diff: 3.5, Sum: 1330.0]
[GC Worker End (ms): Min: 25226.2, Avg: 25226.2, Max: 25226.2, Diff: 0.0]
[Code Root Fixup: 0.0 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.2 ms]
[Other: 1.1 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 0.4 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.3 ms]
[Humongous Register: 0.1 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.1 ms]
[Eden: 176.0M(176.0M)->0.0B(176.0M) Survivors: 26.0M->26.0M Heap: 2575.0M(4048.0M)->2525.5M(4048.0M)]
[Times: user=0.88 sys=0.07, real=0.17 secs]
We will use GCViewer in the following examples.
The application is available at this github repository and can be run with sbt run.
Note: Test machine is a MacBook-Pro 2015 8 cores, 16GB RAM
The application is composed of a master actor creating worker actors. Each worker actor lives for a certain amount of time, during which it simply allocates new objects on the heap; we say that the worker has n lives. At each step, it creates some garbage that is immediately collectible by the GC, and it also creates some data that it keeps until it dies. Thus at each step the actor contains more data in its state. As soon as an actor runs out of lives, the master kills it and creates a fresh worker. At that point, the dead worker becomes collectible along with its state.
We collect the time needed to allocate new objects in the worker, as this is likely to trigger a GC; thus we can monitor the global throughput of the app as well as the number of times the app experiences significant pauses. For this number to be reasonably accurate, it is best to run with a number of workers smaller than the number of cpu cores.
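The measurement described above boils down to timestamping around an allocation: a spike in the measured duration usually means a GC pause hit that thread. A simplified illustration of the idea; the batch size is arbitrary:

```java
import java.util.ArrayList;
import java.util.List;

public class PauseProbe {
    // Allocates a small batch of objects and returns the elapsed wall time in ms.
    // Under normal conditions this is close to 0 ms; values of tens or hundreds
    // of ms indicate the thread was stopped by a GC pause during the allocation.
    static double timedAllocationMs() {
        long start = System.nanoTime();
        List<byte[]> batch = new ArrayList<>();
        for (int i = 0; i < 100; i++) batch.add(new byte[1024]); // ~100KB allocated
        long elapsed = System.nanoTime() - start;
        if (batch.size() != 100) throw new AssertionError(); // keep 'batch' live until here
        return elapsed / 1_000_000.0;
    }

    public static void main(String[] args) {
        System.out.println(timedAllocationMs() >= 0);
    }
}
```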
The tunables are the number of workers, the lives upper bound -randomly chosen from 1 to this bound-, the generated data list size upper bound -randomly chosen from 1 to this bound- and the test duration.
In this example we only give a maximum of 10 lives to the workers, so that they should never be able to reach the tenured space:
"-Xms4048m", "-Xmx4048m", "-XX:+UseConcMarkSweepGC", "-Xloggc:gc_cms_4g_5_10_1000_60.log", "-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps", "-XX:+PrintGCTimeStamps"
Nothing gets promoted to the tenured space. Young collections happen every second or so and collect all the data, as the grey curve is stable.
Average young GC pause is 6 ms
In this case, we can reduce the number of pauses while keeping each pause low by simply increasing the young space, thus increasing overall throughput. In the example below, accumulated pauses decreased from 0.25 to 0.08 and the average pause is 5ms.
"-Xms4048m", "-Xmx4048m", "-XX:+UseConcMarkSweepGC", "-Xloggc:gc_cms_4g_5_10_1000_60.log", "-XX:NewRatio=1", "-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps", "-XX:+PrintGCTimeStamps"
Going further, because almost nothing survives, we can give more space to Eden by reducing the size of the survivor spaces; this gives 0.06 accumulated pauses and pauses under 5ms.
"-Xms4048m", "-Xmx4048m", "-XX:+UseConcMarkSweepGC", "-Xloggc:gc_cms_4g_5_10_1000_60.log", "-XX:NewRatio=1", "-XX:SurvivorRatio=12", "-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps", "-XX:+PrintGCTimeStamps"
Increasing the number of lives will create survivors and eventually promote them to the tenured space.
"-Xms4048m", "-Xmx4048m", "-XX:+UseConcMarkSweepGC", "-Xloggc:gc_cms_4g_5_10_1000_60.log", "-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps", "-XX:+PrintGCTimeStamps"
Most pauses are ParNew -young generation GC- pauses; the STW phases of CMS are in fact very short here -the GC CMS pauses above-. The dark purple line shows CMS in action after 30 seconds, cleaning the old generation mostly concurrently, which is good. The average pause is 169 ms.
Again here, increasing the young generation size is a good idea because we only have middle-aged garbage, so we don't need to push it to the old generation; by increasing the ratio, young GC will run less frequently, more objects will die in Eden and fewer objects will be moved around in the survivor spaces. Below, we achieve a 95ms average pause and 5 times less total pause time.
"-Xms4048m", "-Xmx4048m", "-XX:+UseConcMarkSweepGC", "-XX:NewRatio=1", "-Xloggc:gc_cms_4g_5_5000_1000_60.log", "-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps", "-XX:+PrintGCTimeStamps"
Note: Usually in real world applications, you might not want a ratio of 1. A ratio of 2 is probably a better limit.
G1 is simpler to tune: you just specify your target pause time and it will do its best to meet it.
"-Xms4048m", "-Xmx4048m", "-XX:+UseG1GC", "-XX:MaxGCPauseMillis=100", "-Xloggc:gc_g1_4g_5_5000_1000_60.log", "-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps", "-XX:+PrintGCTimeStamps"
G1 has an adaptive policy to automatically resize the young generation to meet the pause target. This can be observed in the image below -orange and pink regions- and has the advantage that we don't need to do much tuning ourselves.
Average pause is 74ms. If you increase your target pause time, you will get increased throughput. Below target is 500ms.
"-Xms4048m", "-Xmx4048m", "-XX:+UseConcMarkSweepGC", "-Xloggc:gc_cms_4g_5_50000_1000_60.log", "-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps", "-XX:+PrintGCTimeStamps"
STW pauses are easy to see. They appear as large black columns. In the image above, the first STW pause is 5 seconds and the second is 4 seconds. If we test it for 4 minutes, we see even more STW pauses.
One indication that we would benefit from a bigger heap is the heap size after each CMS GC: it is still pretty close to the maximum heap size. As a result, CMS runs pretty often but does not reclaim a lot of space, so it is easy to create a few more objects and exceed the heap size. Increasing the heap size fixes the problem.
"-Xms8048m", "-Xmx8048m", "-XX:+UseConcMarkSweepGC", "-Xloggc:gc_cms_4g_5_50000_1000_60.log", "-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps", "-XX:+PrintGCTimeStamps"
CMS comes with another potential problem: it only marks regions that are garbage but does not compact them. This may lead to fragmentation, and you may observe a STW full GC even if the size of your live objects does not account for your full heap size. There is not much you can do about this with CMS, although CMS is pretty smart about it and will try to fill holes with new objects of the same size.
By adding -XX:PrintFLSStatistics=1 you can see fragmentation over time in the log.
Check these lines in the log:
Total Free Space: 198303811
Max Chunk Size: 197593752
Here we notice a little fragmentation, but the impact should be minimal.
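A rough fragmentation indicator from those two numbers is 1 - maxChunk/totalFree: close to 0 means the free space is mostly one contiguous chunk, close to 1 means it is scattered in small holes. This is a heuristic reading of the FLS output, not an official metric:

```java
public class FlsFragmentation {
    // 0.0 = free space is one contiguous chunk, approaching 1.0 = fully scattered
    static double fragmentation(long totalFreeSpace, long maxChunkSize) {
        return 1.0 - (double) maxChunkSize / totalFreeSpace;
    }

    public static void main(String[] args) {
        // Values from the PrintFLSStatistics output above
        System.out.printf("%.4f%n", fragmentation(198303811L, 197593752L)); // ~0.0036
    }
}
```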
The first time, it kicks in based on the value of -XX:CMSInitiatingOccupancyFraction, which depends on the Java release (68% in older JDKs, 92% in JDK 8). If your application creates large objects, has small survivor spaces or a lot of objects which do not die in the young space, then many objects may get promoted to the old generation. If you see that CMS kicks in too late and cannot complete fast enough -meaning that the rate of promotion to the old space is too high-, then you might want to lower this parameter so that CMS kicks in sooner. After the first run, the JVM analyzes your object creation/death rates and adjusts the timing of the following CMS starts accordingly. Usually the JVM does a better job than you, so you should be satisfied with this behaviour. If you really know your data lifecycle very well, you can force CMS to always run at a given threshold by setting -XX:CMSInitiatingOccupancyFraction together with -XX:+UseCMSInitiatingOccupancyOnly, which disables the heuristics.
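For example, to force CMS to always start the old collection at a fixed occupancy, the flags would look like this (the 75% value is illustrative, not a recommendation):

```
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
```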
If you see that your CMS collections do not reclaim much heap space, then there is no benefit to lowering this value; in fact, you probably want to increase it to reduce the global overhead of the GC. You probably want to give it a bigger heap as well.
G1 is not exempt from long STW pauses: if your heap is too small, full GCs will happen as well.
"-Xms4048m", "-Xmx4048m", "-XX:+UseG1GC", "-XX:MaxGCPauseMillis=800", "-Xloggc:gc_g1_4g_5_5000_1000_240_800ms.log", "-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps", "-XX:+PrintGCTimeStamps"
Increasing the heap size to 6G fixes it.
"-Xms6048m", "-Xmx6048m", "-XX:+UseG1GC", "-XX:MaxGCPauseMillis=800", "-Xloggc:gc_g1_4g_5_5000_1000_240_800ms.log", "-XX:+PrintGCDetails", "-XX:+PrintGCDateStamps", "-XX:+PrintGCTimeStamps"
Start simple with G1, just adjust your heap size and set a target pause time. Estimate your heap size based on your long lived data and double it to set your heap. G1 works better with more room to play, so leave half your heap empty.
G1 suffers if it needs to allocate a lot of humongous regions. A humongous allocation happens each time an object is bigger than 50% of the region size, and it wastes space because nothing else will be allocated in the regions it occupies. If an object takes 51% of a region, 49% of that region is wasted. Worse, if a region is 2MB and your object is 2.1MB, it spans two regions and wastes 1.9MB in the second one. If you allocate large objects, adjust -XX:G1HeapRegionSize. G1 targets 2048 regions with a region size ranging from 1MB to 32MB, so you can calculate your region size from the size of your heap: with a small heap of 4GB, the region size will be 2MB. If you allocate many objects bigger than 1MB, you can set this parameter to a higher value to increase the region size and decrease the number of regions.
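The default region size can be approximated as: divide the heap by the 2048-region target, round down to a power of two, and clamp between 1MB and 32MB. This is a sketch of the published heuristic; the exact HotSpot code also takes the initial heap size into account:

```java
public class G1RegionSize {
    static final long MB = 1024 * 1024;

    // Approximation of G1's default: heap / 2048 target regions, rounded down
    // to a power of two, clamped to [1MB, 32MB].
    static long regionSize(long heapBytes) {
        long size = Long.highestOneBit(heapBytes / 2048);
        return Math.max(MB, Math.min(32 * MB, size));
    }

    public static void main(String[] args) {
        System.out.println(regionSize(4096 * MB) / MB); // 4GB heap -> 2 (MB)
        System.out.println(regionSize(64L * 1024 * MB) / MB); // 64GB heap -> 32 (MB)
    }
}
```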
G1 suffers less from fragmentation than CMS, as it compacts data based on -XX:G1MixedGCLiveThresholdPercent.
If a region contains a lot of garbage it will be collected during a mixed collection and the little live data will be moved to an empty region, thus eliminating fragmentation for this region. This collected region can then be reclaimed. Fragmentation remains in regions not moved because they contain a lot of live data and therefore would be too expensive to move.
There are a few experimental flags to tune G1, but only use them if really needed; they are described in Oracle's G1 documentation.
Understanding what your application does is crucial here. As an example, an application doing a lot of IO like a database or a search engine should leverage the operating system buffer cache and therefore leave plenty of RAM to the OS.
Also notice in the Elasticsearch documentation the optimization that the JVM can do for heaps smaller than 32GB on a 64 bits architecture, thanks to the use of compressed 32 bits pointers -compressed oops-.
Finally disable swapping.
Building one giant application with a 100GB heap is usually not the best strategy: it more easily leads to longer pauses and is also harder to profile and tweak, as many different things might run on the same JVM and would ideally require different tuning. Try to build your app with elasticity in mind, using distributed systems and stateless services; Akka cluster may be your friend here via sharding. If you are not able to meet your latency requirements on one node, it is easy to add more shards to your architecture and ask each node to do less. You can also mitigate issues by shutting down and restarting a node, which is an expected scenario on a distributed platform and should impact you less -downtime, higher load on the few other machines, etc.- than with a monolith.
Of course choose distributed systems if it makes sense for you.
G1 is clearly the future of GC in HotSpot. Given its simplicity and the work that Oracle has done and is still doing, it will only get better. Therefore try it as your first choice on JDK 8 and only fall back to CMS if it does not work well for you.
Do not forget that you have tools and frameworks available to you to monitor your application but also to design it in a horizontally scalable and resilient fashion; by using Akka clustering and sharding, you can control how large your heap should be on a single node which may help you if you choose to go with CMS.