In our new Java Monthly edition, we’d like to introduce you to Monica Beckwith. She was kind enough to share her experience on 12 more Java-related questions. Mind you, you can find the majority of the answers and more in Monica’s upcoming book “JVM Performance Engineering: Inside OpenJDK and the HotSpot Java Virtual Machine”, which we are so excited for! Don’t forget to check it out!
Monica Beckwith is the Java/JVM performance architect at Microsoft. Prior to this, Monica worked as a Runtimes performance architect at ARM, as well as at Oracle as a Java HotSpot performance engineer. In addition to her work to port the JVM to ARM, Monica has made numerous performance contributions to the Java HotSpot VM, including creating a number of JVM heuristics for GC and JIT optimizations. She was also involved in creating NUMA-aware allocators.
Monica is considered one of the influential women in Java and Scala. For this she was inducted into the Java Champions program. She is a champion of girls in STEAM and loves volunteering her time coaching kids in (Java) programming and, more recently, Lego Robotics. Monica has co-authored the ‘Java Performance Companion’ book and is currently working on a book titled ‘JVM Performance Engineering: Inside OpenJDK and the HotSpot Java Virtual Machine’. Additionally, Monica has spoken at many different conferences and has been awarded the JavaOne Rock Star title several times.
You can catch Monica online @mon_beck (Twitter) and also look up her presentations @SlideShare.
Dreamix: Could you name a couple of must-have characteristics of a well-designed garbage collector?
Monica Beckwith: A well-designed collector is one that is able to meet the desired service-level agreements or performance requirements of the system. Since throughput and latency are the two main drivers of refinement in GC algorithms, you will find that when looking for an optimized GC, we are looking for one of two things:
– we either want to maximize the throughput of an application, and hence look for a GC that I call the throughput maximizer, or
– we are looking for a GC that helps maintain the high responsiveness of an application by introducing low-to-no pauses, thus avoiding long tail latencies. I call such a GC low-latency optimized.
For middleware applications where higher throughput is desired and soft predictability of response times in the 10ms-200ms range is acceptable, a GC that can straddle the two extremes works like a charm.
Dreamix: What are the key factors on GC performance within a system?
Monica Beckwith: I recently spoke at GeeCON in Prague, and one of the things that I covered was the performance requirements that drive innovation in garbage collectors (the ones that move live objects around). There were two things that I highlighted:
a) Generations: Basically, the majority of workloads have shown that the weak-generational hypothesis holds, and both ZGC and Shenandoah, which are currently single-space solutions, are being fitted with generational spaces. So, if we start a word cloud around the generational algorithm, it will look like below:
The two main characteristics of a generational GC are related to the weak-generational hypothesis – most objects die young, which means we only promote objects that are long-lived. If we have an efficient generational GC, we don’t promote transients, nor do we promote medium-lived objects. This usually results in smaller long-lived data sets, and the GC is not overworked, keeping premature promotions, fragmentation, evacuation failures, and similar degenerative issues at bay.
A third consideration is the footprint and responsiveness overhead of maintaining generations – the generational algorithm has proven to be a great help to OpenJDK GCs, but there is no free lunch. Since the young-generation collector works separately from, and more often than, the old-generation collector and ends up moving live data, there needs to be a way for the GC to track any old-to-young references; hence generational GCs must carry maintenance/bookkeeping overhead to be sure that they mark all reachable objects.
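This generational split is visible from plain Java code through the standard management API. Here is a minimal sketch (no special flags assumed; the exact collector names depend on which GC the JVM was started with):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ListCollectors {
    public static void main(String[] args) {
        // Each collector reports its name and the memory pools (spaces) it manages.
        // On a generational collector such as G1 you will typically see a pair like
        // "G1 Young Generation" and "G1 Old Generation".
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + " -> "
                    + String.join(", ", gc.getMemoryPoolNames()));
        }
    }
}
```

The same beans also expose `getCollectionCount()` and `getCollectionTime()`, which make the relative frequency of young versus old collections easy to observe.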
b) GC Work: This is the type of work each GC needs to do to meet its SLA/performance requirements. So, for a throughput maximizer this would be “parallel work”. A GC word cloud for the parallel work would look like below:
A throughput-maximizing collector should do parallel work for two reasons: the first is to maximize efficiency by incorporating as many threads as needed to reach the maximum scalability factor. The second is to shorten the duration of the pause, and for that, once again, the GC work needs to be parallelized to the max. While we will still have to carry out some of the work in series – e.g., clearing and resetting entries and tables can only happen after they have been processed by the in-progress GC cycle – we can nonetheless apply queuing-theory principles of task queues, task stealing, etc., to the GC tasks. This will once again help us maximize the scaling factor.
Finally, for a throughput maximizer, you need to monitor the free “to” space so that allocations, promotions, or even plain movement due to compaction can be accommodated; otherwise we will encounter a series of “to-space” overflows, aka “evacuation failures”. In such a situation, a fallback GC (it can be the same full GC that your GC uses for its full-heap compaction) is desired, or else the application will be brought to its knees.
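The task-queue and task-stealing principles mentioned above are not GC-internal magic; the same idea is exposed to application code by `ForkJoinPool`, whose workers steal tasks from each other's deques. A small illustrative sketch (an analogy only, not how a GC partitions its work internally):

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Work-stealing in miniature: a big job is split into subtasks that are pushed
// onto per-worker deques; idle workers steal from busy ones, maximizing the
// scaling factor -- the same queuing principle GC task frameworks apply.
public class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000;
    private final long[] data;
    private final int from, to;

    ParallelSum(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {      // small chunk: process it in place
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        ParallelSum left = new ParallelSum(data, from, mid);
        left.fork();                        // push to the local deque; an idle worker may steal it
        long right = new ParallelSum(data, mid, to).compute();
        return right + left.join();
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        Arrays.fill(data, 1L);
        long sum = ForkJoinPool.commonPool().invoke(new ParallelSum(data, 0, data.length));
        System.out.println(sum);            // 1000000
    }
}
```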
The GC work for a low-latency-optimizing collector is a bit different in the sense that the garbage collector needs more control over its work, as it needs to guarantee sub-millisecond pauses. I call this “tunable work units.” The first thing needed to achieve fine granularity, aka tunability, is having the heap divided into regions. This means that as little as a single region can make up a GC cycle.
Regions add additional footprint concerns due to maintenance barriers, which, in G1’s case for example, are needed to keep the per-region remembered sets updated with old-to-old or old-to-young incoming references. The second thing needed is for the GC to do its work concurrently with the application, which will also introduce some maintenance barriers into the equation, as we need to maintain an accurate representation of the live object graph.
For a mark-compacting collector, the GC work lies in marking (to track and maintain the live object graph) and then performing compaction to avoid evacuation failures due to fragmentation. When designing a low-latency-optimizing GC, we look to complete the work concurrently and incrementally – hence the incremental (or multi-phased) marking and partial (or incremental) compaction.
Finally, for all concurrent collectors, I always like to mention the need for graceful degradation, which is when the GC asks the application to back off from allocations until it can catch up, or enlists the application’s help to do some of the GC work.
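The graceful-degradation idea can be sketched as a toy allocator in which the allocating thread pays a "GC tax" when free space runs low, instead of stalling the whole application. All names and numbers below are invented for illustration; real collectors are far more sophisticated:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model (hypothetical names/numbers): when free space dips below a
// watermark, the allocating thread itself performs units of reclamation work
// before its allocation proceeds -- graceful degradation rather than a global stall.
public class BackoffAllocator {
    private final AtomicLong freeBytes;
    private final long lowWatermark;

    BackoffAllocator(long capacityBytes, long lowWatermarkBytes) {
        this.freeBytes = new AtomicLong(capacityBytes);
        this.lowWatermark = lowWatermarkBytes;
    }

    /** Allocate, enlisting the caller in reclamation work if space is tight. */
    long allocate(long size) {
        while (freeBytes.get() - size < lowWatermark) {
            reclaimOneUnit();   // the mutator pays a "GC tax" instead of everyone pausing
        }
        return freeBytes.addAndGet(-size);  // returns free bytes remaining
    }

    private void reclaimOneUnit() {
        freeBytes.addAndGet(4096);  // pretend one unit of GC work frees one page
    }

    public static void main(String[] args) {
        BackoffAllocator heap = new BackoffAllocator(16_384, 8_192);
        long remaining = heap.allocate(12_000);
        System.out.println("free after allocation: " + remaining);
    }
}
```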
Dreamix: Generational GCs have been very successful in the past. Z GC is a single-generation GC right now. What is the importance of adding multi-generational support to Z GC in the future?
Monica Beckwith: As mentioned in the generations section above, most applications have shown that the weak-generational hypothesis holds and provides us with better throughput. In Z GC’s case, without generations, its throughput can suffer a lot, and many times there can be bloat with respect to footprint or CPU utilization. With generations, and with separate mark and compact for the young generation, Z GC not only saves on its footprint but also can achieve better throughput without increasing application pause times – which are currently sub-millisecond.
Dreamix: You mention a default “out of the box” garbage collector that works for average or even above-average users in one of your previous interviews. How hard is it to achieve this?
Monica Beckwith: Yes, I was talking about how the garbage collector has all the information on your workload that it may need to adapt to the nuances and changing patterns of your workload.
To make a GC adaptable we need to understand what we need to optimize and how – so let’s take a brief look at both. Note that below I am specifically talking about the tracing and moving / compacting collectors that we have in OpenJDK –
What needs optimizing?
For a well-designed copying/moving/compacting garbage collector, when you profile or look at the logs, you will find that the majority of the resources are consumed by the GC when it’s doing the live-object move/copy. This phase also causes a lot of disruption (with respect to generational GC upkeep, polluting the hardware caches, etc.). No matter the type of GC or the workload, the GC’s job is to try to keep this move/copy work as small as possible and do it as late as possible, so that the GC can mostly stay out of the way of the application.
How do we achieve that?
We need to have a support system comprising three things: adaptive reactions, near real-time predictions, and fast recovery. Let me explain each in detail:
– A GC is mostly reactive, even if it has some operations/cycles that are called “proactive” or “pre-emptive” it’s nothing but the GC reacting to the thresholds we have set within the GC. If we make these thresholds adaptive, then the GC’s reactions will be adaptive as well.
– Next comes the prediction logic – if the logic considers the real-time density of the data structures that need to be processed, and can affect the GC cycles by appropriately budgeting and triggering cycles as needed, as well as aborting or curtailing a cycle as needed, then we can get real-time predictions and take advantage of them.
– Lastly, all GCs should have the ability to fail fast by either gracefully “degrading”, aka pushing back on the application or enlisting the application to help, or having a fallback for failures. Both the “degradation” and the fallback procedures should provide quick recovery so that the collector can get back on top of its work.
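The adaptive-reaction idea can be sketched as a toy trigger threshold, loosely inspired by G1's adaptive IHOP (initiating heap occupancy percent); every number and method name here is invented for illustration:

```java
// Toy sketch: the occupancy threshold that triggers a concurrent cycle adapts
// based on how close the previous cycle came to exhausting the heap.
public class AdaptiveThreshold {
    private double occupancyTrigger = 0.45;   // hypothetical starting trigger

    // headroomLeft: fraction of heap still free when the last cycle finished
    void onCycleFinished(double headroomLeft) {
        if (headroomLeft < 0.10) {
            // Cycle barely finished in time: start earlier next time.
            occupancyTrigger = Math.max(0.20, occupancyTrigger - 0.05);
        } else if (headroomLeft > 0.30) {
            // Plenty of slack: we can afford to wait longer.
            occupancyTrigger = Math.min(0.80, occupancyTrigger + 0.05);
        }
    }

    boolean shouldStartCycle(double occupancy) {
        return occupancy >= occupancyTrigger;
    }

    public static void main(String[] args) {
        AdaptiveThreshold gc = new AdaptiveThreshold();
        System.out.println(gc.shouldStartCycle(0.42));  // below the 0.45 trigger
        gc.onCycleFinished(0.05);                       // last cycle barely made it
        System.out.println(gc.shouldStartCycle(0.42));  // trigger adapted down to 0.40
    }
}
```

A fixed threshold forces you to tune for the worst case; making the threshold itself react to feedback is what turns a reactive GC into an adaptive one.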
Our current default (since JDK 9) for OpenJDK is G1 GC. The OpenJDK engineers have made great improvements to the adaptability of G1. Thomas Schatzl has a few great write-ups on these improvements, and I did a few presentations back in 2020 on the same. Look for “Have You Really Taken the Time to Know Me? – A G1 GC Saga” online or on my SlideShare. With the improvements, I have noticed that in JDK 17, G1 reacts well to many applications. I have also noticed that fallback evacuation failures are not that expensive in G1 anymore. Kudos to the G1 team.
Dreamix: After a long time with G1 GC now we have Z GC as well. What are the main differences between these two?
Monica Beckwith: Z GC takes concurrency to the next level. Almost all the work that G1 did in a parallel, stop-the-world fashion is now done in a parallel, concurrent way, i.e., the stop-the-world pauses are not there anymore. They are replaced with very short thread-local handshakes (which are not global) and lots of concurrent activities. There are a few more things that Z GC introduced to OpenJDK:
– Colored pointers – Z GC relies on storing metadata in the 64-bit object pointer. Z GC in its current form (as of this writing) has 4 colors (Marked0, Marked1, Remapped, Finalizable). The first 3 are used to memory-map views of the same object. This technique is called virtual address masking.
– Concurrent compaction/relocation – Z GC introduced concurrent copying/relocation (moving objects from the “from” space into the “to” space) to OpenJDK. Most OpenJDK GCs (including G1 GC) use a forwarding address stored in the object header. Z GC doesn’t do that; instead, it relies on off-heap forwarding tables. This helps with the immediate release and possible reuse of heap memory.
– Load barriers – Z GC incorporates load barriers (aka LRBs, loaded reference barriers) that help with concurrent marking and concurrent relocation. These load barriers are needed to enforce a set of invariants: whenever an object reference is loaded from the heap, it will be safely “marked through” and will point to the current location of the target object it refers to.
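Conceptually, a load barrier of this kind can be sketched in plain Java as a checked load with a self-healing slow path. This illustrates the invariant only; the real barrier is a few machine instructions emitted by the JIT, and all names below are hypothetical:

```java
// Conceptual sketch (hypothetical names): a ZGC-style load barrier checks a
// metadata "color" when a reference is loaded from the heap; a "bad" color
// diverts the load to a slow path that marks/relocates and self-heals the field.
public class LoadBarrierSketch {
    static final class Ref {
        Object target;
        boolean goodColor;   // stands in for the metadata bits of the 64-bit pointer
    }

    static Object loadReference(Ref field) {
        if (field.goodColor) {
            return field.target;             // fast path: nothing to do
        }
        // Slow path: mark through / find the object's current location...
        Object healed = relocateOrMark(field.target);
        field.target = healed;               // self-heal: the next load takes the fast path
        field.goodColor = true;
        return healed;
    }

    static Object relocateOrMark(Object o) {
        return o;  // a real GC would copy the object or consult a forwarding table
    }

    public static void main(String[] args) {
        Ref field = new Ref();
        field.target = "payload";
        field.goodColor = false;             // simulate a stale reference after a GC phase flip
        System.out.println(loadReference(field));
        System.out.println(field.goodColor); // the field has been healed
    }
}
```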
Dreamix: In one of your presentations on Z GC, you state that the GC pause doesn’t increase with the application heap, live data set, or root set sizes. How does Z GC achieve that?
Monica Beckwith: Like all other OpenJDK GCs, Z GC is a tracing collector that performs both concurrent marking and concurrent compaction. Due to the concurrent nature of the GC, and the fact that it is capable of incremental marking and partial compaction, Z GC has better control over its unit of work. Z GC also has different GC triggers, such as timer-based, allocation-rate-based, utilization-based, proactive, etc. Z GC can also handle graceful degradation, where mutators (Java application threads) are blocked and can be enlisted to help with GC work. As for independence from root set size: during the initial mark phase, where the collector marks objects directly reachable from the roots, the other OpenJDK collectors pause the application to scan the thread stacks, which can get increasingly expensive as an STW activity depending on the application’s root set size. In JDK 16, thread stack scanning was moved to the concurrent phase, reducing the STW phase’s time to a bare minimum.
Dreamix: Are there any upcoming projects that excite you in the JVM or GC area?
Monica Beckwith: There are three things that I am keenly looking forward to in OpenJDK HotSpot VM:
– Generational low-latency collectors like Generational Z GC
– Project Valhalla: https://openjdk.org/projects/valhalla/
– Escape Analysis improvements to HotSpot VM – Microsoft is working on improving Scalar Replacement for allocation merges: https://bugs.openjdk.org/browse/JDK-8289943. We will introduce this behind the
ExperimentalVMOptions flag in Microsoft’s Build of OpenJDK for JDK 17 and JDK 11.
Dreamix: As you have made many contributions to the JVM (NUMA-aware allocator, reduction of redundant instructions, etc.), which one was the most important, do you think? And could you talk a little bit about it?
Monica Beckwith: The NUMA-aware allocator was the one where I learned the most, as we were just getting to know and learn about various access patterns on a NUMA (non-uniform memory access) platform/system. Apart from the CPU caches, there are various system-internal buffers and queues that taught me the need for software (the JVM in our case) to sympathize with the hardware. I also realized how differently OSes map memory. E.g., back then some OSes like Windows and Linux used the first-touch principle/policy to map memory to the NUMA domain that the allocating thread first touches. Only Solaris was different: it had the concept of locality groups (lgroups), and one could interleave the main memory across all lgroups. So, when you have a big shared generational heap on a multi-node NUMA system, it could happen that most of the young generation ends up on a single node on a Windows or Linux system, whereas on Solaris it gets evenly interleaved across all nodes. The single node in Windows’ and Linux’s case would then become a bottleneck (memory bandwidth, buffer latency, and cross-CPU traffic) for all young-generation heap accesses.
Dreamix: What would you suggest to engineers who want to be involved with JVM and pursue a career in Java performance like you?
Monica Beckwith: I always say: never pass on any opportunity that makes itself available to you. I started as a compiler engineer working on simple math-library improvements, then moved to JIT compiler optimizations and code-generation improvements with respect to x86-64 hardware (caches, pipelines, various predictors, etc.), and then eventually looked at GC – and the rest is history. I think that focusing on the core has made me appreciate the concept of immutability and how it is a prime target for various optimizations. The same is true for FFI (foreign function interface) and its nuances.
Dreamix: As you are co-teaching for the TEALS Program and First Lego League, how do you feel about educating youth in programming? What are the benefits for them and the community?
Monica Beckwith: I love how the youth think – you give them a discussion topic and almost always they will come
up with an innovative approach. Just the other day we had a few kids over to carve pumpkins and I saw innovation in terms of utilizing negative space to their advantage (think suspended bats with the dark moon backdrop) and there was one where the person used both front and back of the pumpkin to optimize the display of light effects. With such advanced, creative neurons, we just need to make sure that they have a coach (or a mentor, depending on your
expertise) that is not only encouraging but also an enabler. I think the kids of today are the Nobel prize winners of tomorrow – they will solve world problems if they understand the approach to it – collaboration, investigation, the what and the why.
Dreamix: How do you update yourself about the latest trends in Java?
Monica Beckwith: I keep reading and learning, and I look at the code. That’s how I started with Z GC. And then I have a few benchmarks like SPECjbb2015 (with non-default options) in my arsenal so that I can try out a new feature or optimization.
Dreamix: Can you recommend a favorite book about programming? What about a favorite book in general?
Monica Beckwith: Ah. I think The Complete Reference Java is a comprehensive book. For students we also have the Pearson Building Java Programs book that I like. My all-time favorite book is The Hobbit, although I read a lot more technical books/papers than fiction – so on the tech side, I would say anything to do with Computer Engineering, such as Advanced Computer Architecture or Operational Amplifiers.
Is there anything else you would like to ask Monica Beckwith? What is your opinion on the questions asked? Who would you like to see featured next? Let’s give back to the Java community together!