Java Memory Model

Java owes its success largely to the promise of consistent behavior across different platforms – write once, run anywhere. The Java Virtual Machine (JVM) plays a crucial role in providing guarantees about program execution regardless of the underlying hardware and operating system. One of the apparent challenges is performance optimization. Program instructions don’t necessarily execute in the same order as they appear in the source code. A compiler, or even the CPU itself, goes above and beyond trying to run instructions in parallel on multiple cores. In multi-threaded applications this parallelism can lead to unexpected bugs. This post looks at the Java memory model as a framework that allows an informed developer to avoid the most common pitfalls associated with writing multi-threaded applications.

In layman’s terms, the Java Memory Model is a specification (JSR-133) that defines how multi-threaded applications work with shared memory. There are many moving parts. There is RAM – memory shared across all cores. How do threads synchronize when reading or updating an object in the shared memory? There is a cache local to each and every core. At which point is this local state flushed to the main memory? Last but not least, what are the performance trade-offs of advanced thread synchronization? The Java Memory Model establishes a set of rules that allow us to reason about the behavior of complex programs that are highly concurrent in nature.

The Java Memory Model was the first attempt to provide a comprehensive threading memory model for a popular programming language.

Wikipedia

The initial version of the memory model received a major overhaul with the release of Java 5.0 in 2004 and it hasn’t changed since then.

Without further ado, let’s delve right into it.

How to Read This Post?

If you are pressed for time or get tired of reading from the start, feel free to skip to the section that sparks your interest.

  1. Instruction Reordering
  2. Out of Order Execution
  3. Java Memory Model
  4. Field Visibility Problem
  5. Flushing the Memory
  6. Volatile Keyword
  7. Happens-Before Guarantee
  8. Concurrent Writes
  9. Synchronization
  10. Summary

Instruction Reordering

When you write a piece of code you naturally expect the instructions to be executed in the order in which they appear in your source code. Consider this simple example.

1: x = 1
2: y = 2
3: x = x + y

These statements execute exactly as you would expect. There is no way line 3 would run before both x and y are initialized. Let me make a slight adjustment:

1: x = 1
2: y = 2
3: x = x + 3

Now there is no dependency between x and y. This presents an opportunity for a performance gain. Notice how x is initialized on line 1 and then has to be loaded and stored again on line 3. The compiler sees something like this:

Load x
Set x to 1
Store x
Load y
Set y to 2
Store y
Load x
Set x to 4
Store x

The compiler will very likely want to skip unnecessary steps. Why load and store x twice? Here is what the compiler will probably do:

Load x
Set x to 1
Set x to 4
Store x
Load y
Set y to 2
Store y

This is perfectly fine since the final result is the same. The semantics of the program didn’t change. From the source code point of view, however, the order of execution changed and line 3 executes before line 2:

1: x = 1
3: x = x + 3
2: y = 2

Out of Order Execution

Multi-core processors create space for significant performance gains through parallelism. The compiler, or even the CPU itself, analyses the program and organizes it into blocks that can run in parallel without depending on each other’s results.

Image 1: CPU looks ahead for independent instructions and runs them in parallel on different cores.

Let’s write some real code to prove that instructions are indeed executed out of order.
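The original listing isn’t reproduced in this post, so here is a minimal sketch of the kind of test described below. The class name, the loop and the thread setup are my own assumptions, and r1 and r2 are fields (rather than locals) so that the lambdas can write to them:

  public class OutOfOrderDemo {

    private int A, B;     // shared state, initially 0
    private int r1, r2;   // values observed by the two threads

    private void init() throws InterruptedException {
      A = 0; B = 0; r1 = 0; r2 = 0;

      // Each thread first copies one of the initial values
      // and then assigns the other field a new value.
      Thread t1 = new Thread(() -> { r1 = B; A = 1; });
      Thread t2 = new Thread(() -> { r2 = A; B = 2; });

      t1.start(); t2.start();
      t1.join();  t2.join();

      System.out.println("r1: " + r1 + ", r2: " + r2);
    }

    public static void main(String[] args) throws InterruptedException {
      OutOfOrderDemo demo = new OutOfOrderDemo();
      for (int i = 0; i < 20; i++) {
        demo.init();
      }
    }
  }

Whether you actually observe the reordered result depends on the JIT compiler and the hardware, so treat this as an illustration of the setup rather than a guaranteed reproduction.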

There are two member variables A and B, both with an initial value of 0. The init method copies the initial values of the member variables to the local variables r1 and r2 and then assigns the member variables new values. When the results are printed, you would expect the local variables r1 and r2 to contain the initial values of A and B, i.e. the expected output is:

r1: 0, r2: 0

The program runs on two threads, just to increase the chance of observing something unusual. Indeed, after a few iterations the output is quite different from what you would expect:

r1: 0, r2: 0
r1: 2, r2: 1
r1: 2, r2: 1
r1: 2, r2: 1
r1: 0, r2: 0
r1: 2, r2: 1
r1: 2, r2: 1
r1: 0, r2: 0
r1: 2, r2: 1
r1: 0, r2: 0
r1: 2, r2: 1
r1: 0, r2: 0
r1: 0, r2: 0
r1: 0, r2: 0
r1: 2, r2: 1
r1: 0, r2: 0
r1: 0, r2: 0
r1: 2, r2: 1
r1: 0, r2: 0
r1: 2, r2: 1

About half of the time the local variables contain the values that, in the source code, are assigned to A and B only after the local variables are initialized.

The processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

Source: Wikipedia

Yes, out of order execution is intentional and necessary. Keeping a processor idle is like leaving a car engine running while the vehicle stands still. It is wasteful.

In a simple program, like the one above, shuffling statements around is nothing to worry about. The semantics remains unchanged after all. How about more complex applications though? What if there are assumptions about a variable having a particular value at a certain point in time? Can we rest assured the results will be consistent? Not at all.

Before we move any further and try to understand the challenges with multi-threaded applications, let’s have a look at how the Java memory model works in practice.

Java Memory Model

AMD Ryzen 9 5900X: 12 cores, 3.7 GHz base clock speed, $549 (June 2021). Source: Wikipedia

Elementary knowledge of how the processor works and what performance trade-offs there are makes the abstract memory model much easier to understand.

Modern computers typically have multiple processors. Each processor itself comprises multiple embedded processing units called cores. A core is essentially an execution unit inside a processor; it fetches program instructions and executes them.

The ability to quickly access memory and cache data is key to good performance. The faster the access, the more instructions a core is able to crunch, continuously fetching data from memory and saving intermediate results back. The memory access pipeline is layered. There is an important trade-off between speed and size. The larger the memory, the slower the access to it. Memory local to a core (registers, L1 cache) is readily available, but it can only store a limited amount of data. Accessing the main memory (RAM), on the other hand, is expensive in terms of time, but there are almost no limits when it comes to storage space. Another important aspect is visibility. We will delve into this topic in a minute, but for now just note that fast memory remains local to a processor core, whereas the slower main memory is shared across all processors and their cores.

Hint: Speaking of latency, save this gem and keep it close – Latency Numbers Every Programmer Should Know.

Image 2: Physical Memory Model – CPU cache pipeline and main memory, the tradeoff between size and latency.

Multi-core processors execute tasks in parallel. A complex computation is broken down into independent sub-tasks which execute simultaneously on multiple cores. Parallel computations run on multiple threads. Unlike a core – which is the actual hardware a computation runs on – a thread is a virtual component that drives the computation. Threads run concurrently and share CPU and memory.

Image 3: Program execution starts in the main memory, but threads use CPU cache to store computation results. Source: Wikipedia

You can ignore most of Image 3. Just note that a program’s initial state is loaded into the main memory (RAM). This memory is shared by all threads. However, as the threads progress, each with their own agenda, they store intermediary results in the local CPU cache (registers, L1). These results eventually propagate to the shared memory (L2, L3, RAM), where all other threads can see and access them. However, there are no guarantees about when the computation results will appear in the main memory.

Field Visibility Problem

Since threads copy the shared state to a local cache and it takes a while before the final result appears in the shared memory, there is a good chance that some of the threads will work with an outdated version of the shared state.

Image 4: Field visibility problem – the state of a shared variable gets out of sync between the two threads.

Let’s prove this. In the code below, there is a shared variable x and two threads. One of them acts as a reader and only checks for changes in x’s value. It prints out the latest value of x whenever it spots a change. The other thread modifies the value and reports back the new value of x.
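The full listing is available on GitLab (linked below); a minimal sketch of the same idea, with a class name and timing constants of my own choosing, could look like this:

  public class FieldVisibilityDemo {

    int x = 0;   // shared state, deliberately NOT volatile (yet)

    void reader() {
      int last = x;
      // Without volatile the reader may never observe some (or any)
      // of the updates and can even spin here indefinitely.
      while (last < 5) {
        if (x != last) {
          last = x;
          System.out.println("reader: x = " + last);
        }
      }
    }

    void writer() throws InterruptedException {
      for (int i = 1; i <= 5; i++) {
        x = i;
        System.out.println("writer: x = " + x);
        Thread.sleep(100);   // give the reader time to notice the change
      }
    }

    public static void main(String[] args) {
      FieldVisibilityDemo demo = new FieldVisibilityDemo();
      new Thread(demo::reader).start();
      new Thread(() -> {
        try {
          demo.writer();
        } catch (InterruptedException ignored) {
        }
      }).start();
    }
  }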

Note how the writer thread sleeps after each update, leaving enough time for the reader thread to observe the new value. However, this is what you get when you run the demo:

reader: x = 1
writer: x = 1
writer: x = 2
writer: x = 3
writer: x = 4
writer: x = 5

As you can see, the reader missed 4 updates out of 5! This result was consistent for me across multiple runs of the program. You might get a different result, but it is very likely that the reader will miss at least some of the updates.

The problem is exactly as depicted in Image 4. The updated value of x lands in the writer’s local cache and it takes a while before it propagates into the shared memory. Thankfully, there are ways to make the changes immediately visible.

Flushing the Memory

Java Memory Model makes it possible to reason about code execution in a multi-threaded environment, even in the face of optimizations performed by the dynamic compiler, the processor(s) and the caches.

Source: Wikipedia

To resolve the visibility problem shown in the previous section we only need to make a subtle change in our code. We simply have to mark the shared variable x as volatile.

volatile int x = 0;

View source code on GitLab

When you re-run the example you will find that the reader thread never lags behind and consistently sees the latest value of x.

writer: x = 1
reader: x = 1
writer: x = 2
reader: x = 2
writer: x = 3
reader: x = 3
writer: x = 4
reader: x = 4
writer: x = 5
reader: x = 5

Now, what does volatile mean and how exactly does it work?

Volatile Keyword

The purpose of the volatile keyword is simple – visibility. Changes made to variables marked as volatile are immediately flushed to the main memory shared by all threads. The JVM guarantees that volatile variables are always read from the main memory, never from a cache local to the reader thread. Similarly, any writes are immediately flushed to the main memory instead of being cached locally by the writer thread. Simply put, the latest state of a volatile variable is always visible to all threads.

Using volatile has memory effects similar to synchronization via monitor locks. Reading a volatile variable has the same memory semantics as acquiring a monitor, and writing to a volatile variable has the same memory semantics as releasing a monitor – no actual lock is acquired, though. All this implies limitations on out of order execution. The compiler can no longer reshuffle just any instructions as it sees fit. Why? Let me borrow an example from this old article from 2004.

Hint: I highly recommend going through JSR 133 (Java Memory Model) FAQ. It succinctly explains how volatile works behind the scenes as well as other core concepts of the Java memory model.

Consider the following example.

01 class VolatileExample {
02  int x = 0;
03  volatile boolean v = false;
04  public void writer() {
05    x = 42;
06    v = true;
07  }
08
09  public void reader() {
10    if (v == true) {
11    //uses x - guaranteed to see 42.
12    }
13  }
14 }

Source: JSR 133 (Java Memory Model) FAQ

Assume two threads: a reader thread that only ever calls the reader() method and a writer thread that only ever uses the writer() method. Notice how the reader thread depends on the shared volatile flag v. Now, even though the shared variable x is not marked as volatile, it would defeat the purpose if the reader observed v == true and still saw an outdated value of x. The reader relies on the guarantees of volatile after all. For this to work the compiler can no longer swap lines 5 and 6. The JVM guarantees that the write to x on line 5 happens before the update of v on line 6.

Happens-Before Guarantee

In order to address the field visibility problem, the volatile keyword imposes restrictions on instruction reordering (out of order execution). This is necessary so that not only the volatile variable itself, but also reads and writes to other surrounding variables are visible to all threads.

In a nutshell the Happens-Before guarantee boils down to the following rules.

  1. Reads and writes that occur before a write to a volatile variable can never be reordered to occur after that volatile write.
  2. Reads and writes that occur after a read of a volatile variable can never be reordered to occur before that volatile read.

Note how the rules complement each other.

Regarding rule 1, there is a guarantee that whatever instructions happen before a write to a volatile variable are always executed prior to the change of the volatile variable. However, these instructions can be reordered among themselves. That is perfectly fine.

Image 5: Rule 1 of Happens-Before Guarantee

Similarly, reads and writes to shared state that in the source code appear after a read of a volatile variable cannot be reshuffled to occur before the volatile variable is read.

Image 6: Rule 2 of Happens-Before Guarantee
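To make the two rules more concrete, here is a small sketch of my own (building on the VolatileExample above) that annotates which reorderings are and aren’t allowed around a volatile write and a volatile read:

  class HappensBeforeDemo {

    int a, b;                  // plain, non-volatile fields
    volatile boolean flag;     // the volatile variable

    void writer() {
      a = 1;                   // these two writes may still be reordered with each other...
      b = 2;
      flag = true;             // ...but neither may be moved below this volatile write (rule 1)
    }

    void reader() {
      if (flag) {              // volatile read
        // Reads of a and b cannot be moved above the volatile read (rule 2),
        // so once flag is observed as true, a == 1 and b == 2 are guaranteed.
        System.out.println(a + b);   // always prints 3
      }
    }
  }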

As you can see, volatile is in essence a means of synchronization. However, using volatile alone is not always enough to make programs thread-safe. Imagine there are multiple writer threads competing to update the same volatile variable. While the final result is always guaranteed to be visible to all reader threads, it is not clear exactly which value that result will be – volatile does not shield us from race conditions.

Concurrent Writes

Up until now we only ever had a single writer thread. Let me make a slight adjustment to our example. From now on, there will be two writer threads updating the very same volatile variable.
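The modified listing isn’t shown here; a rough sketch of the setup, with class names and timings that are my own, might look like this:

  public class ConcurrentWritersDemo {

    volatile int x = 0;   // visible to all threads, but updates are not atomic

    void writer(String name) throws InterruptedException {
      for (int i = 0; i < 5; i++) {
        x = x + 1;   // read-modify-write is NOT atomic, even on a volatile field
        System.out.println(name + ": x = " + x);
        Thread.sleep(50);
      }
    }

    void reader() {
      long end = System.currentTimeMillis() + 1_000;   // watch for changes for a second
      int last = x;
      while (System.currentTimeMillis() < end) {
        if (x != last) {
          last = x;
          System.out.println("reader: x = " + last);
        }
      }
    }

    public static void main(String[] args) {
      ConcurrentWritersDemo demo = new ConcurrentWritersDemo();
      new Thread(demo::reader).start();
      Runnable w1 = () -> { try { demo.writer("writer1"); } catch (InterruptedException ignored) { } };
      Runnable w2 = () -> { try { demo.writer("writer2"); } catch (InterruptedException ignored) { } };
      new Thread(w1).start();
      new Thread(w2).start();
    }
  }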

If you run this program multiple times you will likely receive different results. That is never a good sign. If you have a closer look you will notice that the value of x as perceived by the reader and the two writer threads is somewhat inconsistent.

01 writer1: x = 1
02 writer2: x = 2
03 reader: x = 2
04 writer2: x = 3
05 reader: x = 3
06 writer1: x = 3
07 writer1: x = 5
08 reader: x = 5
09 writer2: x = 5
10 writer2: x = 7
11 writer1: x = 7
12 reader: x = 7
13 reader: x = 8
14 writer2: x = 8
15 reader: x = 9
16 writer1: x = 9

In the output above, it all starts well. On line 1 we can see the first update of x, the same on line 2 by the other writer thread, and that’s reflected by what the reader sees on line 3. Suddenly, there is a gap in the sequence on line 7. On lines 9 and 10 the two writer threads really get out of sync when one of them sees an obsolete value of x. Can you guess why?

Well, it’s simply a race condition. Having two writer threads at play exposes the shared state to concurrent writes. While volatile ensures visibility, additional synchronization is required to deal with race conditions.

Race Conditions

Before moving on, let’s simplify our example even further, disregard the reader thread and focus only on concurrent updates of a shared state. Suppose we wanted to implement a counter. It won’t get much simpler than that. Our first naive attempt is very unassuming. A private member variable count can only be modified by making a call to increment(), which increases the value of the counter by one. No issues there. However, what happens if there are multiple threads trying to increment the counter simultaneously?

Side note: In the section below the aim is to spin up many concurrent tasks. Creating threads and runnables like I did in my previous examples would be tedious. Java provides ExecutorService – a framework that makes all of this much easier.

It is important to correctly terminate all running tasks before the program shuts down. I have created a utility method that deals with that – see ConcurrentUtils.stop().
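Neither the naive counter nor the test harness is reproduced in this text, so here is a rough, self-contained sketch of what the description suggests. The class name is mine and the author’s ConcurrentUtils.stop() is replaced with a plain shutdown so the example compiles on its own:

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  public class NaiveCounter {

    private int count = 0;

    void increment() {
      count++;   // not atomic: read the value, add one, write it back
    }

    int get() {
      return count;
    }

    public static void main(String[] args) throws InterruptedException {
      NaiveCounter counter = new NaiveCounter();
      ExecutorService executor = Executors.newFixedThreadPool(10);

      // 100 tasks, each increments the counter exactly once
      for (int i = 0; i < 100; i++) {
        executor.submit(counter::increment);
      }

      // terminate all running tasks before reading the result
      executor.shutdown();
      executor.awaitTermination(10, TimeUnit.SECONDS);

      System.out.println("total count is " + counter.get());
    }
  }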

As you can see, in my implementation of the naive counter I submit 100 tasks, each of which increments the counter exactly once. If all works as expected the final value of the counter should consistently be 100. However, since all the tasks rush to update a single shared state there are concurrent writes. The count++ operation is not atomic – it is a read, an increment, and a write back – so two threads can read the same value and one of the increments simply gets lost. This is what I got in one of the iterations.

01: total count is 100
02: total count is 100
03: total count is 98
04: total count is 100
05: total count is 99
06: total count is 96
07: total count is 96
08: total count is 100
09: total count is 93
10: total count is 100

Now, how to shield against concurrent writes?

Synchronization

The simplest possible fix is to mark the method that updates the shared state as synchronized.

synchronized void increment() { count++; }

Synchronized methods enable a simple strategy for preventing thread interference and memory consistency errors: if an object is visible to more than one thread, all reads or writes to that object’s variables are done through synchronized methods.

Source: Oracle

Simple as it is, using synchronized at the method level can in some cases significantly decrease throughput (i.e. diminish so-called liveness). Luckily, synchronized can also be applied only to a critical section of code. It doesn’t have much use in our simple case, but here is what you can do to limit the scope of synchronization.

  void increment() {
    synchronized (this) {
      count++;
    }
    // Other computations that are thread safe 
  }

With the help of a bit of extra logging it is easy to see how the threads queue up, competing for access to the synchronized block.

  void increment() {
    System.out.println(Thread.currentThread().getName() + 
      " wants to enter synchronized block");
    synchronized (this) {
      System.out.println(Thread.currentThread().getName() + 
        " entered synchronized block");
      count++;
    }
    // Other computations that are thread safe
  }

pool-1-thread-9 wants to enter synchronized block
pool-1-thread-2 wants to enter synchronized block
pool-1-thread-7 wants to enter synchronized block
pool-1-thread-1 wants to enter synchronized block
pool-1-thread-6 entered synchronized block

How exactly does synchronized work? Well, each object has an internal lock, commonly known as an intrinsic lock, a monitor lock, or simply a monitor. When a thread enters a synchronized method it acquires the monitor of the object the method belongs to. From that moment on, the thread owns the monitor and no other thread can enter the method. Once the thread leaves the method it releases the monitor and the intrinsic lock is up for grabs by any of the waiting threads.

Synchronization establishes a happens-before relationship, which again is key to visibility. Not only will no two threads ever modify the shared state simultaneously, it is also guaranteed that the threads will always see the latest value of the shared state.

One word of caution: threads must synchronize on the same monitor, otherwise the synchronization won’t work. Of course! If all this is about watching out for changes on a particular object, then surely all threads must monitor that very same object.

  void increment() {
    // This won't work!
    synchronized (new SynchronizedCounter()) {
      count++;
    }
    // Other computations that are thread safe 
  }

There is more to handling concurrency in Java. Using synchronized merely scratches the surface as Java provides a whole locking framework. However, that’s a discussion for another day.

Summary

The Java Memory Model is a specification (JSR-133) that describes how multi-threaded programs work with shared memory. The model closely follows the actual hardware architecture and establishes rules so that programs can fully benefit from concurrent computation without compromising the shared state. It also provides means of safe synchronization and spells out the performance trade-offs, namely the limitations imposed on instruction reordering. Hope you’ve enjoyed this article. Thanks for reading this far and let me know your thoughts in the comment section below.
