Write barriers and an old Ruby bug

Here is a program that crashes Ruby 2.7.0:

hash = { a: Object.new }
GC.start
hash.transform_values!{ Object.new }
GC.start(full_mark: false)
GC.start

The crash message is scary and subtly hints at the presence of a bug:

[BUG] try to mark T_NONE object

In this post, I will explain the reason behind the crash and suggest some ways to avoid similar issues in the future.

Is it a bug in the garbage collector?

Since the program crashes without the help of third party C extensions and crashes even when Ruby is launched with --disable-gems, it’s clear that the problem is in the Ruby runtime. Looking at the C stack trace, we can see that gc.c initiates the crash. How can we prove beyond a reasonable doubt that the GC itself is guilty of this crash? It turns out, there is a method, GC.verify_internal_consistency that can help us out.

Adding a call to this method right after the call to transform_values! in our reproducer, we get a different message:

verify_internal_consistency_reachable_i: WB miss (O->Y) T_HASH -> T_OBJECT
crash.rb:4: [BUG] gc_verify_internal_consistency: found internal inconsistency.

Okay, it looks like some invariant that the GC cares about is not upheld, but what is a “WB miss”?

Incremental mark and sweep

WB stands for write barrier so gc_verify_internal_consistency is telling us that there is a “write barrier miss”. To understand what write barriers are and what they are for, we first need to understand how the GC functions. Once upon a time, Ruby had a stop-the-world mark-and-sweep GC. “Stop-the-world” sounds grandiose, but it basically means no Ruby code is running while the GC is doing work. I made some animated slides to explain the stop-the-world GC algorithm. ¹

This works okay, with one disadvantage: when there are many objects alive, the GC could pause Ruby code execution for an unacceptably long amount of time. Pausing for just 17 milliseconds in a video game could be devastating as that can easily make the game miss its frame rate target.

To deal with this problem, we could break up the work the GC has to do into chunks and interleave them with Ruby code execution. To go back to the video game example, instead of pausing for say 17 milliseconds a single time, we pause for 1 millisecond 17 times, allowing Ruby code to execute between pauses. If the game has 17 milliseconds to render each frame, instead of taking up 17 milliseconds once in a while, blowing through the frame time budget, the GC now takes up 1 millisecond once in a while. The GC potentially takes longer to release garbage objects, because the total amount of work is still the same as before, but that should be okay for a lot of applications.

Write barriers

Great, we now have an incremental GC but we have also introduced a new problem. What happens if we finish marking an object O, let Ruby code execute, and the Ruby code associates O with an object J that would otherwise be unreachable?

We know J is alive since O is alive and all objects referenced by alive objects are also alive. However O is already marked, so the GC will not mark it again to see that O now refers to an object it didn’t see the first time O was marked. Nevertheless, the GC cannot treat J as garbage. We need some way to address this problem.

We could make the Ruby code inform the GC every time it associates an object with another one. This way, when the Ruby code makes changes to an object the GC already marked the GC could be aware of objects that it potentially was unaware of at the time of marking:

The write barrier is a piece of code that informs the GC every time an object starts to reference a new object. In addition to facilitating incremental GC, write barriers also help with generational GC.

How to “miss” a write barrier

Going back to our crash reproducer, we can now guess that Hash#transform_values! is missing a write barrier somewhere. After all, it’s the only thing that can mutate the object reference graph after the GC runs. This indeed was the problem.

There are many garbage collected languages that ship with compilers that can automatically insert write barriers. CRuby does not have that luxury, however, and it is up to the developers to manually insert write barriers where necessary. In this particular case a write barrier was present but erroneously removed.

Consequences

Missing write barriers can make the GC collect live objects and lead to use-after-free. This can lead to crashes like we have seen, or worse, silent data corruption. The following program demonstrates data corruption:

hash = { a: Object.new }
GC.start
hash.transform_values!{ 2 ** 99 }
p hash
GC.start(full_mark: false)
p hash

=begin
$ ruby -v corrupt.rb
ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-darwin19]
{:a=>633825300114114700748351602688}
{:a=>":a"}
=end

Running the GC should not have visible impact on live objects.

Possible ways to prevent this in the future

To quote Koichi Sasada in a paper about CRuby’s write barriers:

In general, inserting WBs correctly is a difficult and time-consuming task because WB-related bugs cause critical issues and it is difficult to debug them.

Indeed, write barrier bugs can be very elusive. I certainly don’t put it past myself to accidentally introduce one. They are hard to spot by testing. To illustrate, try removing or changing lines that start with GC in the reproducer – a lot of the time the program doesn’t crash and appears to work correctly.

I’m going to suggest some ways that can make forgetting to insert and/or accidentally removing write barriers more difficult.

Education and code reviews

It’s easy to forget to check for write barriers because a lot of the time one can compose the desired code change from pre-existing functions that already deal with write barriers correctly. Maybe the only thing that was stopping someone from catching this particular write barrier removal was a reminder to look for it. A gentle reminder on Github’s pull request template could make sense.

I also think more people should understand write barriers enough that they know when to insert them. Write barriers form a very important contract between the GC and the rest of the virtual machine. People making changes should be aware of this contract.

Static analysis

Humans are unreliable and can easily make mistakes. It would be ideal if we could run a program and get an assessment as to whether a write barrier is missing. For this particular case, it seems reasonable to write a Clang-based static analyzer that could catch it. However, I don’t know how useful that would be for unknown cases that might reveal themselves in the future. Also, the ergonomics of static analyzers are hard to get right. Among other potential issues, just a few false alarms can make them too annoying to use.

Nevertheless, maybe it’s worth it to make a best-effort static analyzer to help new contributors and experienced ones alike.

Summary

Write barriers are an important part of Ruby’s incremental GC that updates the GC’s understanding of the object reference graph
CRuby developers are responsible for manually inserting write barriers wherever necessary
Failing to insert write barriers can cause catastrophic failures such as crashes and data corruption

Footnotes

This is a textbook tri-color mark algorithm. Typically, the three colors are white, grey, and black. I chose different colors in an effort to help colorblind readers. ↩︎

2021-07-22

https://alanwu.space/post/write-barrier/