Daniel Barlow pointed out that my previous explanation of certain aspects of the SBCL gencgc was slightly flawed. The size of an allocation region (which determines how often we can use inline allocation on x86-64) actually defaults to two GC pages, not one. Furthermore, it can be adjusted simply by frobbing one magic number in the source. So I repeated the experiment, but this time varying the allocation region size instead of the page size. And... the performance improvement from larger regions was marginal. This invalidates my main theory about why larger pages gave better performance in my last test, and I have yet to come up with another one. Ideas welcome.
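To make the mechanism concrete, here's a toy sketch of region-based inline allocation. This is my own illustration, not SBCL's actual code; all names and the stand-in arena are invented. The point it shows: the fast path is just a compare and a pointer bump (which is what the compiler can emit inline on x86-64), and only an exhausted region forces a call out to the slow path, so the region size controls how often the slow path runs.

    /*
     * Toy sketch of region-based inline allocation -- an illustration,
     * not SBCL's actual code.  Fast path: compare and bump.  Slow
     * path: only when the current region is exhausted.
     */
    #include <stddef.h>
    #include <stdio.h>

    #define REGION_BYTES (2 * 4096)        /* e.g. two GC pages */
    #define ARENA_BYTES  (1 << 20)         /* toy stand-in for the heap */

    struct alloc_region {
        char *free_pointer;                /* next free byte */
        char *end_addr;                    /* one past the region's end */
    };

    static char arena[ARENA_BYTES];
    static size_t arena_used;

    /* Slow path: return the exhausted region and claim a fresh one.
       A real collector might decide to run a GC here instead. */
    static void *gc_alloc_slow(struct alloc_region *r, size_t nbytes)
    {
        if (nbytes > REGION_BYTES || arena_used + REGION_BYTES > ARENA_BYTES)
            return NULL;                   /* toy arena exhausted */
        r->free_pointer  = arena + arena_used;
        r->end_addr      = r->free_pointer + REGION_BYTES;
        arena_used      += REGION_BYTES;
        r->free_pointer += nbytes;
        return r->free_pointer - nbytes;
    }

    /* Fast path: roughly what gets emitted inline at each allocation. */
    static inline void *gc_alloc(struct alloc_region *r, size_t nbytes)
    {
        char *new_free = r->free_pointer + nbytes;
        if (new_free <= r->end_addr) {
            char *result = r->free_pointer;
            r->free_pointer = new_free;
            return result;
        }
        return gc_alloc_slow(r, nbytes);   /* region full: refill */
    }

    int main(void)
    {
        struct alloc_region r = { arena, arena };  /* starts exhausted */
        for (int i = 0; i < 1000; i++)
            gc_alloc(&r, 64);
        printf("%zu bytes of arena used\n", arena_used);
        return 0;
    }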

While thinking about this, I ended up re-reading some great diary entries about improving the GC that Dan wrote a while back. You should read the originals for the full picture, but essentially PURIFY and the gencgc don't mix, since the static space doesn't have a write barrier and needs to be scanned on every GC. The improvements were never quite finished due to (I gather) real life interfering with hacking. I redid the easy part, building an SBCL with data only in the dynamic space, to see how much difference this actually makes.
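For context, here's a rough sketch of the general card-marking idea behind a write barrier. Again, these are invented names and my own illustration, not SBCL's implementation. With the barrier, a pointer store marks the card it touched, and a minor GC only scans dirty cards for pointers into younger generations; a space whose objects are never written through such a barrier gives the collector no such information, so its only safe option is to scan that whole space on every GC.

    /*
     * Rough sketch of a card-marking write barrier -- invented names,
     * not SBCL's implementation.  Stores mark the card they touch;
     * a minor GC then scans only the dirty cards.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define CARD_BYTES 4096
    #define NCARDS     1024

    static uint8_t card_marks[NCARDS];     /* one dirty flag per card */
    static char   *heap_base;              /* base of the barriered space */

    /* The write barrier, run on every pointer store into the heap. */
    static inline void write_ref(void **slot, void *value)
    {
        *slot = value;
        card_marks[((char *)slot - heap_base) / CARD_BYTES] = 1;
    }

    /* Minor GC: only the dirty cards need to be looked at. */
    static void scan_dirty_cards(void (*scan)(char *start, char *end))
    {
        for (size_t i = 0; i < NCARDS; i++)
            if (card_marks[i]) {
                char *start = heap_base + i * CARD_BYTES;
                scan(start, start + CARD_BYTES);
                card_marks[i] = 0;         /* clean until written again */
            }
    }

    static void report(char *start, char *end)
    {
        (void)end;
        printf("scanning card at offset %ld\n", (long)(start - heap_base));
    }

    int main(void)
    {
        heap_base = malloc((size_t)NCARDS * CARD_BYTES);
        void **slot = (void **)(heap_base + 5 * CARD_BYTES);
        write_ref(slot, heap_base);        /* dirties card 5 */
        scan_dirty_cards(report);          /* visits only card 5 */
        free(heap_base);
        return 0;
    }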

Turns out it's a huge improvement in many cases. For example, the ansi-test suite runs over 20% faster (TIME output from the old and new builds):

-  567.108 seconds of real time
-  498.15225 seconds of user run time
-  25.631104 seconds of system run time
-  47 page faults and
-  21,418,564,160 bytes consed
+  429.172 seconds of real time
+  390.06268 seconds of user run time
+  2.590806 seconds of system run time
+  5 page faults and
+  21,316,008,096 bytes consed

This seems like a pretty good deal, so I'll try to polish this into a releasable state at some point.

Update: Oops. I wrote "static" where I meant "dynamic" at one point, which made the whole paragraph nonsense. Thanks to Nikodemus for pointing this out.