YJIT is the most memory-efficient Ruby JIT

JohnBooty · on Nov 14, 2023

Wow, Shopify continues to make some heroic improvements here. Kudos, kudos, kudos. Thanks, Shopify folks.

One thing I didn't see discussed in the article was YJIT's memory usage relative to CRuby, the baseline non-JIT version of Ruby. It is certainly possible I missed it; that's been known to happen!

Anyway, the news there is very good. We can see detailed information here:

https://speed.yjit.org/memory_timeline.html#railsbench

Currently Railsbench consumes a peak of ~95MB with CRuby, and a peak of ~110MB with YJIT. So, YJIT delivers 70% more performance while consuming 16% more RAM here. That is a tradeoff I think most people would gladly accept in most scenarios. =)

Real-world speedups will be less, since a "real" web application spends much of its time waiting for the database and other external resources. As the article notes, Shopify's real-world observed storefront perf gain is 27.2%.

YJIT is a success and its future is even brighter.

jez · on Nov 15, 2023

The Memory Usage on Benchmarks section shows CRuby as one of the bars. It is the lowest across the board. YJIT, being implemented in CRuby, uses at least as much memory as CRuby in all the benchmarks (you could maybe imagine one day JIT’d code using less memory in specific situations by compiling code to a more efficient algorithm that skips allocations the interpreter would make, but it appears that is not the case yet today).

Freaky · on Nov 14, 2023

> We were very generous in terms of warm-up time. Each benchmark was run for 1000 iterations, and the first half of all the iterations were discarded as warm-up time, giving each JIT a more than fair chance to reach peak performance.

1,000 iterations isn't remotely generous for JRuby, unfortunately - JVM's Tier-3 compilation only kicks in by default around 2,000, and full tier-4 is only considered beyond 15,000. I've observed this to have quite a substantial effect, for instance bringing manticore (JRuby wrapper for Apache's Java HttpClient) down from merely "okay" performance after 10,000 requests to pretty much matching the curb C extension under MRI after 20,000.

You can tweak it to be more aggressive, but I guess this puts more pressure on the compiler threads and their memory use, while reducing the run-time profiling data they use to optimize most effectively. It perhaps also risks more churn from deoptimization. I kind of felt like I'd be better off trying to formalise the warmup process.

It's rather a shame that all this warmup work is one-shot. It would be far less obnoxious if it could be preserved across runs - I believe some alternative Java runtimes support something like that, though given JRuby's got its own JIT targetting Java bytecode I dare say it would require work there as well.

Twirrim · on Nov 14, 2023

Agreed, the JVM's thresholds for method compilation is higher than the test strategy seemed to account for.

Also, quoting from the site:

    TruffleRuby eventually outperforms the other JITs, but it takes about two minutes to do so. It is also initially quite a bit slower than the CRuby interpreter, taking over 110 seconds to catch up to the interpreter’s speed. This would be problematic in a production context such as Shopify’s, because it can lead to much slower response times for some customers, which could translate into lost business.

There's always ways around that, for example pushing artificial traffic at a node as part of a deployment process, prior to exposing it to customers. I've known places that have opted for that approach, because it made the best sense for them. The initial latency hit of JIT warm-up wasn't a good fit for their needs, while every other aspect of using a JIT'd language was.

As ever, depends on the trade-off if it's worth the extra work to do that. e.g. if I could see after 5-10 minutes that TruffleRuby was, say, 25% faster than YJIT, then that extra engineering effort may be the right choice.

edit: Some folks throw traffic at nodes before exposing them to customers to ensure that their caches are warm, too. It's not necessarily something limited to JIT'd languages/runtimes.

MarkSweep · on Nov 15, 2023

If you are able to snapshot the state of the JIT, you can do the warming on a single node. The captured JIT state can then be deployed to other machines, saving them from spending time doing the warming. This increases the utilization of your machines.

While this approach sounds like a convoluted way to do ahead of time compilation, I’ve seen it done.

Twirrim · on Nov 17, 2023

IBM's JVM used to support it at one stage, not sure if it still does or if other JVMs have picked it up.

Freaky · on Nov 19, 2023

It appears to have a cool-sounding JIT server mode, allowing multiple clients to share a caching JIT compiler which does most of the heavy-lifting:

https://www.usenix.org/conference/atc22/presentation/khrabro...

https://developer.ibm.com/articles/jitserver-optimize-your-j...

It also has a "dynamic AOT compiler", so first-run stuff can be JITed and cached for future execution instead of it all starting out interpreted every time.

Twirrim · on Nov 20, 2023

The shared class cache was the thing I was thinking of, I think.

https://developer.ibm.com/tutorials/j-class-sharing-openj9/

Looks like maybe there is similar for OpenJDK & Oracle Java as of version 12 (I think it is?)

https://docs.oracle.com/en/java/javase/19/docs/specs/man/jav...

maxime_cb · on Nov 14, 2023

It is enough iterations for these VMs to warm up on the benchmarks we've looked at, but the warm-up time is still on the order of minutes on some benchmarks, which is impractical for many applications.

ericb · on Nov 15, 2023

I have a compiler that eventually produces 10x faster bytecode than the current JRuby. It takes a bit of warmup time.

The first command is "Let there be Light."

compumike · on Nov 14, 2023

Also for a practical tip on YJIT memory usage, note that there is a "--yjit-exec-mem-size" option, see https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md#co... for more details. (This command-line argument is mentioned in the paper https://dl.acm.org/doi/10.1145/3617651.3622982 but not in this blog post about the paper.)

At Heii On-Call https://heiioncall.com/ we use:

    ENV RUBY_YJIT_ENABLE=1                                                                                                             
    ENV RUBYOPT=--yjit-exec-mem-size=16

in our Dockerfile for our Rails processes.

JohnBooty · on Nov 14, 2023

Wow, that's interesting and it seems a little crazy? From the docs:

    When JIT code size (RubyVM::YJIT.runtime_stats[:code_region_size]) 
    reaches this value, YJIT triggers "code GC" that frees all JIT 
    code and starts recompiling everything. Compiling code takes 
    some time, so scheduling code GC too frequently slows down your 
    application. Increasing --yjit-exec-mem-size may speed up your 
    application if RubyVM::YJIT.runtime_stats[:code_gc_count] is 
    not 0 or 1.

https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md#co...

It just dumps all the JIT-compiled code? I'd expect to see some kind of heuristic or algorithm there... LFU or something.

The internals of a JIT are essentially black magic to me, and I know the people working on YJIT are super talented, so I am sure there is a good reason why they just dump everything instead of the least-frequently used stuff. Maybe the overhead of trying frecency outweighs the gains, maybe they just haven't implemented it yet, or maybe it's just a rarely-reached condition.

(I hope a YJIT team member sees this, I'm super curious now)

byroot · on Nov 14, 2023

As @xerxes901 said, there's some major challenge in freeing just one method code as it's not necessarily contiguous, and also it's of very variable size so it would generate lots of fragmentation. The allocate would need to be much more complex too to compensate.

But the team reasoning is that compilation isn't that slow, and while the code is freed, the statistics that drives the compilation are kept, so most of the work is already done.

Also the assumption behind code GC is that applications may experience a "phase change" e.g. the hottests code path at time t1, may not be so hot at time t2. If this is true, then it can be advantageous to recompile the hottests paths once in a while.

But that assumption is a major subject of debate between myself and the YJIT team, hence why I requested a `--yjit-disable-code-gc` flag for experimentation, and in 3.3 code GC will actually be disabled by default.

JohnBooty · on Nov 15, 2023

Huh! Thank you, that's helpful and informative. Thanks and for your contributions.

It definitely feels like the sort of feature for which there's no universal "best" default. A lot of applications might have "phase changes" and a lot of applications might not.

I would think that long-running apps (like Rails apps) would generally fall into the latter category.

xerxes901 · on Nov 14, 2023

I don't work on YJIT but I _think_ i know the (or maybe an) answer to this. The code for a JIT'd ruby method isn't contiguous in one location in memory. When a ruby method is first compiled, a straightline path through the method is emitted , and branches are emitted as stub code. When the stub is hit, the incremental compilation of that branch then happens. I believe this is called "lazy basic block versioning".

When the stub is hit the code that gets generated is somewhere _else_ in executable memory, not contiguous with the original bit of the method. Because these "lazy basic blocks" are actually quite small, the bookkeeping involved in "where is the code for this ruby method" would actually be an appreciable fraction of the code size itself. Plus you then have to do more bookkeeping to make sure the method you want to GC isn't referred to by the generated code in another method.

Since low memory usage is an important YJIT goal, I guess this tradeoff isn't worth it.

Maybe someone who knows this better will come along and correct me :)

JohnBooty · on Nov 15, 2023

Seems you hit the nail on the head. Thanks for the informative answer!

booleanbetrayal · on Nov 14, 2023

Any recollection on how you arrived at the --yjit-exec-mem-size value? We've been running YJIT in production for some time, but haven't looked into tuning this at all.

JohnBooty · on Nov 14, 2023

Not parent poster and do not have production YJIT experience. =)

My guess is that you would monitor `RubyVM::YJIT.runtime_stats[:code_region_size]` and/or `RubyVM::YJIT.runtime_stats[:code_gc_count]` so that you can get a feel for a reasonable value for your application, as well as know whether or not the "code GC" is running frequently.

https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md#pe...

compumike · on Nov 14, 2023

That’s exactly right. Our code_region_size levels off a bit under 8 MB, so we set the limit to 16. In practice we see code_gc_count stays at 0.

booleanbetrayal · on Nov 14, 2023

Will have to look into this. Thanks for the suggestion!

ksec · on Nov 15, 2023

Off Topic, but I thought Heii On-Call was done on Crystal? Or did I mixed it with something else?

titzer · on Nov 15, 2023

Nice work! JIT compilers are a multi-dimensional tradeoff space: memory consumption, startup time, and execution time. I've been experimenting with different ways of visualizing these three dimensions in a tradeoff space. I like the iteration time over time (showing warmup time) but that graph takes a lot of space and is necessarily per-benchmark (and only works with benchmarks with clearly repeated iterations). In my recent JIT paper I went for a scatter plot of two dimensions (setup time and code quality).

It's here if anyone's curious: https://arxiv.org/abs/2305.13241

pjmlp · on Nov 15, 2023

I always like collecting such papers, thanks.

the-alchemist · on Nov 15, 2023

Oh, benchmarking.

Great work, but maybe I'm missing something: I'm more interested in "performance per MB", i.e., per dollar USD, versus "how much memory does it use?"

I feel like that's not fair to systems that will trade performance and memory usage if given the chance. If you give Linux 10GB of memory and only use 2GB, it'll use the rest as a page cache, or memory cache, or anything. Unused RAM is wasted RAM.

Same for the JVM. Maybe it's deep in the weeds, but I don't see any setting of max memory (-Xmx, or -mx in the Java world). Same as any operation system, if you give it more memory, it'll use it.

Also, just like you'd want to generate and pre-warm your caches, you'd want to use the same for a JIT, a cache of sorts.

I'd also like to see these benchmarks using GraalVM native image, which tends to use a lot less memory and reach peak speeds much much faster (seconds, instead of minutes).

alberth · on Nov 15, 2023

Seems like Truffle Ruby is 50-100% faster in most tests.

But uses 5-10x more memory.

larodi · on Nov 15, 2023

If only Perl crowd focused on stuff like JIT improvement and cross compilations, rather than Raku madness, the lang could've been running strong in 2023. Hate to admit it, but rubyists got this all better.

lizmat · on Nov 15, 2023

I can assure you, nobody in the Perl crowd is doing any Raku.

larodi · on Nov 15, 2023

And never did.

xiaodai · on Nov 15, 2023

efficient jit is an oxymoron in the ruby world