How the CPython compiler works

xaedes · on Sept 24, 2020

I once was interested what exactly happens when executing python code, asking myself questions like "Why can't I make a fast for loop? Where are the fast integers? Can't I write fast for loop directly in bytecode, akin to asm?"

The python bytecode interpreter mainloop can be found in ceval.c. There is a big switch case for each python bytecode opcode starting at https://github.com/python/cpython/blob/master/Python/ceval.c...

One can see even the simplest operations call quite a lot of stuff. I gave up on fast pure cpython after reading ceval.c...

sigwinch28 · on Sept 24, 2020

I recall from my days hacking away on the BEAM (the de facto Erlang VM) that it loads code as directly threaded to improve performance.

The BEAM loads whole modules at a time, and as part of this process it means that instruction names in the bytecode are essentially replaced with the address of the code which interpret that instruction.

This means that once loading is complete there's no big switch statement to visit: the VM jumps directly to the part of its own code which interprets that instruction.

sigwinch28 · on Sept 24, 2020

Oh wow, my brain was cooked when I wrote that.

The BEAM loads bytecode into memory and then goes through the loaded code to directly thread it. It's not an inherent consequence of code loading in general.

https://en.wikipedia.org/wiki/Threaded_code#Direct_threading

kstrauser · on Sept 24, 2020

Officially, as in I can't point you to this in writing off the top of my head but I've heard Guido say it numerous times, part of the design goal of CPython is to be as understandable and approachable as possible. That code you linked doesn't include stuff like compiling opcodes directly to assembly and then executing them natively, but it is very straightforward. I mean, I'm not a C guru by any stretch but I can look at that and tell immediately how I could go about adding my own opcodes to the VM.

Jasper_ · on Sept 24, 2020

But it's not like every tiny optimization will turn the runtime into unreadable soup. But that's been the defense every time someone proposes adding some extremely basic optimizations [0]; these things aren't difficult and any VM engineer will understand them. And even if not, it's not that that hard to learn about basic method caches. Write a few blog posts, or a giant block comment. Stuff's not arcane magic, it's the foundation of any PL implementation these days.

There are also a lot of programming language who don't bother touching Python because the team doesn't seem to care about runtime performance as their responsibility. There's a lot of excellent talent out there, and they're mostly being denied and shot down.

[0] https://lwn.net/Articles/754163/

kstrauser · on Sept 24, 2020

I do agree with you on that. I think there's a happy path where you keep the high-level architecture clean and approachable, and move the tricky bits out to the edge where only a few people will have to know and care about them. That seems to work out pretty well for PostgreSQL, which on the surface is purely black magic but by all accounts is quite nice under the hood.

throwaway894345 · on Sept 24, 2020

Perhaps, but it's unfortunate that so many of Python's design objectives are (in practice if not in theory) so limiting with respect to performance, tooling, package management, maintainability, and other important dimensions of professional software development.

Of course, it doesn't mean there aren't companies who can't be successful in spite of these limitations--if you make enough money you can afford to throw money at these problems and make them better (you don't need efficient compute or parallelism if you can afford to pay a team to set up and operate a Spark cluster for you not to mention the cost of the additional VMs and the additional effort to write spark instead of a native language).

kstrauser · on Sept 24, 2020

I get what you’re saying, but in practice I’ve been using Python for a while, since 1.x, and I haven’t run into many of those potential issues. Some people have, obviously, and I wouldn’t discount them. I think the differentiator is that while Python isn’t great when you care about every scrap of performance, there are a huge number of problems where that isn’t a requirement. For instance, my employer’s web stack is built around Python and profilers show that we spend less than 5% of a request’s time running Python code. The other 95% is waiting for database queries and stuff like that. For us, Python is an excellent fit because it’s really good at the other parts of API server development and we’re nowhere near CPU bound. And turns out, a lot of websites have realized that they’re in the same boat.

I wouldn’t recommend pure Python (that is, without Numpy or Scipy) for machine learning, but it’s great in its own niches which are actually pretty enormous.

throwaway894345 · on Sept 24, 2020

> I think the differentiator is that while Python isn’t great when you care about every scrap of performance

This is an understatement. Python is the slowest mainstream language. JS is about an order of magnitude faster. Go, C#, Java, etc are about 2-3 orders of magnitude faster. C, C++, and Rust are nearly 4 orders of magnitude faster. If you might ever care about performance at all, Python is limiting.

No doubt that if all you're ever doing is a little glue code that dispatches to a database (or to ML models or whatever), then Python is fine. The problems come about when you try to do more interesting things, especially involving a complex Python object model that doesn't map neatly to Postgres or Pandas. I've never worked on a Python project where this wasn't the case, and where the organization wasn't doing some desperate optimization (e.g., standing up a Spark cluster) to stave off a ridiculously gracious timeout limit (e.g., 60s for browser timeouts). Meanwhile, had the same server been written in Go, the naive implementation would have subsecond response times and a little bit of trivial optimization (the kinds of optimizations that aren't available in Python for a myriad of reasons) could get it down to double digit milliseconds.

Since we rarely know all of the types of bottlenecks a project will encounter in its years of service, Python has become something of a last resort for me (especially given all of the great options that exist nowadays). And none of this is to say anything about the tooling, distribution, or package management problems that plague Python either.

kstrauser · on Sept 24, 2020

Sure, and we agree: you wouldn’t use pure Python for something where every scrap of performance matters. I’m just saying, I’ve used Python extensively for a couple of decades now, from web services, to orchestrating devops tooling, to being the management interface that controlled a high-speed/low-latency service, to driving pipelines that moved around terabytes of data, doing number crushing with Numpy. In all of those cases, Python was a delight and was not the bottleneck. So while I don’t really disagree with you, I think you’re giving it the short shrift to say it’s fine for just “a little glue code”. It’s amazing how many practical problems in software engineer come down to gluing systems together in new and cool ways.

throwaway894345 · on Sept 24, 2020

> It’s amazing how many practical problems in software engineer come down to gluing systems together in new and cool ways.

Right, but the difference is whether you will ever have to do more than "just a little glue" and making sure you can easily delegate to a different tool to handle those cases. In my experience, there are precious few cases where you never have to do more than just a little glue. Basically any application in which Python owns the data model (e.g., the data model is a non-trivial tree of Python class objects), this is incompatible.

> Sure, and we agree: you wouldn’t use pure Python for something where every scrap of performance matters.

I don't want to be disagreeable, but this is the sort of thing I would say about Go or Java--you shouldn't use it for cases where every scrap of performance matters. But there's an enormous gap between the things Python can do and the things that Go and Java can do. Moreover, if you do run into problems with Go or Java, there's a lot of optimization available to you (at least not in the general case, although there are niches where you can gain a lot of performance with pandas or similar).

Moreover, there's just not a good reason to choose it these days. There are better options without the traps.

hn_acc_2 · on Sept 24, 2020

Not sure about the first part...How is Python's approachable design "so limiting" to all those dimensions?

Nobody writes performance-critical code in pure python.

Not sure how "tooling" is bad, what would you say is limited there?

Package management, again, what package management problems are unique to Python? Many people say this but it seems the problems they bring up are not unique to pip or the python ecosystem, same problems are found with rubygems, npm, Maven, etc...

Maintainability is a responsibility of developers and not a programming language, and unmaintainable code can easily be written with any language. However I'd argue Python should score positive points for maintainability; one of the languages I feel most comfortable picking up old code from someone else and groking it easily.

throwaway894345 · on Sept 24, 2020

> How is Python's approachable design "so limiting" to all those dimensions?

One of Python's "design goals" is to integrate well with C by exposing every detail of the Python interpreter to C extensions. Since Python's performance is so abysmal, the ecosystem has come to lean heavily on C-extensions, and because the ecosystem leans so heavily on C-extensions, very few changes can be made to the CPython interpreter without breaking compatibility with the ecosystem--if we can't change the interpreter, we can't optimize it, and we're damned to a world in which Python is slow. Pypy is doing yeoman's work by building a new JITed interpreter that makes pure-Python code a lot faster, but it still has a lot of compatibility problems with important C packages. For instance, you can't talk to a Postgres database without using an obscure package that hasn't seen a commit in years.

> Nobody writes performance-critical code in pure python.

Not successfully, no. But if you buy the marketing, you'll be led to believe that Python has an answer to every performance problem. "Go ahead and start your project in Python. Don't worry about performance--if your program is too slow, you can just $X" where "X" is one of "rewrite the slow bits in C" or "use pandas" or "use multiprocessing". No one tells you that those options really fall over for a huge swath of real-world workloads, for the same reason: in many/most cases, the cost of de/serializing Python objects is greater than the savings from C or parallelism. It's only economical for those precious few cases where you can do a lot of consecutive work outside of Python.

> Not sure how "tooling" is bad, what would you say is limited there?

Precious little static analysis is available for Python, documentation generation options are generally bad (partly because they can't just generate the type information per the previous static analysis point, but also because they make a whole host of bad decisions, like putting everything on the same page and making you scroll around to figure out which class's __init__ method you're looking at presently) and you're still on the hook for operating the CI tools that generate and publish the documentation packages, tools tend to be written in Python and thus are really slow (e.g., formatters, package managers, etc), no static analysis means no dead code elimination and thus an enormous installed footprint well into the hundreds of megabytes (this problem is exacerbated by the weight of OOP in the Python ecosystem, which means everyone who wants a Book data structure depends on the whole universe of things that people do with Books), static distribution is still a joke--you end up bundling 250mb zip files and you still need to have the right version of Python and the right .so/.dlls installed on the target system, etc.

> Package management, again, what package management problems are unique to Python?

In order to figure out what a Python package's dependency tree looks like, you have to download the whole thing. This makes it difficult to have performant package managers (or rather those that are performant are unsafe because they splat things into the python environment and punt on making sure they don't have multiple versions of some transitive dependency). Python is also held hostage by an ecosystem of C-extensions, so its packages have to support the whole universe of terrible C package management decisions. Also, there is still no production-ready Python package manager that supports reproducible builds (i.e., respects lockfiles). I don't know about Ruby, but NPM, Go, and Rust don't have these issues and I'm pretty sure Java, C#, and Ruby don't either.

> Maintainability is a responsibility of developers and not a programming language, and unmaintainable code can easily be written with any language

It's a lot easier in a language without any rails to guide developers toward good development practices (by making it disproportionately harder to write hacky code), by which I mostly mean a static type system. Mypy is gaining traction, but it's still not in the same ballpark as other languages' type systems and it's moving at a snail's pace (no doubt other languages benefit from investment or else static typing was built into the original design, but those excuses don't make my team's code more maintainable). Not only is it rails to guide them on the right path, but it's also things like "type documentation is always correct" and "refactoring is easy so people actually do it".

ynik · on Sept 24, 2020

> if your program is too slow, you can just $X

Also, some of those $X are mutually exclusive.

If you write a lot of the performance-sensitive code in C; you're going to have complex C data structures. Now if you want parallelism, well too bad: you can't use multiprocessing, because you can't easily share your C data structure across multiple processes.

To actually use multiple cores without getting killed by the GIL, you end up having to replace a lot of Python code with C -- not just the most performance critical portion.

Copying a multi-GB data structure for each CPU core would take way too much memory, so we tried doing stuff with shared memory, but it's complicated. We spent months of developer time on this and still can't really scale beyond two cores, for something that would be embarrassingly parallel in any other programming language :(

The "mixing C and Python" solution is a trap. If performance might be important at any point in the future and you use Python; better plan for a complete rewrite in a different language.

d0mine · on Sept 24, 2020

I had success using Cython for performance (numerical simulations mostly). nogil is your friend for multthreading.

d0mine · on Sept 24, 2020

> a lot of consecutive work It might depend on your specific domain but most performance problems I've encountered are in this class i.e., I have the opposite experience: cases when performance problems can't be solved because Python is used are rare.

--- On reproducible builds: given your examples from other languages, Python has such tools too e.g., pip-tools package.

throwaway894345 · on Sept 24, 2020

That’s what I’ve heard but I’ve been burned by a lot of tools that have promised reproducible builds for Python. They always fail critically for one reason or another. If piptools is the holy grail, then great, but I’ll wait until it’s ubiquitous in the ecosystem.

devxpy · on Sept 24, 2020

You should look at numba. In some special cases, you can actually write for loops in pure python that are faster than C loops.

JesseMeyer · on Sept 24, 2020

Are you referring to Numba's ability to offload certain loops into a GPU kernel?

Otherwise, Clang has access to the same optimizations that Numba has as they both share LLVM as their optimizing compiler. Beyond that, I think a fairer comparison is C w/ OpenMP vs Numba for parallel processing if syntactic brevity is the metric.

basic_bgnr · on Sept 24, 2020

Can you point me toward something that's alternative to switch statement for jumping opcodes. How can we speed things up?

bjoli · on Sept 24, 2020

The other comment mentions computed Goto's. I have used those with big success when writing streaming parsers that work per-char (like a typical JSON or csv-parser). I can't remember exactly, but I think switching from a switch statement to computed gotos cut something like 10% from execution time due to better branch predictions in the CPU. That is pretty huge, considering it was a pretty small change to the char dispatch.

status_quo69 · on Sept 24, 2020

Do you mean a switch statement in pure-python that avoids certain opcodes, or a way to avoid executing opcodes in the main interpreter switch statement?

ir123 · on Sept 24, 2020

Noob question probably, but wouldn't an Enum Map or something similar be more efficient?

anon946 · on Sept 24, 2020

For more than a few cases in the switch:

If the case values are dense (such as just the integers from 0 to 9), the compiler will turn it into a jump table.

https://godbolt.org/z/1Khahr

If too sparse for jump table, then typically will be turned into a binary search tree.

hn_acc_2 · on Sept 24, 2020

It depends on your compiler.

But as a general rule, the programmer will have a hard time "outsmarting" the compiler by using different objects or abstractions.

Compiler authors have spent many hundreds of combined hours making C code run as fast as possible, and optimizing the switch statement is something they've most certainly done to death.

I.e. if your control flow is switch-case-like, use a switch-case.

Jasper_ · on Sept 24, 2020

Not always. Computed gotos can be faster, because it can relax some restrictions that the switch statement can't amke. Most VM implementations, including CPython, use computed gotos for a not insignificant performance improvement.

https://github.com/python/cpython/blob/master/Python/ceval.c...

miguendes · on Sept 24, 2020

Great stuff. I'm very interested in knowing more about python's internals. I recently brought Anthony Shawn's Cpython internals book but haven't read it yet. [1]

[1] https://realpython.com/products/cpython-internals-book/

Rochus · on Sept 24, 2020

Didn't know this one, thanks. Have the ony by Oby Ike-Nobu. Interesting how many "phython internals" books there are now.

r4victor · on Sept 24, 2020

Only these two, as far as I know.

r4victor · on Sept 24, 2020

Hi! This is the second part of my new Python behind the scenes series that covers the compilation of a Python program. I'd be glad to hear your thoughts on this part and the series in general.

tasubotadas · on Sept 24, 2020

Could you tell us why CPython still doesn't have JIT? What's so hard about it (on Python)?

r4victor · on Sept 24, 2020

I'm by no means an expert on JIT compilation, but I can give a less technical answer. I think the primary reason is that most of the developers (both CPython core developers and Python developers) do not consider Python as a tool for writing perfomance critical software. So, any JIT initiative, which requires a lot of resources, doesn't get enough of them. CPython allows to write C extension modules quite easily, so if you need to write some perfomance critical piece of code , you just switch to C and call a C function from Python.

Unladen Swallow was a version of CPython that added JIT support. PEP-3146 describes an unsuccessful attempt to merge it into CPython [1]. Here's a postmortem [2]

Starting with version 3.6, CPython allows to set a custom frame evaluation function on the interpreter and to store JIT compiled code in a code object. This was done to support JIT in a form of a separate module and is described in PEP 523 [3].

This is very interesting topic. I should study it in detail and probably write a post. I hope I'll be able to give a more thorough answer someday.

[1] https://www.python.org/dev/peps/pep-3146

[2] http://qinsb.blogspot.com/2011/03/unladen-swallow-retrospect...

[3] https://www.python.org/dev/peps/pep-0523/

tasubotadas · on Sept 24, 2020

Thanks. That's the answer I was looking for.

CJefferson · on Sept 24, 2020

The biggest problem is that most nontrivial C programs use extensions written in C, and those extensions are written with fairly deep knowledge of CPython, so your JIT has to build most of the same structures as CPython anyway, in case an extension needs them.

chrisseaton · on Sept 24, 2020

In the Ruby JIT I work on what we did was write a C interpreter, and run the C extensions inside the same VM as the Ruby code. That allows you to provide the illusion of using the same structures, but actually allowing you to implement them however you want for better performance.

tasubotadas · on Sept 24, 2020

IMO, that's a very interesting and smart approach :)

tasubotadas · on Sept 24, 2020

Thanks. That's the answer I was looking for.

Rochus · on Sept 24, 2020

There is already https://www.pypy.org/ which is about four times faster than CPython (which is not that much actually compared to e.g. V8, see e.g. https://benchmarksgame-team.pages.debian.net/benchmarksgame/...)

glandium · on Sept 24, 2020

Caveat: pypy is not _consistently_ faster than CPython. There are cases where it is slower, sometimes even by a large margin.

Rochus · on Sept 24, 2020

Of course. It's clearly visible e.g. on https://speed.pypy.org/ that it doesn't show equal performance on all benchmarks. The factor four corresponds to the geometric mean of all benchmarks. And as you say there are benchmarks where it even performs slower than CPython.

devxpy · on Sept 24, 2020

While cpython doesn't have a jit, there does exist an amazing library called numba which will run a subset of python code through a jit built using llvm and make it crazy fast (and multi-core)

Intel has has a library that lets you use pandas in numba code, so it's I feel pretty confident using it in production without breaking cpython compatibility

creata · on Sept 24, 2020

Why should CPython have a JIT?

A good JIT is hard to write, and harder to retrofit. It usually increases startup times, attack surface, and code complexity. Besides, Python already has a handful of JIT-based implementations, and I don't see how adding the reference implementation (which out to be simple!) to that list really improves anything.

tasubotadas · on Sept 24, 2020

You wrote a lot but didn't contribute to answering the question.

creata · on Sept 24, 2020

1. That's not a lot.

2. It's a forum, not everything has to be a direct response.

3. Either way, this was a direct response: CPython does not have a JIT because there's little reason for it to have a JIT, and many good reasons why it shouldn't.

4. If you want a JIT'd Python so badly, why not use PyPy?

xioxox · on Sept 24, 2020

CPython had one as an external module, but unfortunately it's no longer maintained as PyPy was supposed to supersede it: http://psyco.sourceforge.net/

It actually worked quite well, though was pretty memory hungry.

Ennea · on Sept 24, 2020

This could be an interesting read, but so far I've only had time to skim it. What I did notice, however, is that some of your images are too wide and get cut off :)

r4victor · on Sept 24, 2020

I know the site is not well adapted for mobiles. I currently use default Pelican theme, which is not responsive, and have plans to change it. Nevertheless, I have no problems with displaying images on any device (desktop/ipad/iphone), so could you elaborate on this? What device do you use? You can email me: [email protected]

Ennea · on Sept 24, 2020

Desktop, actually. Full HD. The width of the content is only 800px, but some of your images (like diagram2.png) is 1044px wide. And instead of the image being scaled, it is just being cut off.

Edit: Oh yea, there we go. You're using CSS zoom[0], which is non-standard.

[0] https://developer.mozilla.org/en-US/docs/Web/CSS/zoom

lordgrenville · on Sept 24, 2020

Same for me. Firefox desktop on macOS https://pasteboard.co/Jsysfju.png

r4victor · on Sept 24, 2020

Thanks for reporting! Fixed this. I used a non-standard `zoom` attribute, which works in Chrome and Safari. It's a good habit to check how the website looks in all major browsers.

Rochus · on Sept 24, 2020

Interesting, looking forward reading it. I recently bought "Inside The Python Virtual Machine" (see https://leanpub.com/insidethepythonvirtualmachine) which is also recommendable.

r4victor · on Sept 24, 2020

Second this book. I've also made a list of good resources on CPython internals: https://tenthousandmeters.com/materials/python-behind-the-sc...