When the cpython interpreter imports a module, it has to
compile your code into an internal representation called bytecode.
That bytecode is then executed by the cpython virtual machine.
To avoid repeating that compilation step every time you execute
the same code, cpython keeps a cached version of
the compiled code. That is what
.pyc files are: cached bytecode.
I intend to write a programming language whose compile
target is CPython bytecode (just for the sake of learning
about compilers and the cpython virtual machine). The goal is to
be able to write code in this language that can interoperate
with regular python. Code written in my language will be compiled to a
.pyc file, which I will have cpython believe is the cached
bytecode of a certain fictitious python source code
file. When python code imports that fictitious file,
the interpreter will find my generated bytecode and load and execute it.
The first step is then being able to write valid .pyc files.
The internal structure of
.pyc files is not officially documented
because it is not part of the python language, but an implementation
detail of the cpython interpreter. After some digging, I found
a post which explains what the structure of python2
.pyc files is.
As I found later,
after reading cpython’s source code,
the details are slightly different for python3. This is the structure of a
python3 .pyc file (the structure is probably the same for
all python3 versions but I haven’t checked):
- Magic number: 4 bytes indicating what version of cpython this .pyc file was made for.
- Modification timestamp: original source code file modification timestamp.
- Source size: size in bytes of the original source code file.
- Code: marshalled module code.
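As a minimal sketch of this layout, assuming the python3.5 header described above (three 4-byte fields before the marshalled code), the fields can be split with the standard struct module. The name parse_pyc_header is hypothetical, and the sample bytes are the first 12 bytes of the python3.5 dump shown further down. Note that python3.7 and later add a 4-byte flags field after the magic number (PEP 552), so this sketch does not apply to those versions as-is.

```python
import struct

# Assumed python3.5 layout: magic (4 bytes), mtime (uint32, little endian),
# source size (uint32, little endian). The marshalled code follows.
HEADER = struct.Struct("<4sII")

def parse_pyc_header(data):
    """Split the first 12 bytes of a .pyc file into its three header fields."""
    magic, mtime, source_size = HEADER.unpack_from(data, 0)
    return magic, mtime, source_size

# First 12 bytes of the python3.5 hex dump from this post.
sample = bytes.fromhex("160d0d0a" "daf0345b" "00000000")
magic, mtime, source_size = parse_pyc_header(sample)
print(magic.hex(), mtime, source_size)
```

Everything after those 12 bytes is the marshalled module code.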
I’ll briefly comment on each of these parts.
This magic number is different for each cpython version. I am sure
you can look that magic number up in the cpython source, but I found that
the easiest way to find it was having python3.5 generate a
.pyc file and then inspecting what the first 4 bytes were.
You can do this as follows:
- Create a blank mod.py file.
- Open the python repl (whichever version you want to get the magic number from) and import the module:
>>> import mod
- Python will create a __pycache__/mod.cpython-35.pyc file that you can inspect with any hex viewer you want.
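The same steps can also be scripted. This is only a sketch, assuming the standard py_compile, tempfile and importlib.util modules: it compiles a blank mod.py in a temporary directory and reads the first 4 bytes of the resulting .pyc file. The helper name magic_from_fresh_pyc is made up for this example. The result is compared with importlib.util.MAGIC_NUMBER, which exposes the magic number of the running interpreter directly.

```python
import importlib.util
import os
import py_compile
import tempfile

def magic_from_fresh_pyc():
    """Compile a blank module and return the first 4 bytes of its .pyc file."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "mod.py")
        open(source, "w").close()  # blank source file, as in the steps above
        target = os.path.join(tmp, "mod.pyc")
        pyc_path = py_compile.compile(source, cfile=target)
        with open(pyc_path, "rb") as f:
            return f.read(4)

magic = magic_from_fresh_pyc()
# The value depends on which python version runs this script.
print(magic.hex(), magic == importlib.util.MAGIC_NUMBER)
```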
This is what I get for python3.5:
0000000 16 0d 0d 0a da f0 34 5b 00 00 00 00 e3 00 00 00
0000020 00 00 00 00 00 00 00 00 00 01 00 00 00 40 00 00
0000040 00 73 04 00 00 00 64 00 00 53 29 01 4e a9 00 72
0000060 01 00 00 00 72 01 00 00 00 72 01 00 00 00 fa 1e
0000100 2f 6d 6e 74 2f 64 61 74 61 2f 68 61 63 6b 73 2f
0000120 70 72 75 65 62 61 73 2f 6d 6f 64 2e 70 79 da 08
0000140 3c 6d 6f 64 75 6c 65 3e 01 00 00 00 73 00 00 00
0000160 00
And the magic number (whose size is 4 bytes, as we said) is 16 0d 0d 0a.
The next part is the modification timestamp of
the original source file. This
field is a little-endian, four-byte unix timestamp. In the previous
hex dump, the timestamp is da f0 34 5b, that is 0x5b34f0da = 1530196186, or
06/28/2018 @ 2:29pm (UTC).
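To make the decoding concrete, here is a small sketch that turns bytes 4 to 7 of the dump into a date, using only int.from_bytes and the datetime module. The variable names are mine.

```python
import datetime

raw = bytes.fromhex("daf0345b")            # bytes 4..7 of the hex dump
timestamp = int.from_bytes(raw, "little")  # little endian: 0x5b34f0da
when = datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)
print(timestamp, when.isoformat())
# prints: 1530196186 2018-06-28T14:29:46+00:00
```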
When importing a module, cpython will check whether there is a
.pyc file for it. If there is one, it will check whether the
modification timestamp of the source file is equal to the one stored
in the cached version. If it is not, it will compile the module again
and create a new
.pyc file. If it is, python will just use the
cached version. This way python knows whether you made any changes
to the source code since the cached version was created.
This is, again, a little-endian 4-byte integer. The source size
in our example is 00 00 00 00, that is, 0 bytes
(mod.py was a blank file).
This is another method python uses to determine whether the cached
bytecode corresponds to the current version of the source code file.
If the size of the source file and the value stored in the
.pyc file do not match, python will recompile the module.
Interestingly enough, python2
.pyc files did not contain this field.
I’m not entirely sure why they decided to add it.
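The two checks (timestamp and source size) can be sketched together like this. This is not cpython's actual code, just an illustration under the assumption of the 12-byte python3.5 header, and cache_is_fresh is a hypothetical name. The usage part fabricates two headers for a temporary source file: one that matches it and one with the wrong size.

```python
import os
import struct
import tempfile

def cache_is_fresh(pyc_header, source_path):
    """Return True if the header's mtime and size match the source file.

    Assumes the 12-byte python3.5 header: magic, mtime, source size.
    """
    _magic, cached_mtime, cached_size = struct.unpack_from("<4sII", pyc_header, 0)
    stat = os.stat(source_path)
    # Both fields are only 32 bits wide, so compare against the low 32 bits.
    return (cached_mtime == (int(stat.st_mtime) & 0xFFFFFFFF)
            and cached_size == (stat.st_size & 0xFFFFFFFF))

with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "mod.py")
    with open(source_path, "w") as f:
        f.write("x = 1\n")
    stat = os.stat(source_path)
    mtime = int(stat.st_mtime) & 0xFFFFFFFF
    matching = struct.pack("<4sII", b"\x16\r\r\n", mtime, stat.st_size)
    wrong_size = struct.pack("<4sII", b"\x16\r\r\n", mtime, stat.st_size + 1)
    fresh = cache_is_fresh(matching, source_path)
    stale = cache_is_fresh(wrong_size, source_path)

print(fresh, stale)
```

When the check fails, the cached bytecode is discarded and the module is compiled again.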
This is the module code object, marshalled. There are multiple resources
on the internet explaining what python code objects are and some of them
are quite good. This talk
is particularly entertaining.
marshal is the internal format cpython
uses for binary serialization. The format itself is not documented,
as was the case with the
.pyc structure, because it is a
cpython implementation detail.
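Even though the format is undocumented, the standard marshal module can read and write it. As a rough sketch, this compiles a one-line module, marshals the resulting code object to bytes and loads it back. The source string and file name are made up for the example, and the bytes produced are only guaranteed to be readable by the same python version that wrote them.

```python
import marshal

source = "answer = 6 * 7\n"
code = compile(source, "<example>", "exec")  # module code object

blob = marshal.dumps(code)      # bytes in the running interpreter's marshal format
restored = marshal.loads(blob)  # back to a code object

namespace = {}
exec(restored, namespace)       # run the restored module code
print(type(blob).__name__, namespace["answer"])
```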
Python 2’s marshal format is explained in this post. It has mostly stayed the same for python3. Code object marshalling will be further explored in a future post.
Edit: the next post is already here: Targeting The Python Virtual Machine II: CPython Marshalling Format.