When the CPython interpreter imports a module, it first compiles the source code into an internal representation called bytecode, which the CPython virtual machine then executes. To avoid repeating that compilation step every time the code runs, CPython keeps a cached copy of the compiled code. That is what .pyc files are: cached bytecode.

I intend to write a programming language whose compile target is CPython bytecode (just for the sake of learning about compilers and the CPython virtual machine). The goal is to write code in this language that can interoperate with regular Python. Code written in my language will be compiled to a .pyc file, which I will have CPython believe is the cached bytecode of a fictitious Python source file. When Python code imports that fictitious file, the interpreter will find my generated bytecode, then load and execute it instead.

The first step, then, is being able to write valid .pyc files. The internal structure of .pyc files is not officially documented because it is not part of the Python language but an implementation detail of the CPython interpreter. After some digging, I found this post, which explains the structure of Python 2 .pyc files. As I found later, after reading CPython's source code, the details are slightly different for Python 3. This is the structure of a Python 3.5 .pyc file (the layout is not identical across Python 3 versions: the source size field first appeared in Python 3.3, and Python 3.7 added a 4-byte flags field right after the magic number):

  1. Magic number: 4 bytes indicating what version of cpython this .pyc file was made for.
  2. Modification timestamp: original source code file modification timestamp.
  3. Source size: size in bytes of the original source code file.
  4. Code: marshalled module code.
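As a rough sketch of this layout, the header can be split with the standard struct module. The function name parse_pyc35_header is made up for illustration, and the twelve bytes used here are the first twelve bytes of the Python 3.5 hex dump shown further down; the sketch assumes the 3.5 layout only.

```python
import struct

# First 12 bytes of the Python 3.5 .pyc from the hex dump in this post.
HEADER = bytes.fromhex("160d0d0a" "daf0345b" "00000000")

def parse_pyc35_header(data):
    """Split a Python 3.5-style .pyc header into its three fields.

    Hypothetical helper: assumes the 12-byte layout described above.
    """
    magic = data[:4]
    # "<II": two little-endian unsigned 32-bit integers.
    mtime, source_size = struct.unpack("<II", data[4:12])
    return magic, mtime, source_size

magic, mtime, source_size = parse_pyc35_header(HEADER)
print(magic, mtime, source_size)
```

Everything after those twelve bytes is the marshalled code object.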

I’ll briefly comment on each of these parts.

Magic number

This magic number is different for each CPython version. I am sure you can look it up in the CPython source, but I found that the easiest way to find it was to have Python 3.5 generate a .pyc file and then inspect its first 4 bytes.

You can do this as follows:

  1. Create a blank mod.py file.
  2. Open the Python REPL (whichever version you want to get the magic number from).
  3. Run >>> import mod
  4. Python will create a __pycache__/mod.cpython-35.pyc file that you can inspect with any hex viewer you want.
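The same steps can also be scripted. The sketch below is one possible way to do it, not the only one: it compiles a blank mod.py in a temporary directory with py_compile and reads back the first 4 bytes. The function name magic_from_compiled_file is made up for illustration. It reports the magic number of whichever interpreter runs it, so it needs to be run under Python 3.5 to get that version's value.

```python
import importlib.util
import os
import py_compile
import tempfile

def magic_from_compiled_file():
    """Compile an empty module and return the first 4 bytes of its .pyc.

    Hypothetical helper mirroring the manual steps above.
    """
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "mod.py")
        open(source, "w").close()  # blank mod.py
        pyc_path = py_compile.compile(
            source, cfile=os.path.join(tmp, "mod.pyc"), doraise=True
        )
        with open(pyc_path, "rb") as f:
            return f.read(4)

magic = magic_from_compiled_file()
print(magic.hex())
# The standard library exposes the same value directly.
print(importlib.util.MAGIC_NUMBER.hex())
```

If all you need is the value, importlib.util.MAGIC_NUMBER is the shorter route.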

This is what I get for python3.5:

0000000 16 0d 0d 0a da f0 34 5b 00 00 00 00 e3 00 00 00
0000020 00 00 00 00 00 00 00 00 00 01 00 00 00 40 00 00
0000040 00 73 04 00 00 00 64 00 00 53 29 01 4e a9 00 72
0000060 01 00 00 00 72 01 00 00 00 72 01 00 00 00 fa 1e
0000100 2f 6d 6e 74 2f 64 61 74 61 2f 68 61 63 6b 73 2f
0000120 70 72 75 65 62 61 73 2f 6d 6f 64 2e 70 79 da 08
0000140 3c 6d 6f 64 75 6c 65 3e 01 00 00 00 73 00 00 00
0000160 00

So the magic number (4 bytes, as we said) is 0x160d0d0a, that is, the bytes 16 0d 0d 0a read in file order.

Modification timestamp

The next part is the modification timestamp of the original source file. This field is a little-endian, four-byte Unix timestamp. In the hex dump above, the bytes da f0 34 5b give the timestamp 0x5b34f0da, which corresponds to 06/28/2018 @ 2:29pm (UTC).
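To double-check that conversion, here is a small sketch that decodes the four timestamp bytes from the hex dump using struct and datetime:

```python
import struct
from datetime import datetime, timezone

raw = bytes.fromhex("daf0345b")          # bytes 4..7 of the hex dump
(timestamp,) = struct.unpack("<I", raw)  # little-endian unsigned 32-bit integer
moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)

print(hex(timestamp))      # 0x5b34f0da
print(moment.isoformat())  # 2018-06-28T14:29:46+00:00
```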

When importing a module, CPython checks whether there is a cached .pyc file for it. If there is one, it checks whether the source file's modification timestamp is equal to the one stored in the cached version. If it is not, CPython compiles the module again and creates a new .pyc file. If it is, CPython just uses the cached version. This way Python knows whether you made any changes to the source code since the cached version was created.

Source size

This is, again, a little-endian 4-byte integer. The source size in our example is 0x00000000 (mod.py was a blank file).

This is another method python uses to determine whether the cached bytecode corresponds to the current version of the source code file. If the size of the source file and the quantity found in the .pyc file do not match, python will recompile the module.
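The two checks (timestamp and size) can be sketched roughly as below. This is an illustration under the 3.5 header layout, not CPython's actual implementation; pyc_matches_source and header_for are made-up names, and the demo file lives in a temporary directory.

```python
import os
import struct
import tempfile

def pyc_matches_source(pyc_header, source_path):
    """Return True if a 3.5-style header's timestamp and size match the source file."""
    mtime, size = struct.unpack("<II", pyc_header[4:12])
    stat = os.stat(source_path)
    # Both fields are 32 bits wide, so compare against the truncated values.
    return (mtime == (int(stat.st_mtime) & 0xFFFFFFFF)
            and size == (stat.st_size & 0xFFFFFFFF))

def header_for(source_path, magic=b"\x16\r\r\n"):
    """Build a 3.5-style header for source_path (hypothetical helper)."""
    stat = os.stat(source_path)
    return magic + struct.pack("<II", int(stat.st_mtime) & 0xFFFFFFFF,
                               stat.st_size & 0xFFFFFFFF)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mod.py")
    with open(path, "w") as f:
        f.write("x = 1\n")
    header = header_for(path)
    fresh = pyc_matches_source(header, path)      # header built from the current file
    with open(path, "a") as f:
        f.write("y = 2\n")                        # edit the source: its size changes
    stale = not pyc_matches_source(header, path)  # the old header no longer matches

print(fresh, stale)
```

In the demo, the edit is caught by the size comparison even if it happens within the same second as the original write, when the timestamp alone would not change.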

Interestingly enough, Python 2 .pyc files did not contain this field. I'm not entirely sure why they decided to add it.

Code

This is the module's code object, marshalled. There are multiple resources on the internet explaining what Python code objects are, and some of them are quite good. This talk is particularly entertaining. marshal is the internal format CPython uses for binary serialization. Like the .pyc structure, the format itself is not documented because it is a CPython implementation detail.
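Although the byte format is undocumented, the standard marshal module reads and writes it. As a minimal sketch, the code below compiles a one-line module, marshals its code object, then loads and executes it again. The source string and the file name "<fictitious.py>" are made up for illustration, and the marshalled bytes are only meaningful to the interpreter version that produced them.

```python
import marshal

source = "answer = 6 * 7\n"
code = compile(source, "<fictitious.py>", "exec")  # the module's code object
payload = marshal.dumps(code)       # these bytes are what follows the .pyc header
restored = marshal.loads(payload)   # back to a code object

namespace = {}
exec(restored, namespace)           # run the module body
print(namespace["answer"])          # 42
```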

Python 2's marshal format is explained in this post. It has mostly stayed the same for Python 3. Code object marshalling will be further explored in a future post.

Edit: the next post is already here: Targeting The Python Virtual Machine II: CPython Marshalling Format.