The Implementation of the Python Programming Language


Clinton Jeffery and Richard Saunders




































(C) Copyright 2004, Clinton Jeffery and Richard Saunders. This prepublication material is not to be copied or linked without permission.



































Contents

  1. Chapter 1: Introduction
  2. Part I: the interpreter front-end
  3. Part II: the virtual machine
  4. Part III: the runtime system
  5. Part IV: instrumentation






































Preface

What, another book on Python? There are lots of books on how to write programs using Python, but this book is not one of them! Instead, this book is a guide to how the Python language is implemented, as of Python version 2.3.4. It is intended for Python language maintainers and extenders, for implementors of other similar languages, and for advanced students interested in programming language compilers and interpreters.

The book originated when the authors began a project to extend Python's execution monitoring capabilities, and discovered the profound lack of documentation on the implementation. We wrote this book as our own study guide, and also to support the study of Python's implementation by our own students.

This book's authorial biases are those of outsiders: we are expert programmers and language implementors who can write Python programs, but not Python experts, nor part of the Python cult. We will be as fair and objective as we can, striving for accuracy and higher quality than some of the Python-related documentation we have seen.
















Chapter 1: Introduction

Python Origins and Design Influences

Python's history has no doubt been told better elsewhere. Python is a descendant of ABC, developed at CWI in the Netherlands by Guido van Rossum. ABC was designed for simplicity and interactivity, in some ways an improvement over BASIC for beginning programmers. Unlike ABC, Python was intended to be suitable for experts writing many kinds of programs. Python's start, as an unsupported one-student hobby effort, predates and resembles Linus Torvald's invention of Linux. In addition to its simplicity and interactivity, Python's popularity as a scripting language is also due to its clean object-oriented design, which allows Python programs to be far more readable and maintainable than similar scripting languages such as Perl and Tcl. Also, unlike the authors of Perl and Tcl, van Rossum bothered to actually learn from and occasionally borrow the best ideas from important languages including Lisp and Icon. Arguably, van Rossum came up with a better syntax than Lisp, and a more extensible but weaker and clumsier syntax and semantics than Icon. In any case, Python is much better for having learned from other languages, something untrue of a surprising number of scripting languages.

Organization of the Source Code

Python's source code is organized into around 250 directories in a fairly broad-but-shallow hierarchy. Many of the directories are named for specific platforms, such as PC or Mac. Others, such as Demo, Doc, and Misc contain non-code which we will not consider here. The Include/ and Python/ directories are obvious places for a new reader of the C implementation to start. Other directories with core C code are Modules/, Objects/, and Parser/. Lib/ contains Python standard library code. Python's Grammar/Grammar gives a non-YACC EBNF, which is implemented handwritten code in Modules/parsermodule.c

Header Files

Include/ contains around 70 C header files comprising around 9K lines of code. About 50 of the .h files are less than 100 lines in size. The logical starting point for looking at Python's headers is Python.h, which mainly includes other header files, including a small number of "universal" system includes, as well as most (~45) of the other Python header files from this directory.

Source Files

The Python/ directory contains around 50 C source files comprising some 30K lines of code. About half of these files are under 100 lines long, and several of these small files contain semistandard functions that are in the standard C libraries on some but not all platforms.

Configuration and Building




















Chapter 11: Data Representation

The PyObject Structure

Type PyObject, defined in Include/object.h, provides a uniform interface for all Python objects. It includes a reference count, a pointer to an object type, and optionally at interpreter build-time, a pair of object pointers with which every heap object in the system can be inserted into a doubly-linked list.


















Chapter 12: Built-in Functions

File Python/bltinmodule.c contains a great deal of the C code for functions that are built-in to the runtime system. Functions in this file take and return pointers to PyObject instances for the most part; their prototypes vary slightly depending on whether they operate on a single parameter, or on an arbitrary number of parameters.

Built-in Methods

Built-in methods are are all C static functions, visible through a static array builtin_methods[], which in turn is used to initialize module __builtin__. Built-in methods are exemplified by the abs() function, given below. The C function is named builtin_abs(), but that is just a wrapper providing the interpreter with a standard parameter signature. The actual logic for calculating the absolute value of a Python object is determined by calling an internal function PyNumber_Absolute(), defined in Objects/abstract.c, and used in three other places in the runtime system, in addition to being used to implement the built-in function.
static PyObject *
builtin_abs(PyObject *self, PyObject *v)
{
	return PyNumber_Absolute(v);
}

PyDoc_STRVAR(abs_doc,
"abs(number) -> number\n\
\n\
Return the absolute value of the argument.");



















Chapter 14: Memory Management

Memory management is one of the most important areas of the Python runtime system, and fortunately, one of the best documented in the source code. Extensive documentation can be found in comments in Objects/obmalloc.c.

Python uses three progressively higher level abstraction levels on top of the C malloc() library: a raw memory allocator, a generic object allocator, and a type-specific allocator.

Raw Memory Allocator

Object Allocator

The object allocator lives in Objects/obmalloc.c. Allocations are rounded up to the nearest 8-byte blocksize and categorized by size, up to the large-object threshhold of 256 bytes. There are 33 different size classes, classes 0-31 corresponding to block sizes 8-256, and then a final class holding blocks > 256 bytes.

Type-specific Allocators

Allocation of course must be followed by constructor/initialization code. This magic is found in Objects/object.c/_PyObject_New(), which takes a type object, malloc's the correct space for an instance, and calls the init() for that instance.

By inserting instrumentation into _PyObject_New(), it is possible to report memory allocations by type and size. Warning: memory allocations are a frequent event! Performance may be significantly impacted!

Visualizing Memory Allocation

The complexity and importance of this subsystem make it a ripe target for visualization.

Garbage Collection

Historically, Python used reference counting to perform its garbage collection, causing problems in the small fraction of programs that created cyclic data structures. Eventually, Python followed the same evolutionary path as Lisp, and added cycle detection and collection.

However, Python remains primarily a reference-counting system. One can expect the biggest impact of this decision is the performance loss associated with pervasive Py_INCREF and Py_DECREF macros throughout the implementation, especially adding two memory reads/writes to every assignment. A lesser side-effect is the addition of a reference count field to every heap object in the system.