CPython Object System Internals: Understanding the Role of PyObject
Understand how objects are implemented in CPython and how CPython emulates Inheritance and Polymorphism in C using struct embedding
In the past, I’ve written a few articles covering the internals of CPython and they have been received quite well. I want to write more in this area, but in a more systematic way, so that we have a solid foundation when we dive deeper in future articles.
In this article, I plan to cover a basic idea behind how objects (or the data types) are implemented and represented within CPython. If you look at the CPython code, you will see a lot of references to PyObject, it plays a central role in the implementation of objects in Cpython. This is a crucial detail to understand, and if we decode this part, everything else becomes simpler. So, let’s do this!
Understanding Object Implementation: Why Polymorphism Matters
A programming language like Python supports various kinds of data types, such as integers, floats, bool, strings, dicts, lists, user defined classes (just to name a few). When implementing the language, we want to do it in such a manner that the core engine is agnostic of the knowledge of the different object types supported by the language. This keeps the code clean, and ensures that the implementation details of the types do not leak out and pollute the rest of the compiler code.
How can we achieve this? One idea is to leverage polymorphism. For instance, we could declare an interface to define the common behavior of all the data types supported by our language, and every data type of our language implements this interface. The following Java code shows an outline of such an approach:
public interface DType {
    // adds two objects
    DType add(DType a) throws NotSupportedException;
    // multiplies two objects
    DType mult(DType a) throws NotSupportedException;
    ...
}
// the integer data type
public class IntDType implements DType {
  // implementation not shown
}
// the float data type
public class FloatDType implements DType {
  // implementation not shown
}
This approach ensures that the implementation details of the data types remains encapsulated within them. When implementing the rest of the language, we can create a unified interface between different modules where they exchange data between them in terms of objects of type DType. As an example, let’s see how we might implement the virtual machine taking advantage of this unified interface for representing the objects.
Most interpreted programming languages implement a virtual machine (VM) for executing their code. These virtual machines support instructions just like real hardware-based machines, but they are implemented in software. Python uses a stack based VM which uses a stack for storing and retrieving data while executing the instructions.
The following code shows how the stack virtual machine (VM) implementation might look like in Java.
import java.util.Stack;
public class VM {
    private Stack<Dtype> stack = new Stack<DType>();
    public DType execute(Bytecode code) throws RuntimeError {
        // We decode instructions from the bytecode and execute them
        // one by one
       while (!endOfInstructions(code)) {
           Opcode opcode = decodeNextInstruction(code);
           switch (opcode) {
               case BINARY_ADD:
                   DType lhs = stack.pop();
                   DType rhs = stack.pop();
                   DType result = lhs.add(rhs);
                   stack.push(result);
                   break;
               // more cases after this not shown
               ...
            }
       }
    }
}
As you can see, the stack stores and returns objects of type DType
, and the VM simply calls the interface methods on the DType
objects without worrying about the concrete types of those objects. This approach significantly simplifies the implementation of the VM.
CPython follows the same design. However, as it is implemented in C which is not an object oriented language, a clean design like this becomes a challenge. But, there is a very simple technique to mimic inheritance and polymorphism in C, which is what CPython employs. Let’s understand how.
How CPython Simulates Inheritance & Polymorphism to Implement Objects
Using inheritance in an object oriented language requires us to define a parent class with all the common attributes and behaviors that should be inherited by all of its child classes. To achieve the same effect in C, we need to define a parent struct which contains all the common attributes.
This is where the PyObject
struct comes into the picture. CPython defines a struct called PyObject
for this purpose. It is defined in the file Include/object.h, and shown in the following figure.
The
PyObject
struct contains two fields (at the time of writing this article):ob_refcnt: This field reflects the number of references to this object. Python uses reference counting as a mechanism for managing the memory of its runtime. I have written a very detailed article on how reference counting works in CPython, you should check it out for more details.
ob_type: This field is a pointer to an object which contains type related information about the object, such as the name of the type, amount of memory needed to allocate an object of this type. Additionally, it also maintains a table of function pointers, called the method table which includes implementation for type specific behaviors such as numeric operations for numeric types, sequence methods for collection types, etc.
The next part of inheritance is implementing the children such that they inherit the parent's attributes. In an object oriented language this is automatically achieved using the language’s syntax. However, doing this in C requires a bit of manual work.
To simulate inheritance between two structs, we need to embed the parent struct as the first field in the child’s definition. Let’s look at a couple of examples of this from CPython.
The following figure shows the definition of the PyFloatObject which defines the float datatype in CPython:
As you can see, the first field here is an instance of PyObject
, while the second field is the value of the object itself. Similarly, the following figure shows the definition of the complex type in CPython:
This pattern can be seen in the implementation of all the datatypes supported by CPython and many other key internal objects within the CPython machinery.
All of this is fine but it may still not be clear how the embedding of the PyObject
struct as the first field simulates inheritance? Let’s discuss that in the next section.
How Does Struct Embedding Simulate Inheritance?
By embedding the PyObject
struct as the first field in every object’s definition, CPython ensures that their in-memory representation starts with a common header which is formed by the PyObject
itself. This is evident in the following figure:
Now, achieving the ‘is-a’ relationship of inheritance simply requires us to switch the interpretation of these bytes from the child type to the parent type, i.e., PyObject
, which can be done via typecasting. The following figure illustrates what happens when we typecast these objects to PyObject
type:
In essence, CPython simulates inheritance by maintaining a consistent memory layout for the beginning of all objects using the PyObject
header.
Moreover, the ob_type
field within the PyObject
struct points to a PyTypeObject
, which contains a table of function pointers, referred to as the type's method table. This table implements behaviors specific to the type, such as allocation, deallocation, addition, representation, length retrieval, among others. By providing a unique implementation of these function pointers for each object type, CPython achieves polymorphic behavior.
Let’s see how this simulated polymorphism works by dissecting instruction execution in the CPython VM.
Polymorphism in Action: Inside The CPython Virtual Machine
At this point we understand how CPython simulates inheritance and polymorphism using the PyObject
header. To get a concrete understanding, let’s see how this comes handy in the implementation of the VM.
The following figure shows how the CPython VM might execute a simple function which adds its parameters. The left hand side shows the function definition and its bytecode, while the right hand side steps through the bytecode instructions, and shows the state of the stack after having executed the current instruction.
This is a simplified view but highlights the parts we are interested in. During the execution of the instructions, the VM has three stores for storing and retrieving the data. These are the stack, the locals table, and the globals table.
The locals table stores all the objects which are local to the current scope of execution, whereas the globals table stores the globally visible objects. The stack is used by the instructions for temporary storage of data for their operation. All of these three store objects of type PyObject *
, which allows them to seamlessly work with each other.
In this example function, since we are adding the two parameters of the function, those parameters are local to the current scope and are therefore stored in the locals table.
So far so good. We need to look inside the execution of the LOAD_FAST
and BINARY_OP
instructions because all the actions is there. Out of these, the BINARY_OP
instruction is the interesting one where polymorphism comes into play. But for the sake of completeness we will discus the implementation of both LOAD_FAST
as well.
Execution of the LOAD_FAST Instruction
The following figure shows the implementation of the LOAD_FAST and annotates the relevant parts:
The LOAD_FAST
instruction’s job is to load an object from the locals table and push it onto the stack. To do this, the instruction needs to know the index of the object in the locals list. This index is provided as an argument in the instruction bytecode. In the CPython code, the oparg
variable holds these arguments. Note that because both the locals list, and the stack store objects of type PyObject *
, this whole operation becomes very simple.
Next, let's look at the implementation of the BINARY_OP
instruction.
Execution of the BINARY_OP Instruction
The following figure shows the implementation of the BINARY_OP
instruction:
The BINARY_OP
instruction pops the top two values from the stack and performs the binary operation on them. The result of the binary operation is again an object of type PyObject *
, which is pushed as the top value onto the stack, and ultimately, that becomes the return value of this function.
To execute the correct binary op, the VM contains a table of function pointers called binary_ops
(see the file ceval.c), which is indexed by the opcode of the binary operation.
The functions in this table themselves don’t perform the binary op because the VM code is type agnostic. Instead, these functions simply dispatch the execution of the op to the underlying type’s implementation.
As you might recall, the ob_type
field in the PyObject
header points to a structure which contains a table of function pointers. Each type implements the functions for the operations it supports and populates this table. And, the VM code is simply dispatching the execution to these functions via the object header without ever knowing the concrete type of the objects. This is how the PyObject
header simulates polymorphism in the CPython object machinery.
Although showing how all of this happens inside the CPython VM code is impractical, the following hypothetical code demonstrates the method dispatch logic:
/**
 * This is not actual code from CPython
*/
PyObject * PyNumber_Add(PyObject *v, PyObject *w)
{
    binaryfunc slotv;
    // if v and w are not numeric types, we return NotImplementedError
    if (v->ob_type->tp_as_number == NULL)
        return Py_NotImplemented;
    if (w->ob_type->tp_as_number == NULL)
        return Py_NotImplemented;
    // look up the implementation of binary add op in v
    slotv = v->ob_type->tp_as_number->nb_add;
    PyObject *res = slotv(v, w);
    return res;
}
Key Takeaways
And that covers everything I wanted to cover about the role of PyObject
in CPython’s object system implementation. Let’s summarize the key takeaways:
Many programming language implementations use polymorphism and inheritance for implementing their object system, which results in a unified interface between different modules of the compiler and runtime.
As CPython is implemented in C, it employs a simple trick to mimic inheritance and polymorphism in C. This trick involves creating a parent struct definition and embedding it as the first field in the definition of every child struct. This results in each object having a unified memory layout for its header, enabling the interpretation of the bytes as the parent struct seamlessly.
In case of CPython, the
PyObject
struct plays the role of the parent struct. It consists of two fields: the object reference count, and the type related data about the object.The rest of the CPython implementation mostly works with objects of type
PyObject
. The CPython VM dispatches calls to the type specific methods via the function pointer table present in the object header.
Wrapping Up
This article sets up a good foundation for us to cover some more interesting details of the CPython compiler and runtime, which we shall do in the coming days. I hope you enjoyed reading it. If you have any questions, or suggestions, let me know through comments or replies.
This reminds me how the first C++ compilers at AT&T translated C++ source code into C source code, presumably using the same methods you’re describing to implement polymorphism and inheritance. If you’re curious see Design and Evolution of C++ by Stroustrup.