How CPython Implements and Uses Bloom Filters for String Processing
Inside CPython's Clever Use of Bloom Filters for Efficient String Processing
In our last discussion we learned all about bloom filters. It’s a unique data structure that provides membership queries in constant time while using a minimal quantity of memory. Primarily, you will find them being used in large scale and streaming applications where it is infeasible to keep all the data in memory. Examples include NoSQL databases, CDNs, load balancers, etc. However, it also has uses in some unexpected places. For instance, Python uses them in some of its string processing APIs. As string processing is one of the most common tasks in real-world code, it has to be fast. Therefore, the situations in which Python has used bloom filters and the way it has implemented them makes for an excellent case study.
In this article we will examine in detail the places where CPython has used bloom filters. We will also cover the specific implementation detail of the bloom filter inside the string data structure of CPython and analyze how it works. So let’s get going!
Python vs CPython: I am strictly talking about the details of CPython implementation in this article and might use Python and CPython interchangeably.
CPython Version: All the discussion and code shown in this article refers to the release 3.12 of CPython.
The Role of Bloom Filters in Python Strings
Before diving into how Python implements and uses bloom filters, it is important to understand why Python uses them. It will motivate and add more context when we examine CPython’s internals to understand the bloom filter implementation.
As we are focusing on strings, and strings are comprised of individual characters, the obvious use case of bloom filters is to check if a character in a string is part of a larger set of characters. Let’s get more specific.
How splitlines() API Utilizes Bloom Filters
One place where Python uses bloom filters is in the implementation of the splitlines() API when splitting a string by line breaks. This would have been simple if a line break could only have a single representation (e.g., ‘\n
’). However, as it turns out, there are multiple representations for line breaks. For example, in Unicode, there are eight different code points that can be used to represent a line break. The following listing from CPython shows them.
const Py_UCS2 linebreak[] = {
0x000A, /* LINE FEED */
0x000D, /* CARRIAGE RETURN */
0x001C, /* FILE SEPARATOR */
0x001D, /* GROUP SEPARATOR */
0x001E, /* RECORD SEPARATOR */
0x0085, /* NEXT LINE */
0x2028, /* LINE SEPARATOR */
0x2029, /* PARAGRAPH SEPARATOR */
};
The splitlines()
API must analyze each character of the input string and check if it's one of these eight possible line break characters. We'd prefer not to do eight comparisons for each character in the string, as that would be too slow.
A faster option would be to use a Set. However, you need to compute hashes to do lookups in a set, and even that can get expensive when we have to do it so frequently.
Instead, the CPython developers use a bloom filter. Yes, bloom filters also need to use hash functions (and usually more than one), but the CPython developers have implemented the bloom filter in a way that circumvents costly hash functions. It's too early to describe how they achieve that; wait until we get to the implementation of the bloom filter.
The following illustration shows the implementation of the splitlines()
API from CPython code along with relevant lines annotated with explanations.
You can find the full implementation of the splitlines API here.
The Application of Bloom Filters in the strip() API
Another place where Python has used bloom filters is in the implementation of the strip() API. The strip() API is used to remove a specified set of characters from the beginning, ending, or both ends of the string. The following image shows its docstring along with examples from the official Python documentation.
As we can see, the number of characters to be stripped off is not fixed, and we can specify as many as we want. The strip()
API begins by reading characters from the string's start and checks each character to see if it should be stripped.
As this set of characters to be stripped is potentially unbounded, we can't do a one-by-one comparison with each member of the set — it will be too expensive. Instead, using a bloom filter will give a faster result. In most cases, the result is going to be negative, and a bloom filter can give a 100% accurate negative result, thus saving unnecessary comparisons. In the odd case when it does give a positive result, we need to do an actual comparison of the character and see if it matches with one of the characters in the "to-be-stripped-off" set of characters.
The following illustration shows the implementation of strip() from CPython code and annotates the parts that create and query the bloom filter.
With this added context about where and how Python uses bloom filters, we can move on to see how CPython has implemented the bloom filter, and how have they avoided use of expensive hash functions. Let’s see this in the next section.
Breaking Down CPython’s Bloom Filter
The bloom filter implementation in CPython is not our typical bloom filter; it is different and can be difficult to recognize at first glance. Therefore, instead of directly jumping to the CPython code, we will first break down the essential details of its bloom filter implementation, which will make it easier to grasp the C code when we see it.
A bloom filter implementation needs two things: a bit vector to store the items and hash functions to determine which indices in the bit vector need to be set for adding an item. Let’s first see the bit vector representation for this bloom filter implementation.
Bit Vector Definition for the Bloom Filter
As a byte is the smallest amount of addressable memory in a computer, we cannot directly create a bit vector. Typically, we determine how many bytes are required to represent the bit vector of a given size and then create a byte array of that many bytes. In the bloom filter article, that's what we did.
In the case of CPython, they don't need a very large bit vector. Its primary use of the bloom filter is to check whether a specific character in a string is amongst a larger set of characters. For instance, if a character is one of the many variants of line break characters.
For storing characters of a string, a bit vector of size 32-64 (4 to 8 bytes) suffices. Allocating a byte array on the heap for such a small number of bytes is overkill, especially when considering that string processing functions such as strip()
and splitlines()
can be called very frequently in applications.
Instead, CPython uses an unsigned long
type integer to represent the bit vector. On 64-bit hardware a long
integer type is represented by 64 bits, thus giving us a bit vector of size 64. And, it does not require heap allocation to create it.
Using an unsigned long
for the bit vector makes the process of storing items or querying items in the bit vector fast as well. For example, if we want to see if a certain set of bits are on in the bit vector (when querying the bloom filter), we can do this efficiently in a single comparison using a mask. For example:
// create a bloom filter with characters a,b, and c added to it
unsigned long bloom_filter = create_bloom_filter("abc");
// create a mask for checking characters 'a', and 'b'
unsigned long mask = create_mask("a");
mask |= create_mask("b");
// check if 'a' and 'b' are present in the bloom filter
bloom_filter & mask == mask;
Adding Items to the Bloom Filter
Since CPython is using the bloom filter as part of string processing functions, the bloom filter must be fast. Using hash functions to compute indices in the bit vector is expensive in this context. Therefore, CPython employs a clever workaround.
CPython represents strings as UTF-8 encoded strings, where each character in the string can be encoded using one to four bytes. To add a character to the bloom filter, we select the lower n-bits of the character, which represents a number between 0 and 2^n - 1 in decimal, and set the corresponding index on in the bit vector. The listing below shows how this works exactly.
How does it determine how many bits to use? It uses the width of the long
datatype of the platform to decide. For example, on 64-bit hardware, a long
value is represented by 64-bits, and 6-bits are needed to represent values up to 64 (since 2^6 = 64), so it would use the lower 6-bits of the character to add an item to the bloom filter. If a long
value is represented by 32-bits, only 5-bits would be used in the same process.
The following illustration shows a simplified version of the code for adding a character to such a bloom filter. It defines the constant BLOOM_WIDTH
as 64, which means that the lower six bits of a character will be used to add it to the bloom filter.
This code capitalizes on the fact that every number that immediately precedes a power of 2 is represented by all 1’s in binary. For instance,
64
is a power of2
(64 = 2^6
), therefore63
is represented as ‘00111111
’ in binary. Doing a bitwiseAND
of 63 with any other value will select its lower six bits.
Locating a Character in the Bloom Filter
The procedure to query the bloom filter for the presence of a character is similar. We use the lower n-bits of the character to find the index to check in the bloom filter. The following listing features an illustration of how the mask computation and querying work.
The code shown in the above listings is a simplified version of the bloom filter implemented in CPython. If you understood that, then the CPython code should easily follow. Let’s take a look at it in its full glory!
Inside the Bloom Filter of CPython
You can find the bloom filter implementation in the file Objects/unicodeobject.c. This file implements most of the string APIs in CPython along with logic for encoding/decoding strings to and from various Unicode formats such as UTF-8, UTF-16, and UTF-32. The following illustration lists the full bloom filter code from it along with annotations to explain the important parts.
Let’s discuss the main parts of this implementation:
The macro BLOOM
is used to query a bloom filter for checking the presence of a character. This is almost identical to the function is_char_in_bloom()
we saw in the previous section, except here it is presented in the form of a macro.
The function make_bloom_mask
creates a new bloom filter, adding all characters from the string represented by ptr
. Inside this function, the BLOOM_UPDATE
macro runs the heavy-duty task of iterating through the string's characters and adding them to the bloom filter. It is just an extended version of the add_to_bloom_filter()
function we saw above.
If you're finding the switch case in make_bloom_mask
puzzling, you're not alone. To understand it, let's recall that Python uses UTF-8 encoding for strings. With UTF-8, each character can take 1-4 bytes. CPython can represent a string as one of three possible types, depending upon the maximum size of a character in the string:
PyUnicode_1BYTE_KIND: Each character is represented by one byte in these strings and the type of the character is
Py_UCS1
which is auint8_t
(8-bit unsigned integer) under the hood.PyUnicode_2BYTE_KIND: Each character is 2 byte wide and the type of the character is
Py_UCS2
which is auint16_t
(16-bit unsigned int) underneath.PyUnicode_4BYTE_KIND: Each character in the string is represented by 4 bytes, and the type of the character in these strings is
Py_CS4
which isuint32_t
(32-bit unsigned int).
In the switch case, the type of string is determined, and the correct pointer type passed to the BLOOM_MASK
macro. If you're interested in seeing the string types' actual definitions in the CPython source code, you can find them here.
That was the full bloom filter implementation for handling strings in CPython in less than 50 lines of C code. If some parts of still look complicated, I suggest going back to the simplified versions I showed in the previous section, and understanding those better. Even if the CPython version is not 100% clear at this moment, it’s fine. As long as you understand the basic idea, you can apply it anywhere.
Key Takeaways
Python uses Bloom filters in its string data structure, to improve performance of certain string processing functions, such as strip and splitlines.
Bloom filters in Python are used to avoid resource-intensive individual comparisons and compute hashes.
CPython’s Bloom filter implementation is unique; it uses an unsigned long type value to represent the bit vector, streamlining storing/querying items in the bit vector.
In the process of adding to the Bloom filter, CPython determines the position in the bit vector by using the lower n-bits of the character. This resulting value is used to set the corresponding index in the Bloom filter to 'on'
When querying the Bloom filter, CPython uses a similar process. It determines the position in the bit vector by deriving the lower n-bits from the character. Then it checks whether the corresponding index in the Bloom filter is 'on'. If so, the character is possibly present.
Conclusion — A Close on Python's Bloom Filters
This in-depth exploration of how Python cleverly uses Bloom filters for its string data structure has given us a closer look at the workings of the language. This is just one of the many intriguing data structures Python employs, offering a glimpse of the language's capabilities under the hood. Stay tuned for future explorations into other elaborate structures like Hash Array Mapped Tries and their applications within Python. Let me know what you'd like to learn more about in the comments.
Support My Work
Enjoyed the article? Express your appreciation through likes and shares — they motivate me to produce more in-depth content. Also, consider pledging for my upcoming paid subscriptions to support my work. However, no charges will apply until the I turn paid subs on.
This is very interesting.
I think this is incorrect:
// check if 'a' and 'b' are present in the bloom filter
bloom_filter & mask != 0;
I believe it's supposed to be this:
// check if 'a' and 'b' are present in the bloom filter
bloom_filter & mask == mask;
Awesome article - shared it with people on my slack channels