Bloom Filters and Beyond: An Illustrated Introduction and Implementation
Mastering the Art and Science of Bloom Filters
During my explorations of the Python internals (see reference counting internals and immortal objects), I found a fascinating detail — the creative use of the bloom filter in its implementation of the string data structure. A bloom filter, is a brilliant and efficient data structure that swiftly searches large datasets while minimizing memory usage.
Recognizing that some of my readers may be new to this concept, I've decided to divide this into a two-part series. The first part serves as primer on bloom filters, focusing on their operation, their implementation, and a more advanced topic — the counting bloom filter.
So, there's something in this article for everyone. If you're unfamiliar with bloom filters, this will serve as a solid foundation. For those who know the basics but haven't yet implemented one, this discussion on the mechanics will prove helpful for the next article on Python’s internals. To my readers who are bloom filter veterans, stick around for the discussion of counting bloom filters — it's bound to be enlightening. Enjoy the read!
Understanding the Basics of Bloom Filters
A Bloom filter is a probabilistic data structure used for membership queries in large data sets, i.e., whether an item is part of the data set or not. The answer from the Bloom filter can either be “the item definitely does not exist” or “it probably exists”. Notice the emphasis on the word "probably"; this means that when the Bloom filter returns a positive result, it is not always accurate and there is some chance of a false positive. However, when it produces a negative answer, it is 100% correct.
Bloom filters are typically used in big data applications where keeping all the data in memory is unfeasible. They offer constant time lookup performance while consuming only a fraction of the memory of the actual data set. The next section will discuss their implementation to understand how they achieve this.
Exploring the Mechanism of Bloom Filters
A Bloom filter uses a bit vector for maintaining data item membership. The length of the vector is proportional to the number of items expected to be stored in it.
To be memory efficient, the Bloom filter does not store the actual items in the vector. Instead, generates a k-bit long bit signature for each item and sets the corresponding bits on in the bit vector. To generate the k-bit bit signature, it uses k hash functions.
Each of these k hash functions should be independent and should uniformly distribute the values across the bit vector. In other words, two different items should very rarely result in the same bit signature. Ideally, we want to use computationally fast hash functions with a low collision rate. Some of the commonly used hash functions are MurmurHash, Jenkins hash, City hash, and xxHash.
The following listing shows the definition of the BloomFilter
class in Python
import hashlib
class BloomFilter:
def __init__(self, size: int, nhash: int):
self.size: int = size
self.nhash: int = nhash
# Number of bytes required to store size number of elements
num_bytes: int = (size + 7) // 8
# Initialize a bytearry with all zeros
self.bit_vector = bytearray(([0] * num_bytes))
self.hash_functions = [hashlib.sha256,
hashlib.sha1,
hashlib.md5,
hashlib.sha384,
hashlib.sha512]
def _hash(self, data: str, seed: int) -> int:
hasher = self.hash_functions[seed % len(self.hash_functions)]()
hasher.update(data.encode('utf-8'))
digest = int(hasher.hexdigest(), 16)
return digest % self.size
Here, we used Python's bytearray object, which allocates a given number of bytes. As a byte is the smallest amount of addressable memory in computers, the only way to create a bit vector is to create a byte array and then access individual bits within it using bitwise operations.
We have used a bunch of cryptographic hash functions in this implementation, but in real-world implementation you might want to use faster hash functions such as MurMur hash.
We will flesh out the remaining parts of BloomFilter
in the next couple of sections, where we discuss the add and query operations of Bloom filters.
Adding an Item to a Bloom Filter
When trying to add an item to a bloom filter, there are several steps we must undergo which involve manipulating the bit vectors. Let's break this process down into easy-to-understand steps.
Algorithm for Adding a Data Item
Let's introduce the algorithm that delineates the process of adding a data item x
to a bloom filter.
Algorithm: Adding an item to the bloom filter
Input: bloom filter with k hash functions, and item x
1. for h in hash_functions:
2. i = h(x)
3. filter[i] = 1
This algorithm is fairly simple. Each hash function returns an index position within the bit vector which we need to set to 1 to mark the item as a member of the dataset.
An Illustrative Example
Imagine storing the string “apple”
in a bloom filter of length 8 using 3 hash functions. The illustration below depicts how this would work in practice:
Python Implementation
Now, let’s translate this process into the Python language to perform the add operation:
def add(self, item: str) -> None:
for i in range(self.nhash):
index = self._hash(item, seed=i)
byte_index, bit_index = divmod(index, 8)
mask = 1 << bit_index
self.bit_vector[byte_index] |= mask
Adding an item to the Bloom filter involves steps such as computing the byte index and bit index and updating the bit vector. Let's understand these individual steps in more detail.
Computing the byte index and bit index
A central part of the 'add' operation involves determining the byte index and bit index for the given item.
Looking at the diagram above, we can see that we need to firstly identify which byte contains the intended bit for storing the representation of our item. This is achieved by simply doing integer division of the index value by 8, hence providing us with the byte index.
Secondly, inside that byte, we need to figure out the specific bit to be set. The modulus operation (index % 8
) gives us the bit index within the byte.
Updating the bit vector
Once we've located the specific bit, the next step involves modifying its value in the bit vector. Now, because a byte is the smallest addressable unit of memory, we can't straightforwardly change a single bit in Python. Instead, we'll need to use a "bit mask."
A "bit mask" is simply a sequence of bits that we use as a 'mask' to compare, or alter, the bits in our bit vector byte. For the add operation, we want to create a bit mask that has a 1 in the positional index we want to change, and 0s everywhere else.
mask = 1 << bit_index
In this line, we're generating our bit mask. In binary, the number one has all bits set to 0, except for the rightmost one (e.g., '00000001'). By left shifting it by bit_index
positions, the '1' moves to the same position as bit_index
, creating the bit mask we need.
So, how do we use this bit mask to update our bit vector? We use a bitwise OR operation.
self.bit_vector[byte_index] |= mask
The bitwise OR of the byte and the mask ensures that the bit at the desired position is turned on, while all other bits remain unchanged. That way, we've added our item to the bloom filter in exactly the right spot without disturbing the other bits.
Querying a Bloom Filter
Once we’ve added items to a bloom filter, we would often want to then query whether an item exists in our filter or not. This section explains the process of querying a bloom filter.
Algorithm for Querying in Bloom Filters
Algorithm: Querying if an item x exists in the bloom filter
Input: bloom filter with k hash functions and item x
Output: True if item may exist, False otherwise
1. for h in hash_functions:
2. i = h(x)
3. if filter[i] != 1:
4. return False
5. return True
This algorithm is straightforward: it merely iterates through all the hash functions and checks if the related bits are turned on.
Python Implementation
def query(self, item: str) -> bool:
for i in range(self.nhash):
index = self._hash(item, seed=i)
byte_index, bit_index = divmod(index, 8)
mask = 1 << bit_index
if (self.bit_vector[byte_index] & mask) == 0:
return False
return True
This Python implementation checks if the bit at bit_index
position is set in the filter's bit vector for each hash function, and quickly returns False
the moment it finds a bit that isn’t set — a clear indicator that the queried item has never been added to the bloom filter.
It uses the bitwise AND operation to compare the mask with the byte from the byte array. The bitwise AND compares two bit patterns and results in a new pattern where a bit is set to 1 only if the corresponding bits were set in both the patterns.
In our case, the bit mask has only the bit at the bit_index
position set to 1. Consequently, if this corresponding bit in the selected byte is not set to 1, the result of the bitwise AND will be 0, as shown in the following illustration.
This section should have provided a deep understanding of how querying works in bloom filters. This also sets the stage for us to understand why it can yield false positive results. Let’s discuss that in the next section.
Understanding False Positives in Bloom Filters
When we are storing an item in the bloom filter, we are turning specific bits on in the bit vector. When querying for that item, if any one of those bits is not on, that clearly means that this item was never stored in the filter.
However, if all the bits are on for that item, there is no guarantee that these bits were set by that same item. It’s also possible that these bits were turned on while adding different items. This can happen because the hash functions can map multiple items to the same indices in the bit vector.
In the illustration above, we can see the state of the bloom filter after storing two items in it. Let's say at this point we make a query for the item “banana”
in the filter — the values returned by the hash functions are as follows:
h1("banana") = 3
h2("banana") = 2
h3("banana") = 6
The three hash functions generate indices 3, 2, and 6 respectively. All of these bits are set to 1 in the filter. Does this mean that “banana”
was stored in the filter? We know that’s not the case. The bits 3, and 6 were turned on for storing “apple”
while the bit 2 was turned on for storing “orange”
. This is clearly a false positive.
If you are concerned about false positives — don’t worry. The rate of false positives can be controlled by tuning the size of the bloom filter and the number of hash functions. Let’s take a look at that next.
Tuning a Bloom Filter
To achieve maximum performance with a bloom filter, we need to carefully tune its parameters. This section provides formulae for tuning the false positive rate, size, and number of hash functions of a bloom filter.
I will not show the derivations of these formulas here to keep the article length in check. However, for those interested in the math, check out this article.
False Positive Rate of a Bloom Filter
For a bloom filter of size m, with k hash functions and expected number of items to be stored as n, the approximate false positive rate can be calculated using the following formula:
Based on this formula, we can see that the false positive rate decreases as we increase the size of the bloom filter (m) and it increases as the number of elements inserted in the bloom filter (n) increases.
Optimal Number of Hash Functions for a Bloom Filter
For a bloom filter of size m, with k hash functions and n number of items to be inserted, the optimal value of k can be computed using the following formula:
Optimal Size of a Bloom Filter
Finally, the following formula can be used to compute the optimal size of a bloom filter to achieve a false positive rate ϵ
for storing n number of items:
All of this discussion is useless, if we don’t know in what situations bloom filters are used. Let’s discuss some of their applications in the real-world.
Applications of Bloom Filters
Bloom filters come into play when constraints, such as data size, hinder the storage of an entire dataset in memory. They efficiently flag non-existent items, saving costly processing. In the case of false positives, a fallback option—e.g., searching the actual disk data— counters this issue. With that in mind, let’s look at couple of real-world usages of bloom filters.
Bloom Filters In Database Indexes
Databases such as Cassandra and RocksDB use log-structured merge-trees (LSM) for indexing data, which are stored in disk files, known as SSTables. A database query prompts a search through all SSTables. To accelerate this process, a bloom filter is attached to each file. Since a key only exists in one file, the bloom filter can efficiently dismiss most files. If the bloom filter indicates a possible match, the corresponding file is searched. To minimize false positives, the bloom filter is fine-tuned.
Bloom Filters In Distributed Caches
A distributed cache splits contents across several nodes. The cache's master node directs the query to the correct node containing the data, without storing which node holds which keys. This is done using a bloom filter for each node. If a lookup request reaches the master node, it refers to each node’s bloom filter. The master node routes the request to any nodes possibly holding the queried data.
So far we have seen the two basic operations of bloom filters — adding and querying items. However, it is not possible to delete an item from a bloom filter. Notably, there is an extension called the Counting Bloom Filter which does have the ability to delete an item. Let’s see how it works.
Enabling Item Deletion in Bloom Filters with Counting Bloom Filter
A traditional bloom filter doesn't provide a way to delete an item. If we want this functionality, we would need some way to remember which items were stored in it. However, we want to achieve that without actually storing the original data in it, otherwise it defeats the purpose of a bloom filter. We can do this by using an array of counters for each bit in the bit vector, i.e., if our bit vector is of size m, we need an array of counters of size m. This is also called a counting bloom filter.
Adding an Item to a Counting Bloom Filter
Whenever we add a new item to the counting bloom filter, we will first increment the counter for the corresponding bit position and we will turn on the bit in the bit vector only if its counter value is 1 after the increment. The following listing shows the updated algorithm for adding an item in this scheme.
Algorithm: Adding an item to the bloom filter
Input: bloom filter with k hash functions, and item x and m counters
1. for h in hash_functions:
2. i = h(x)
3. counters[i].increment()
4. if counters[i] == 1:
5. filter[i] = 1
The algorithm for query remains unchanged in this scheme, so we will not discuss it. However, we can now delete an item. Let’s discuss that next.
Deleting an Item from a Counting Bloom Filter
As you might have seen, when adding a new item to this bloom filter, we are counting how many times each bit position has been turned on. Therefore, when deleting an item, we are going to decrement the counters for each of the bit positions for that item. When a counter reaches 0, we set the corresponding bit index in the bit vector to 0 as well. At this point, if a query for that item comes to the bloom filter, it will return false. Here is the algorithm for that.
Algorithm: Deleting an item from the bloom filter
Input: bloom filter with k hash functions, and item x and m counters
1. for h in hash_functions:
2. i = h(x)
3. counters[i].decrement()
4. if counters[i] == 0:
5. filter[i] = 0
Counting bloom filters are a simple extension of regular bloom filters. However, there are few weaknesses which we need to be aware of when using them.
Weaknesses of Counting Bloom Filter
In case of high number of hash collisions, the same counter may be incremented for several items. Therefore deleting an item may not always mean that we will get a negative result when querying for it later.
We need to be careful with overflow of the counters. If storing a large number of items in the bloom filter, we need to use a large enough counter to be able to count that many items.
Finally, the use of counters requires extra memory overhead.
Reader Challenge
I am not going to show the implementation of the counting bloom filter. I leave it as a challenge for you. Bonus points if you implement it in a language other than Python. Share the link of your code in the comments. I would love to check it out!
Resources
You can find the bloom filter implementation shown in this article on my github.
Conclusion and Looking Ahead to Part 2: Python’s Uses of Bloom Filters
In this article we covered bloom filters in great detail along with their practical applications and an implementation in Python. We also discussed counting bloom filters, which allow deletion of items as well. Those who are working on high performance and big data systems will find this as a handy tool in their toolkit.
However, the curiosity piques when we consider how widely used and popular languages such as Python incorporate these sophisticated data structures. In the next article of this series, I will show you how Python implements and uses bloom filters in its implementation of strings. Not only is that a clever implementation, but you will also learn a bit about how strings work internally in Python, so stay tuned!
As always, leave your feedback in comments, like the article and share it with your peers.
If I'm reading your _hash function properly I think all of the returned hashes will be consecutive values wrapping around at self.size. The seed doesn't contribute to the generated hash value and as _hash is called with the same item value and consecutive seeds are used it will result in a continuous block of bits being set to 1.
Is that the intention? That doesn't make sense to me.
Hey Abhinav,
great post! I've an article about bloom filter on my private website. But I really liked your version with Python code.