Bloom Filters and Beyond: An Illustrated…

Sep 5, 2023

Mastering the Art and Science of Bloom Filters

11 Comments

Sep 26, 2023

If I'm reading your _hash function properly I think all of the returned hashes will be consecutive values wrapping around at self.size. The seed doesn't contribute to the generated hash value and as _hash is called with the same item value and consecutive seeds are used it will result in a continuous block of bits being set to 1.

Is that the intention? That doesn't make sense to me.

Expand full comment

Reply (1)

Abhinav Upadhyay

Sep 27, 2023

You are right, Kyle. In a real-world implementation I would have used something like MurMur hash which takes a seed as input to serve the need of multiple hash functions required by the Bloom filter. But I wanted to keep it simple and avoid dependencies, in case someone just wanted to play around with this code. In doing so, I made it too simplistic.

However, I have improved it a bit by adding more hash functions. Thanks for noticing!

Expand full comment

Reply (1)

Kyle Monson

Sep 28, 2023

OK, I had thought you had intended to append the seed to what was about to be hashed to produce pseudo random result.

I was expecting the seed to get like this (using the same hash algorithm with every call):

hasher.update((data+str(seed)).encode('utf-8'))

Expand full comment

Reply (1)

Abhinav Upadhyay

Sep 28, 2023

Yeah, that was also an option. This one didn't cross my mind.

Expand full comment

Neo Kim

Sep 16, 2023Edited

Hey Abhinav,

great post! I've an article about bloom filter on my private website. But I really liked your version with Python code.

Expand full comment

Reply (1)

Abhinav Upadhyay

Sep 17, 2023

Thank you, NK! I appreciate it.

BTW, I was just checking out your substack. You are performing phenomenally well :)

Expand full comment

Reply (1)

Neo Kim

Sep 17, 2023

Thank you. I think your newsletter has amazing content. I love the fact that you're also into Python and Unix. Great stuff.

Expand full comment

Reply (1)

Abhinav Upadhyay

Sep 17, 2023

Thank you! Those are my favorite topics :)

Btw, I recommended your substack. If you find my substack interesting, I would appreciate if you recommended mine as well. :-)

Expand full comment

Wyrd Smythe

Jun 13, 2024

Good post, I had a little fun whipping up some simple Python classes for plain and counting Bloom filters. Question: Why bother with the bit vector when using counters? Can't you just test to see if a counter is zero?

And now I finally get this xkcd comic: https://xkcd.com/2934/

Expand full comment

Erffrfez

Sep 23, 2023

The math equation has an extraneous / at the end.

(\begin{equation} \text{False Positive Rate} \approx \left(1 - e^{-\left(\frac{kn}{m}\right)^k}\right) \end{equation} \)

I believe should be

(\begin{equation} \text{False Positive Rate} \approx \left(1 - e^{-\left(\frac{kn}{m}\right)^k}\right) \end{equation})

Expand full comment

Reply (1)

Abhinav Upadhyay

Sep 23, 2023

Hey @Erffrfez, thanks for noticing. The underlying LaTeX that I entered is same as you provided. It seems in the rendered version when you double click on the formula, substack adds an extra backslash at the end. Most probably a bug on their end.

Expand full comment