If I'm reading your _hash function properly I think all of the returned hashes will be consecutive values wrapping around at self.size. The seed doesn't contribute to the generated hash value and as _hash is called with the same item value and consecutive seeds are used it will result in a continuous block of bits being set to 1.

Is that the intention? That doesn't make sense to me.

You are right, Kyle. In a real-world implementation I would have used something like MurMur hash which takes a seed as input to serve the need of multiple hash functions required by the Bloom filter. But I wanted to keep it simple and avoid dependencies, in case someone just wanted to play around with this code. In doing so, I made it too simplistic.

However, I have improved it a bit by adding more hash functions. Thanks for noticing!

Good post, I had a little fun whipping up some simple Python classes for plain and counting Bloom filters. Question: Why bother with the bit vector when using counters? Can't you just test to see if a counter is zero?

Hey @Erffrfez, thanks for noticing. The underlying LaTeX that I entered is same as you provided. It seems in the rendered version when you double click on the formula, substack adds an extra backslash at the end. Most probably a bug on their end.

If I'm reading your _hash function properly I think all of the returned hashes will be consecutive values wrapping around at self.size. The seed doesn't contribute to the generated hash value and as _hash is called with the same item value and consecutive seeds are used it will result in a continuous block of bits being set to 1.

Is that the intention? That doesn't make sense to me.

You are right, Kyle. In a real-world implementation I would have used something like MurMur hash which takes a seed as input to serve the need of multiple hash functions required by the Bloom filter. But I wanted to keep it simple and avoid dependencies, in case someone just wanted to play around with this code. In doing so, I made it too simplistic.

However, I have improved it a bit by adding more hash functions. Thanks for noticing!

OK, I had thought you had intended to append the seed to what was about to be hashed to produce pseudo random result.

I was expecting the seed to get like this (using the same hash algorithm with every call):

hasher.update((data+str(seed)).encode('utf-8'))

Yeah, that was also an option. This one didn't cross my mind.

edited Sep 17, 2023Hey Abhinav,

great post! I've an article about bloom filter on my private website. But I really liked your version with Python code.

Thank you, NK! I appreciate it.

BTW, I was just checking out your substack. You are performing phenomenally well :)

Thank you. I think your newsletter has amazing content. I love the fact that you're also into Python and Unix. Great stuff.

Thank you! Those are my favorite topics :)

Btw, I recommended your substack. If you find my substack interesting, I would appreciate if you recommended mine as well. :-)

Good post, I had a little fun whipping up some simple Python classes for plain and counting Bloom filters. Question: Why bother with the bit vector when using counters? Can't you just test to see if a counter is zero?

And now I finally get this xkcd comic: https://xkcd.com/2934/

The math equation has an extraneous / at the end.

(\begin{equation} \text{False Positive Rate} \approx \left(1 - e^{-\left(\frac{kn}{m}\right)^k}\right) \end{equation} \)

I believe should be

(\begin{equation} \text{False Positive Rate} \approx \left(1 - e^{-\left(\frac{kn}{m}\right)^k}\right) \end{equation})

Hey @Erffrfez, thanks for noticing. The underlying LaTeX that I entered is same as you provided. It seems in the rendered version when you double click on the formula, substack adds an extra backslash at the end. Most probably a bug on their end.