11 Comments
Sep 26, 2023Liked by Abhinav Upadhyay

If I'm reading your _hash function properly I think all of the returned hashes will be consecutive values wrapping around at self.size. The seed doesn't contribute to the generated hash value and as _hash is called with the same item value and consecutive seeds are used it will result in a continuous block of bits being set to 1.

Is that the intention? That doesn't make sense to me.

Expand full comment
author

You are right, Kyle. In a real-world implementation I would have used something like MurMur hash which takes a seed as input to serve the need of multiple hash functions required by the Bloom filter. But I wanted to keep it simple and avoid dependencies, in case someone just wanted to play around with this code. In doing so, I made it too simplistic.

However, I have improved it a bit by adding more hash functions. Thanks for noticing!

Expand full comment
Sep 28, 2023Liked by Abhinav Upadhyay

OK, I had thought you had intended to append the seed to what was about to be hashed to produce pseudo random result.

I was expecting the seed to get like this (using the same hash algorithm with every call):

hasher.update((data+str(seed)).encode('utf-8'))

Expand full comment
author

Yeah, that was also an option. This one didn't cross my mind.

Expand full comment
Sep 16, 2023·edited Sep 17, 2023Liked by Abhinav Upadhyay

Hey Abhinav,

great post! I've an article about bloom filter on my private website. But I really liked your version with Python code.

Expand full comment
author

Thank you, NK! I appreciate it.

BTW, I was just checking out your substack. You are performing phenomenally well :)

Expand full comment

Thank you. I think your newsletter has amazing content. I love the fact that you're also into Python and Unix. Great stuff.

Expand full comment
author

Thank you! Those are my favorite topics :)

Btw, I recommended your substack. If you find my substack interesting, I would appreciate if you recommended mine as well. :-)

Expand full comment

Good post, I had a little fun whipping up some simple Python classes for plain and counting Bloom filters. Question: Why bother with the bit vector when using counters? Can't you just test to see if a counter is zero?

And now I finally get this xkcd comic: https://xkcd.com/2934/

Expand full comment

The math equation has an extraneous / at the end.

(\begin{equation} \text{False Positive Rate} \approx \left(1 - e^{-\left(\frac{kn}{m}\right)^k}\right) \end{equation} \)

I believe should be

(\begin{equation} \text{False Positive Rate} \approx \left(1 - e^{-\left(\frac{kn}{m}\right)^k}\right) \end{equation})

Expand full comment
author

Hey @Erffrfez, thanks for noticing. The underlying LaTeX that I entered is same as you provided. It seems in the rendered version when you double click on the formula, substack adds an extra backslash at the end. Most probably a bug on their end.

Expand full comment