If I'm reading your _hash function properly I think all of the returned hashes will be consecutive values wrapping around at self.size. The seed doesn't contribute to the generated hash value and as _hash is called with the same item value and consecutive seeds are used it will result in a continuous block of bits being set to 1.
Is that the intention? That doesn't make sense to me.
You are right, Kyle. In a real-world implementation I would have used something like MurMur hash which takes a seed as input to serve the need of multiple hash functions required by the Bloom filter. But I wanted to keep it simple and avoid dependencies, in case someone just wanted to play around with this code. In doing so, I made it too simplistic.
However, I have improved it a bit by adding more hash functions. Thanks for noticing!
Good post, I had a little fun whipping up some simple Python classes for plain and counting Bloom filters. Question: Why bother with the bit vector when using counters? Can't you just test to see if a counter is zero?
Hey @Erffrfez, thanks for noticing. The underlying LaTeX that I entered is same as you provided. It seems in the rendered version when you double click on the formula, substack adds an extra backslash at the end. Most probably a bug on their end.
If I'm reading your _hash function properly I think all of the returned hashes will be consecutive values wrapping around at self.size. The seed doesn't contribute to the generated hash value and as _hash is called with the same item value and consecutive seeds are used it will result in a continuous block of bits being set to 1.
Is that the intention? That doesn't make sense to me.
You are right, Kyle. In a real-world implementation I would have used something like MurMur hash which takes a seed as input to serve the need of multiple hash functions required by the Bloom filter. But I wanted to keep it simple and avoid dependencies, in case someone just wanted to play around with this code. In doing so, I made it too simplistic.
However, I have improved it a bit by adding more hash functions. Thanks for noticing!
OK, I had thought you had intended to append the seed to what was about to be hashed to produce pseudo random result.
I was expecting the seed to get like this (using the same hash algorithm with every call):
hasher.update((data+str(seed)).encode('utf-8'))
Yeah, that was also an option. This one didn't cross my mind.
Hey Abhinav,
great post! I've an article about bloom filter on my private website. But I really liked your version with Python code.
Thank you, NK! I appreciate it.
BTW, I was just checking out your substack. You are performing phenomenally well :)
Thank you. I think your newsletter has amazing content. I love the fact that you're also into Python and Unix. Great stuff.
Thank you! Those are my favorite topics :)
Btw, I recommended your substack. If you find my substack interesting, I would appreciate if you recommended mine as well. :-)
Good post, I had a little fun whipping up some simple Python classes for plain and counting Bloom filters. Question: Why bother with the bit vector when using counters? Can't you just test to see if a counter is zero?
And now I finally get this xkcd comic: https://xkcd.com/2934/
The math equation has an extraneous / at the end.
(\begin{equation} \text{False Positive Rate} \approx \left(1 - e^{-\left(\frac{kn}{m}\right)^k}\right) \end{equation} \)
I believe should be
(\begin{equation} \text{False Positive Rate} \approx \left(1 - e^{-\left(\frac{kn}{m}\right)^k}\right) \end{equation})
Hey @Erffrfez, thanks for noticing. The underlying LaTeX that I entered is same as you provided. It seems in the rendered version when you double click on the formula, substack adds an extra backslash at the end. Most probably a bug on their end.