Bloom filter

less than 1 minute read

A cool data structure that I’ve come across in the past, but have not yet had a chance to use directly is a Bloom filter. It’s a probabilistic data structure that can act as a set, and is both space and time efficient (constant!). Bloom filters have found many useful applications, typically in preventing the occurrence of some more expensive lookup or access.

Naturally, I’ve had a go at implementing it myself. You can find the source code here (Python).

The general idea: to insert an element, run it through a number of hash functions, each generating an index into a bit array, and set the bits at those positions in the array. Then, to query for an element (i.e. is a given element a member of this set?), run it through those same hash functions and check whether all the bits at those positions in the array are set. Note: Bloom filters come with two key caveats: (i) when querying, you may get false positives, and (ii) items cannot be removed (which would introduce false negatives).

A wonderful explanation of Bloom Filters can be found here and a neat trick to generate several hash functions can be found here.