Probabilistic Data Structures 1

Probabilistic data structures are a group of data structures which are very much useful for big data and streaming applications because they consume a very little memory. These data structures do not provide an exact answer but provide a reasonable approximation for the answer.

Most popular probabilistic data structures

1. Bloom Filters
2. Count-min Sketch
3. HyperLogLog

Bloom Filters

This data structure is mainly used to check whether an element belongs to a set. In Bloom Filters false positive matches are possible which means that if the data structure shows that an element is not in the given set that means it is absolutely not in the set. False negatives are not possible in Bloom Filters which means if the data structure shows that an element is in a set, but realistically this might be in the set or sometimes may, not be in the set.
An empty Bloom Filter is a  bit array of m number of zeros. Different k number of hash functions also should be defined.
  • n: how many items you expect to have in your filter (e.g. 216533)
  • p: your acceptable false positive rate {0..1} (e.g. 0.01 → 1%)
 The number of bits needed in the Bloom Filter
         m = -n*ln(p) / (ln(2)^2)    
The number of hash functions
         k = m/n * ln(2)             

When adding elements to this bit array, first the element should be feed into k hash functions and get the output of k array positions. Then the bit value should be changed to 1 in k array positions.

When querying for an item, it should be first feed into k hash functions and get those k array positions. If any of these positions is zero, then the element is definitely not in the set. If the value of all bits is 1, then there are two cases exist.
  • case1: Item is in the set
  • case2: Item is not in the set, they have 1 in the corresponding array positions because they have been set to 1 during the insertion of other elements.
Simple bloom filters cannot distinguish these two cases but more advanced techniques can address this problem.
Removing an element from bloom filter is not permitted because it might remove some other elements that happen to map onto that bit. By removing elements from the bloom filter may introduce the possibility for false negatives hence it is not allowed in bloom filters.




Comments

Popular posts from this blog

Chef Deployment Tool