public class BloomFilter<T> extends Object implements Serializable, Size64
Instances of this class represent a set of elements (with false positives) using a Bloom filter (Burton H. Bloom, “Space/time tradeoffs in hash coding with allowable errors”, Comm. ACM 13(7):422−426, 1970). Because of the way Bloom filters work, you cannot remove elements.
Given a maximum number of elements, Bloom filters have an expected error rate that depends on the number of hash functions used. More precisely, a Bloom filter for at most n elements with d hash functions will use ln 2 dn ≈ 1.44 dn bits; false positives will happen with probability 2^{d}. Adding more than n elements will result in a higher rate of false positives. You can conveniently build a filter by specifying the number of hash function, by specifying the expected false positive rate, or just requesting essentially no false positives.
The maximum number of bits supported is 137438953408L, which makes it possible to store with high precision several dozens billion elements.
This class exports access methods that are similar to those of Set
, but it
does not implement that interface, as too many nonoptional methods would be unimplementable
(e.g., iterators). To store generic objects of type T
, we rely on MurmurHash3
and Google Guava's funnels, which are strategies turning an object into a sequence of bytes.
There are predefined methods for storing character sequences,
byte and character arrays,
integers and longs; they use the readymade funnels available in Funnels
.
You can define your own Funnel
and store objects correspondingly.
If you intend to storage sequences of bytes which are already random looking (e.g., MD5 digests) you can
use the addHash(byte[])
/containsHash(byte[])
methods, which use the first 16 bytes
of the argument byte array directly, without further hashing.
If you plan to use predefined methods only, the suggested way to instantiate this class is to use the
factory methods without Funnel
arguments (e.g., create(long, int)
). They
return a filter of generic type Void
using a null
funnel
that will be never invoked (unless you circument the generics typesafety mechanisms).
A main method makes it easy to create serialized Bloom filters starting from a list of strings.
To generate several hash functions we use the technique suggested by Adam Kirsch and Michael Mitzenmacher in “Less hashing,
same performance: Building a better Bloom filter”, Random Structures &
Algorithms, 33(2):187−218, John Wiley & Sons, 2008. Two 64bit hashes h_{0} and
h_{1} are generated using Hashing.murmur3_128()
(or taken from the argument,
in case of addHash(byte[])
/containsHash(byte[])
). Then, the hash
function of index i is simply h_{0} + ih_{1} (i ≥ 0). The
paper proves that this choice does not worsen the rate of false positives.
Modifier and Type  Field and Description 

static Funnel<byte[]> 
BYTE_ARRAY_FUNNEL

static Funnel<Integer> 
INTEGER_FUNNEL

static Funnel<Long> 
LONG_FUNNEL

static long 
MAX_BITS
The maximum number of bits in a filter (limited by array size and bits in a long).

static Funnel<CharSequence> 
STRING_FUNNEL

Modifier  Constructor and Description 

protected 
BloomFilter(long n,
int d,
Funnel<T> funnel)
Creates a new Bloom filter with given number of hash functions and expected number of elements of given type.

Modifier and Type  Method and Description 

boolean 
add(byte[] a)
Adds a byte array to this filter.

boolean 
add(char[] a)
Adds a character array to this filter.

boolean 
add(CharSequence s)
Adds a character sequence to this filter.

boolean 
add(int x)
Adds an integer to this filter.

boolean 
add(long x)
Adds a long to this filter.

boolean 
add(T e)
Adds an object of generic type to this filter using the funnel specified at construction time.

<V> boolean 
add(V e,
Funnel<V> funnel)
Adds an object to this filter using a specified funnel.

boolean 
addHash(byte[] hash)
Adds a hash code to this filter.

void 
clear()
Clears this filter.

boolean 
contains(byte[] a)
Checks whether the given byte array is in this filter.

boolean 
contains(char[] a)
Checks whether the given character array is in this filter.

boolean 
contains(CharSequence s)
Checks whether the given character sequence is in this filter.

boolean 
contains(int x)
Adds an integer is in this filter.

boolean 
contains(long x)
Checks whether the given long is in this filter.

boolean 
contains(T e)
Checks whether an object of generic type is in this filter using the funnel specified at construction time.

boolean 
containsHash(byte[] hash)
Checks whether a hash code is in this filter.

static BloomFilter<Void> 
create(long n)
Creates a new highprecision Bloom filter a given expected number of elements.

static BloomFilter<Void> 
create(long n,
double precision)
Creates a new Bloom filter on
Void with given precision and expected number of elements. 
static <T> BloomFilter<T> 
create(long n,
double precision,
Funnel<T> funnel)
Creates a new Bloom filter on
Void with given precision and expected number of elements of given type. 
static <T> BloomFilter<T> 
create(long n,
Funnel<T> funnel)
Creates a new highprecision Bloom filter a given expected number of elements of given type.

static BloomFilter<Void> 
create(long n,
int d)
Creates a new Bloom filter with given number of hash functions and expected number of elements.

static <T> BloomFilter<T> 
create(long n,
int d,
Funnel<T> funnel)
Creates a new Bloom filter with given number of hash functions and expected number of elements of given type.

static void 
main(String[] arg) 
int 
size()
Deprecated.

long 
size64()
Returns the size of this filter.

public static final Funnel<byte[]> BYTE_ARRAY_FUNNEL
public static final Funnel<CharSequence> STRING_FUNNEL
public static final long MAX_BITS
protected BloomFilter(long n, int d, Funnel<T> funnel)
n
 the expected number of elements.d
 the number of hash functions; if no more than n
elements are added to this filter,
false positives will happen with probability 2^{d}.funnel
 a funnel for the elements of this filter.public static <T> BloomFilter<T> create(long n, Funnel<T> funnel)
This constructor uses a number of hash functions that is logarithmic in the number of expected elements. This usually results in no false positives at all.
n
 the expected number of elements.funnel
 a funnel for the elements of this filter (use create(long)
if you
plan on using only the predefined methods).public static <T> BloomFilter<T> create(long n, int d, Funnel<T> funnel)
n
 the expected number of elements.d
 the number of hash functions; if no more than n
elements are added to this filter,
false positives will happen with probability 2^{d}.funnel
 a funnel for the elements of this filter (use create(long, int)
if you
plan on using only the predefined methods).public static <T> BloomFilter<T> create(long n, double precision, Funnel<T> funnel)
Void
with given precision and expected number of elements of given type.n
 the expected number of elements.precision
 the expected fraction of false positives; if no more than n
elements are added to this filter,
false positives will happen with no more than this probability.
plan on using only the predefined methods).funnel
 a funnel for the elements of this filter (use create(long, double)
if you
plan on using only the predefined methods).public static BloomFilter<Void> create(long n)
Filters created using this method will be accessible using predefined methods only.
Use create(long, Funnel)
if you need a generic filter.
This constructor uses a number of hash functions that is logarithmic in the number of expected elements. This usually results in no false positives at all.
n
 the expected number of elements.public static BloomFilter<Void> create(long n, int d)
Filters created using this method will be accessible using predefined methods only.
Use create(long, int, Funnel)
if you need a generic filter.
n
 the expected number of elements.d
 the number of hash functions; if no more than n
elements are added to this filter,
false positives will happen with probability 2^{d}.public static BloomFilter<Void> create(long n, double precision)
Void
with given precision and expected number of elements.
Filters created using this method will be accessible using predefined methods only.
Use create(long, double, Funnel)
if you need a generic filter.
n
 the expected number of elements.precision
 the expected fraction of false positives; if no more than n
elements are added to this filter,
false positives will happen with no more than this probability.
plan on using only the predefined methods).BloomFilter(long, int, Funnel)
public boolean add(CharSequence s)
s
 a character sequence.public boolean add(byte[] a)
a
 a byte array.public boolean add(char[] a)
a
 a character array.public boolean add(int x)
x
 an integer.public boolean add(long x)
x
 a long.public boolean add(T e)
e
 an object.public <V> boolean add(V e, Funnel<V> funnel)
e
 an object.funnel
 a funnel for object
.public boolean addHash(byte[] hash)
This method uses the first 16 bytes of a byte array to build two 64bit hashes. The intended usage is storing digests and similar alreadyhashed values.
hash
 a byte array of at least 16 elements containing a hash code.ArrayIndexOutOfBoundsException
 if hash
is shorter than 16.containsHash(byte[])
public boolean contains(CharSequence s)
Note that this method may return true on a character sequence that has never been added to this filter. This will happen with probability 2^{d}, where d is the number of hash functions specified at creation time, if the number of the elements in this filter is less than n, the number of expected elements specified at creation time.
s
 a character sequence.s
is in this filter.public boolean contains(byte[] a)
a
 a byte array.a
is in this filter.contains(CharSequence)
public boolean contains(char[] a)
a
 a character array.a
is in this filter.contains(CharSequence)
public boolean contains(int x)
x
 an integer.x
is in this filter.contains(CharSequence)
public boolean contains(long x)
x
 a long.x
is in this filter.contains(CharSequence)
public boolean contains(T e)
e
 an element.e
is in this filter.contains(CharSequence)
public boolean containsHash(byte[] hash)
This method uses the first 16 bytes of a byte array to build two 64bit hashes. The intended usage is storing digests and similar alreadyhashed values.
hash
 a byte array of at least 16 elements containing a hash code.hash
is in this filter.ArrayIndexOutOfBoundsException
 if hash
is shorter than 16.addHash(byte[])
public void clear()
public long size64()
Note that the size of a Bloom filter is only a lower bound for the number of distinct elements that have been added to this filter. False positives might make the number returned by this method smaller than it should be.
@Deprecated public int size()
public static void main(String[] arg) throws IOException, JSAPException, NoSuchMethodException