Class HyperLogLogCounterArray

  • All Implemented Interfaces:
    Serializable

    public class HyperLogLogCounterArray
    extends Object
    implements Serializable
    An array of approximate sets each represented using a HyperLogLog counter.

    HyperLogLog counters represent the number of elements of a set in an approximate way. They have been introduced by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Freédeéric Meunier in “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”, Proceedings of the 13th conference on analysis of algorithm (AofA 07), pages 127−146, 2007. They are an improvement over the basic idea of loglog counting, introduced by Marianne Durand and Philippe Flajolet in “Loglog counting of large cardinalities”, ESA 2003, 11th Annual European Symposium, volume 2832 of Lecture Notes in Computer Science, pages 605−617, Springer, 2003.

    Each counter is composed by m registers, and each register is made of registerSize bits. The first number depends on the desired relative standard deviation, and its logarithm can be computed using log2NumberOfRegisters(double), whereas the second number depends on an upper bound on the number of distinct elements to be counted, and it can be computed using registerSize(long).

    Actually, this class implements an array of counters. Each counter is completely independent, but they all use the same hash function. The reason for this design is that in our intended applications hundred of millions of counters are common, and the JVM overhead to create such a number of objects would be unbearable. This class allocates an array of LongArrayBitVectors, each containing CHUNK_SIZE registers, and can thus handle billions of billions of registers efficiently (in turn, this means being able to handle an array of millions of billions of high-precision counters).

    When creating an instance, you can choose the size of the array (i.e., the number of counters) and the desired relative standard deviation (either explicitly or choosing the number of registers per counter). Then, you can add an element to a counter. At any time, you can count count (approximately) the number of distinct elements that have been added to a counter.

    If you need to reuse this class multiple times, you can clear all registers, possibly setting a new seed. The seed is used to compute the hash function used by the HyperLogLog counters.

    Utility methods

    This class provides a number of utility methods that make it possible to extract a counter as an array of longs, set the contents of a counter given an array of longs, and maximize quickly two counters given as arrays of longs.

    Author:
    Paolo Boldi, Sebastiano Vigna
    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected long[][] bits
      An array of arrays of longs containing all registers.
      static long CHUNK_MASK
      The mask used to obtain an register offset in a chunk.
      static int CHUNK_SHIFT
      The logarithm of the maximum size in registers of a bit vector.
      static long CHUNK_SIZE
      The maximum size in registers of a bit vector.
      int counterLongwords
      The size of a counter in longwords (ceiled if there are less then Long.SIZE registers per counter).
      protected long counterResidualMask
      A mask for the residual bits of a counter (the counterSize % Long.SIZE lowest bits).
      protected int counterShift
      The shift that selects the chunk corresponding to a counter.
      int counterSize
      The size in bits of each counter (registerSize * m).
      int log2m
      The logarithm of the number of registers per counter (at most 30).
      protected boolean longwordAligned
      Whether counters are aligned to longwords.
      protected long[] lsbMask
      A mask containing a one in the least significant bit of each register (i.e., in positions of the form registerSize * i).
      int m
      The number of registers per counter.
      protected int mMinus1
      The number of registers minus one.
      protected long[] msbMask
      A mask containing a one in the most significant bit of each register (i.e., in positions of the form registerSize * (i + 1) - 1).
      protected LongBigList[] registers
      registerSize-bit views of bits.
      int registerSize
      The size in bits of each register.
      protected long seed
      A seed for hashing.
    • Constructor Summary

      Constructors 
      Constructor Description
      HyperLogLogCounterArray​(long arraySize, long n, double rsd)
      Creates a new array of counters.
      HyperLogLogCounterArray​(long arraySize, long n, int log2m)
      Creates a new array of counters.
      HyperLogLogCounterArray​(long arraySize, long n, int log2m, long seed)
      Creates a new array of counters.
    • Method Summary

      Modifier and Type Method Description
      void add​(long k, long v)
      Adds an element to a counter.
      int chunk​(long counter)
      Returns the chunk of a given counter.
      void clear()
      Clears all registers.
      void clear​(long seed)
      Clears all registers and sets a new seed (e.g., using Util.randomSeed()).
      double count​(long k)
      Estimates the number of distinct elements that have been added to a given counter so far.
      double count​(long[] bits, long offset)
      Estimates the number of distinct elements that have been added to a given counter so far.
      void getCounter​(long[] chunkBits, long index, long[] dest)
      Extracts a counter from this HyperLogLogCounterArray and writes it into an array of longs.
      void getCounter​(long index, long[] dest)
      Extracts a counter from this HyperLogLogCounterArray and writes it into an array of longs.
      static int log2NumberOfRegisters​(double rsd)
      Returns the logarithm of the number of registers per counter that are necessary to attain a given relative standard deviation.
      void max​(long[] x, long[] y)
      Computes the register-by-register maximum of two counters.
      void max​(long[] x, long[] y, long[] accumulator, long[] mask)
      Computes the register-by-register maximum of two counters.
      long offset​(long counter)
      Returns the bit offset of a given counter in its chunk.
      LongBigList[] registers()
      Returns the array of big lists of registers underlying this array of counters.
      static int registerSize​(long n)
      Returns the register size in bits, given an upper bound on the number of distinct elements.
      static double relativeStandardDeviation​(int log2m)
      Returns the relative standard deviation corresponding to a given logarithm of the number of registers per counter.
      void setCounter​(long[] source, long index)
      Sets the contents of a counter of this HyperLogLogCounterArray using a provided array of longs.
      void setCounter​(long[] source, long[] chunkBits, long index)
      Sets the contents of a counter of this HyperLogLogCounterArray using a provided array of longs.
      void transfer​(long[] source, long[] dest, long node)
      Transfers the content of a counter between two parallel array of longwords.
    • Field Detail

      • CHUNK_SHIFT

        public static final int CHUNK_SHIFT
        The logarithm of the maximum size in registers of a bit vector.
        See Also:
        Constant Field Values
      • CHUNK_SIZE

        public static final long CHUNK_SIZE
        The maximum size in registers of a bit vector.
        See Also:
        Constant Field Values
      • CHUNK_MASK

        public static final long CHUNK_MASK
        The mask used to obtain an register offset in a chunk.
        See Also:
        Constant Field Values
      • mMinus1

        protected final int mMinus1
        The number of registers minus one.
      • bits

        protected final long[][] bits
        An array of arrays of longs containing all registers.
      • counterShift

        protected final int counterShift
        The shift that selects the chunk corresponding to a counter.
      • seed

        protected long seed
        A seed for hashing.
      • longwordAligned

        protected final boolean longwordAligned
        Whether counters are aligned to longwords.
      • counterResidualMask

        protected final long counterResidualMask
        A mask for the residual bits of a counter (the counterSize % Long.SIZE lowest bits).
      • msbMask

        protected final long[] msbMask
        A mask containing a one in the most significant bit of each register (i.e., in positions of the form registerSize * (i + 1) - 1).
      • lsbMask

        protected final long[] lsbMask
        A mask containing a one in the least significant bit of each register (i.e., in positions of the form registerSize * i).
      • log2m

        public final int log2m
        The logarithm of the number of registers per counter (at most 30).
      • m

        public final int m
        The number of registers per counter.
      • registerSize

        public final int registerSize
        The size in bits of each register.
      • counterSize

        public final int counterSize
        The size in bits of each counter (registerSize * m).
      • counterLongwords

        public final int counterLongwords
        The size of a counter in longwords (ceiled if there are less then Long.SIZE registers per counter).
    • Constructor Detail

      • HyperLogLogCounterArray

        public HyperLogLogCounterArray​(long arraySize,
                                       long n,
                                       double rsd)
        Creates a new array of counters.
        Parameters:
        arraySize - the number of counters.
        n - the expected number of elements.
        rsd - the relative standard deviation.
      • HyperLogLogCounterArray

        public HyperLogLogCounterArray​(long arraySize,
                                       long n,
                                       int log2m)
        Creates a new array of counters.
        Parameters:
        arraySize - the number of counters.
        n - the expected number of elements.
        log2m - the logarithm of the number of registers per counter (at most 30).
      • HyperLogLogCounterArray

        public HyperLogLogCounterArray​(long arraySize,
                                       long n,
                                       int log2m,
                                       long seed)
        Creates a new array of counters.
        Parameters:
        arraySize - the number of counters.
        n - the expected number of elements.
        log2m - the logarithm of the number of registers per counter (at most 30).
        seed - the seed used to compute the hash function.
    • Method Detail

      • log2NumberOfRegisters

        public static int log2NumberOfRegisters​(double rsd)
        Returns the logarithm of the number of registers per counter that are necessary to attain a given relative standard deviation.
        Parameters:
        rsd - the relative standard deviation to be attained.
        Returns:
        the logarithm of the number of registers that are necessary to attain relative standard deviation rsd.
      • relativeStandardDeviation

        public static double relativeStandardDeviation​(int log2m)
        Returns the relative standard deviation corresponding to a given logarithm of the number of registers per counter.
        Parameters:
        log2m - the logarithm of the number of registers per counter (at most 30).
        Returns:
        the resulting relative standard deviation.
      • registerSize

        public static int registerSize​(long n)
        Returns the register size in bits, given an upper bound on the number of distinct elements.
        Parameters:
        n - an upper bound on the number of distinct elements.
        Returns:
        the register size in bits.
      • chunk

        public int chunk​(long counter)
        Returns the chunk of a given counter.
        Parameters:
        counter - a counter.
        Returns:
        its chunk.
      • offset

        public long offset​(long counter)
        Returns the bit offset of a given counter in its chunk.
        Parameters:
        counter - a counter.
        Returns:
        the starting bit of the given counter in its chunk.
      • clear

        public void clear​(long seed)
        Clears all registers and sets a new seed (e.g., using Util.randomSeed()).
        Parameters:
        seed - the new seed used to compute the hash function
      • clear

        public void clear()
        Clears all registers.
      • add

        public void add​(long k,
                        long v)
        Adds an element to a counter.
        Parameters:
        k - the index of the counter.
        v - the element to be added.
      • registers

        public LongBigList[] registers()
        Returns the array of big lists of registers underlying this array of counters.

        The main purpose of this method is debugging, as it makes comparing the evolution of the state of two implementations easy.

        Returns:
        the array of big lists of registers underlying this array of counters.
      • count

        public double count​(long[] bits,
                            long offset)
        Estimates the number of distinct elements that have been added to a given counter so far.

        This is an low-level method that should be used only after having understood in detail the inner workings of this class.

        Parameters:
        bits - the bit array containing the counter.
        offset - the starting bit position of the counter in bits.
        Returns:
        an approximation of the number of distinct elements that have been added to counter so far.
      • count

        public double count​(long k)
        Estimates the number of distinct elements that have been added to a given counter so far.
        Parameters:
        k - the index of the counter.
        Returns:
        an approximation of the number of distinct elements that have been added to counter k so far.
      • setCounter

        public final void setCounter​(long[] source,
                                     long[] chunkBits,
                                     long index)
        Sets the contents of a counter of this HyperLogLogCounterArray using a provided array of longs.

        Warning: this is a low-level method. You must know what you're doing.

        Parameters:
        source - an array of at least counterLongwords longs containing a counter.
        chunkBits - the array where the counter will be stored.
        index - the index of the counter that will be filled with the provided array.
        See Also:
        getCounter(long[], long, long[])
      • setCounter

        public void setCounter​(long[] source,
                               long index)
        Sets the contents of a counter of this HyperLogLogCounterArray using a provided array of longs.

        Warning: this is a low-level method. You must know what you're doing.

        Parameters:
        source - an array of at least counterLongwords longs containing a counter.
        index - the index of the counter that will be filled with the provided array.
        See Also:
        setCounter(long[], long[], long)
      • getCounter

        public final void getCounter​(long[] chunkBits,
                                     long index,
                                     long[] dest)
        Extracts a counter from this HyperLogLogCounterArray and writes it into an array of longs.

        Warning: this is a low-level method. You must know what you're doing.

        Parameters:
        chunkBits - the array storing the counter.
        index - the index of the counter to be extracted.
        dest - an array of at least counterLongwords longs where the counter of given index will be written.
        See Also:
        setCounter(long[], long[], long)
      • getCounter

        public final void getCounter​(long index,
                                     long[] dest)
        Extracts a counter from this HyperLogLogCounterArray and writes it into an array of longs.

        Warning: this is a low-level method. You must know what you're doing.

        Parameters:
        index - the index of the counter to be extracted.
        dest - an array of at least counterLongwords longs where the counter of given index will be written.
        See Also:
        getCounter(long[], long, long[])
      • transfer

        public final void transfer​(long[] source,
                                   long[] dest,
                                   long node)
        Transfers the content of a counter between two parallel array of longwords.

        Warning: this is a low-level method. You must know what you're doing.

        Parameters:
        source - the source array.
        dest - the destination array.
        node - the node number.
      • max

        public final void max​(long[] x,
                              long[] y)
        Computes the register-by-register maximum of two counters.

        This method will allocate two temporary arrays. To reduce object creation, use max(long[], long[], long[], long[]).

        Parameters:
        x - a first array of at least counterLongwords longs containing a counter.
        y - a second array of at least counterLongwords longs containing a counter.
      • max

        public final void max​(long[] x,
                              long[] y,
                              long[] accumulator,
                              long[] mask)
        Computes the register-by-register maximum of two counters.
        Parameters:
        x - a first array of at least counterLongwords longs containing a counter.
        y - a second array of at least counterLongwords longs containing a counter.
        accumulator - a support array of at least counterLongwords longs.
        mask - a support array of at least counterLongwords longs.