Class ImmutableExternalPrefixMap

  • All Implemented Interfaces:
    Function<CharSequence,​Long>, Object2LongFunction<CharSequence>, PrefixMap<MutableString>, StringMap<MutableString>, Serializable, Function<CharSequence,​Long>, ToLongFunction<CharSequence>

    public class ImmutableExternalPrefixMap
    extends AbstractPrefixMap
    implements Serializable
    An immutable prefix map mostly stored in external memory.

    Instances of this class write on a dump stream the string in the domain of the prefix map. More precisely, the dump stream is formed by blocks, and each block (with user-definable length, possibly the size of a basic disk I/O operation) is filled as much as possible with strings front coded and compressed with a HuTuckerCodec. Each block starts with the length of the first string in unary, followed by the encoding of the string. Then, for each string we write in unary the length of the common prefix (in characters) with the previous string, the length of the remaining suffix (in characters) and finally the encoded suffix. Note that if the encoding of a string is longer than a block, the string will occupy more than one block.

    We keep track using an ImmutableBinaryTrie of the strings at the start of each block in Hu–Tucker coding: thus, we are able to retrieve the interval corresponding to a given prefix by calling getApproximatedInterval() and scanning at most two blocks.

    Self-contained or non-self-contained

    There are two kinds of external prefix maps: self-contained and non-self-contained. In the first case, you get a serialised object that you can load at any time. The dump stream is serialised with the object and expanded at each deserialisation in the Java temporary directory. If you deserialise a map several times, you will get correspondingly many copies of the dump stream in the temporary directory. The dump streams are deleted when the JVM exits. This mechanism is not very efficient, but since this class implements several interfaces it is essential that clients can make things work in a standard way.

    Alternatively, you can give at creation time a filename for the dump stream. The resulting non-self-contained external prefix map can be serialised, but after deserialisation you need to set the dump stream filename or even directly the dump stream (for instance, to an input bit stream wrapping a byte array where the dump stream has been loaded). You can deserialise many copies of an external prefix map, letting all copies share the same dump stream.

    This data structure is not synchronised, and concurrent reads may cause problems because of clashes in the usage of the underlying input bit stream. It would not be a good idea in any case to open a new stream for each caller, as that would certainly lead to disk thrashing.

    The main method of this class helps in building large external prefix maps.

    Since:
    0.9.3
    Author:
    Sebastiano Vigna
    See Also:
    Serialized Form
    • Field Detail

      • STD_BLOCK_SIZE

        public static final int STD_BLOCK_SIZE
        The standard block size (in bytes).
        See Also:
        Constant Field Values
      • CACHE_MAX_SIZE

        public static final int CACHE_MAX_SIZE
        The maximum number of entry in the cache map.
        See Also:
        Constant Field Values
      • intervalApproximator

        protected final ImmutableBinaryTrie<CharSequence> intervalApproximator
        The in-memory data structure used to approximate intervals..
      • blockSize

        protected final long blockSize
        The block size of this (in bits).
      • decoder

        protected final Decoder decoder
        A decoder used to read data from the dump stream.
      • symbol2char

        protected final char[] symbol2char
        A map (given by an array) from symbols in the coder to characters.
      • char2symbol

        protected final Char2IntOpenHashMap char2symbol
        A map from characters to symbols of the coder.
      • size

        protected final int size
        The number of terms in this map.
      • blockStart

        protected final int[] blockStart
        The index of the first word in each block, plus an additional entry containing size.
      • blockOffset

        protected final int[] blockOffset
        An array parallel to blockStart giving the offset in blocks in the dump file of the corresponding word in blockStart. If there are no overflows, this will just be an initial segment of the natural numbers, but overflows cause jumps.
      • selfContained

        protected final boolean selfContained
        Whether this map is self-contained.
      • iteratorIsUsable

        protected transient boolean iteratorIsUsable
        If true, the creation of the last DumpStreamIterator was not followed by a call to any get method.
      • dumpStream

        protected transient InputBitStream dumpStream
        A reference to the dump stream.
    • Constructor Detail

      • ImmutableExternalPrefixMap

        public ImmutableExternalPrefixMap​(Iterable<? extends CharSequence> terms,
                                          int blockSizeInBytes,
                                          CharSequence dumpStreamFilename)
                                   throws IOException
        Creates an external prefix map with specified block size and dump stream.

        This constructor does not assume that CharSequence instances returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

        Parameters:
        terms - an iterable whose iterator will enumerate in lexicographical order the terms for the map.
        blockSizeInBytes - the block size (in bytes).
        dumpStreamFilename - the name of the dump stream, or null for a self-contained map.
        Throws:
        IOException
      • ImmutableExternalPrefixMap

        public ImmutableExternalPrefixMap​(Iterable<? extends CharSequence> terms,
                                          CharSequence dumpStreamFilename)
                                   throws IOException
        Creates an external prefix map with block size STD_BLOCK_SIZE and specified dump stream.

        This constructor does not assume that CharSequence instances returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

        Parameters:
        terms - a collection whose iterator will enumerate in lexicographical order the terms for the map.
        dumpStreamFilename - the name of the dump stream, or null for a self-contained map.
        Throws:
        IOException
      • ImmutableExternalPrefixMap

        public ImmutableExternalPrefixMap​(Iterable<? extends CharSequence> terms,
                                          int blockSizeInBytes)
                                   throws IOException
        Creates an external prefix map with specified block size.

        This constructor does not assume that CharSequence instances returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

        Parameters:
        blockSizeInBytes - the block size (in bytes).
        terms - a collection whose iterator will enumerate in lexicographical order the terms for the map.
        Throws:
        IOException
      • ImmutableExternalPrefixMap

        public ImmutableExternalPrefixMap​(Iterable<? extends CharSequence> terms)
                                   throws IOException
        Creates an external prefix map with block size STD_BLOCK_SIZE.

        This constructor does not assume that CharSequence instances returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

        Parameters:
        terms - a collection whose iterator will enumerate in lexicographical order the terms for the map.
        Throws:
        IOException