Class TransformationStrategies

java.lang.Object
it.unimi.dsi.bits.TransformationStrategies

public class TransformationStrategies extends Object
A class providing static methods and objects that do useful things with transformation strategies.

This class provides several transformation strategies that turn strings or other objects into bit vectors. The transformations might optionally be:

  • Lexicographical: for objects based on bytes or characters, such as strings and byte arrays, this means that the first bit of the bit vector is the most significant bit of the first byte or character, and so on. In other words, the lexicographical order between bit vectors reflects the lexicographical byte-by-byte, char-by-char, etc. order. This property is necessary for some kinds of static structures that depend on it, but it has some computational cost, as after compacting bytes or chars into a long we need to reverse the bit order of each piece.
  • Prefix-free: no two bit vectors returned by the transformation on two different objects will be comparable in prefix order. Again, this might require linear (e.g., prefixFree()) or constant (e.g., prefixFreeIso()) additional space.

As a general rule, transformations without additional naming are lexicographical. Transformations that generate prefix-free bit vectors are marked as such. Plain transformations that do not provide any guarantee are called raw. They should be used only when performance is the main issue and the two properties above are not relevant.
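
The lexicographical property can be illustrated with a small self-contained sketch that does not use this class. The names LexBitsSketch, msbFirstBits and lsbFirstBits are hypothetical, and the least-significant-bit-first layout is just one possible non-lexicographical layout, not necessarily the one used by the raw strategies. Bit vectors are modelled as strings of '0' and '1'.

```java
class LexBitsSketch {
    /** Writes each byte most significant bit first (the lexicographical layout). */
    static String msbFirstBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 8);
        for (byte b : bytes)
            for (int i = 7; i >= 0; i--) sb.append((b >>> i) & 1);
        return sb.toString();
    }

    /** Writes each byte least significant bit first (an assumed non-lexicographical layout). */
    static String lsbFirstBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 8);
        for (byte b : bytes)
            for (int i = 0; i < 8; i++) sb.append((b >>> i) & 1);
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] one = { 1 }, two = { 2 };
        // MSB first: "00000001" sorts before "00000010", matching 1 < 2.
        System.out.println(msbFirstBits(one).compareTo(msbFirstBits(two)) < 0);
        // LSB first: "10000000" sorts after "01000000", so the byte order is lost.
        System.out.println(lsbFirstBits(one).compareTo(lsbFirstBits(two)) > 0);
    }
}
```

In the most-significant-bit-first layout, comparing bit strings agrees with comparing the bytes as unsigned values; the other layout avoids the per-byte reversal but gives up that agreement.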

  • Constructor Details

    • TransformationStrategies

      public TransformationStrategies()
  • Method Details

    • identity

      public static <T extends BitVector> TransformationStrategy<T> identity()
      A trivial transformation for data already in BitVector form.
    • rawUtf32

      public static <T extends CharSequence> TransformationStrategy<T> rawUtf32()
      A trivial raw transformation from strings to bit vectors that turns the UTF-16 representation into a UTF-32 representation, decodes surrogate pairs and concatenates the bits of the UTF-32 representation.

      Warning: this transformation is not lexicographic.

    • utf32

      public static <T extends CharSequence> TransformationStrategy<T> utf32()
      A transformation from strings to bit vectors that turns the UTF-16 representation into a UTF-32 representation, decodes surrogate pairs and concatenates the bits of the UTF-32 representation.
    • prefixFreeUtf32

      public static <T extends CharSequence> TransformationStrategy<T> prefixFreeUtf32()
      A transformation from strings to bit vectors that turns the UTF-16 representation into a UTF-32 representation, decodes surrogate pairs, concatenates the bits of the UTF-32 representation and completes the representation with a NUL to guarantee lexicographical ordering and prefix-freeness.

      Note that strings provided to this strategy must not contain NULs.
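
Why a trailing NUL gives prefix-freeness can be seen in a minimal sketch that does not call this library. PrefixFreeUtf32Sketch and utf32Bits are hypothetical names, bit vectors are modelled as strings of '0' and '1', and the terminator is assumed to occupy one full 32-bit unit.

```java
class PrefixFreeUtf32Sketch {
    /** Concatenates the 32-bit code points of s, optionally followed by a 32-bit NUL. */
    static String utf32Bits(String s, boolean terminate) {
        StringBuilder sb = new StringBuilder();
        // codePoints() merges each surrogate pair into a single code point.
        s.codePoints().forEach(cp -> appendBits(sb, cp));
        if (terminate) appendBits(sb, 0);
        return sb.toString();
    }

    /** Appends the 32 bits of value, most significant bit first. */
    private static void appendBits(StringBuilder sb, int value) {
        for (int i = 31; i >= 0; i--) sb.append((value >>> i) & 1);
    }

    public static void main(String[] args) {
        // Without the terminator, the bits of "ab" are a prefix of the bits of "abc".
        System.out.println(utf32Bits("abc", false).startsWith(utf32Bits("ab", false)));
        // With the terminator they are not, because NUL differs from 'c'.
        System.out.println(utf32Bits("abc", true).startsWith(utf32Bits("ab", true)));
    }
}
```

If an input string contained a NUL itself, the argument would break: "a" followed by the terminator would be a prefix of the encoding of "a\u0000b". This is why such strings must be excluded.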

    • rawUtf16

      public static <T extends CharSequence> TransformationStrategy<T> rawUtf16()
      A trivial, high-performance, raw transformation from strings to bit vectors that concatenates the bits of the UTF-16 representation.

      Warning: this transformation is not lexicographic.

      Warning: bit vectors returned by this strategy are adaptors around the original string. If the string changes while the bit vector is being accessed, the results will be unpredictable.

    • utf16

      public static <T extends CharSequence> TransformationStrategy<T> utf16()
      A trivial transformation from strings to bit vectors that concatenates the bits of the UTF-16 representation.

      Warning: bit vectors returned by this strategy are adaptors around the original string. If the string changes while the bit vector is being accessed, the results will be unpredictable.
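
The adaptor warning can be made concrete with a hypothetical view class. Utf16ViewSketch is not part of this library; it only mimics the idea of reading bits on demand from the underlying character sequence, and the most-significant-bit-first layout within each char is an assumption.

```java
class Utf16ViewSketch {
    private final CharSequence s;

    Utf16ViewSketch(CharSequence s) { this.s = s; }

    /** Number of bits exposed by the view. */
    long length() { return (long) s.length() * Character.SIZE; }

    /** Reads bit index directly from the backing sequence; nothing is copied. */
    boolean getBoolean(long index) {
        char c = s.charAt((int) (index / Character.SIZE));
        int shift = Character.SIZE - 1 - (int) (index % Character.SIZE);
        return ((c >>> shift) & 1) != 0;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("A"); // 'A' is 0x0041
        Utf16ViewSketch view = new Utf16ViewSketch(sb);
        System.out.println(view.getBoolean(9));    // bit 6 of 0x0041 is set
        sb.setCharAt(0, '\u0001');                 // mutate the backing sequence
        System.out.println(view.getBoolean(9));    // the same view now reports a cleared bit
    }
}
```

With an immutable String the problem cannot arise; it matters for mutable sequences such as StringBuilder.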

    • prefixFreeUtf16

      public static <T extends CharSequence> TransformationStrategy<T> prefixFreeUtf16()
      A trivial transformation from strings to bit vectors that concatenates the bits of the UTF-16 representation and completes the representation with a NUL to guarantee lexicographical ordering and prefix-freeness.

      Note that strings provided to this strategy must not contain NULs.

      Warning: bit vectors returned by this strategy are adaptors around the original string. If the string changes while the bit vector is being accessed, the results will be unpredictable.

    • rawIso

      public static <T extends CharSequence> TransformationStrategy<T> rawIso()
      A trivial, high-performance, raw transformation from strings to bit vectors that concatenates the lower eight bits of the UTF-16 representation.

      Warning: this transformation is not lexicographic.

      Note that this transformation is sensible only for strings that are known to contain just characters in the ISO-8859-1 charset.

      Warning: bit vectors returned by this strategy are adaptors around the original string. If the string changes while the bit vector is being accessed, the results will be unpredictable.

    • iso

      public static <T extends CharSequence> TransformationStrategy<T> iso()
      A trivial transformation from strings to bit vectors that concatenates the lower eight bits of the UTF-16 representation.

      Note that this transformation is sensible only for strings that are known to contain just characters in the ISO-8859-1 charset.

      Warning: bit vectors returned by this strategy are adaptors around the original string. If the string changes while the bit vector is being accessed, the results will be unpredictable.
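
The ISO-8859-1 restriction follows from keeping only the lower eight bits of each char. The sketch below is plain Java and independent of this library; IsoLowBitsSketch and lowEightBits are hypothetical names. It shows two different strings collapsing to the same bits once a character outside that charset is involved.

```java
class IsoLowBitsSketch {
    /** Concatenates the lower eight bits of each char, most significant bit first. */
    static String lowEightBits(CharSequence s) {
        StringBuilder sb = new StringBuilder(s.length() * 8);
        for (int k = 0; k < s.length(); k++) {
            int low = s.charAt(k) & 0xFF; // the upper eight bits are discarded
            for (int i = 7; i >= 0; i--) sb.append((low >>> i) & 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowEightBits("A"));      // 'A' is 0x41
        // U+0141 has lower eight bits 0x41, so it collides with "A".
        System.out.println(lowEightBits("\u0141").equals(lowEightBits("A")));
    }
}
```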

    • prefixFreeIso

      public static <T extends CharSequence> TransformationStrategy<T> prefixFreeIso()
      A trivial transformation from strings to bit vectors that concatenates the lower eight bits of the UTF-16 representation and completes the representation with an ASCII NUL to guarantee lexicographical ordering and prefix-freeness.

      Note that this transformation is sensible only for strings that are known to contain just characters in the ISO-8859-1 charset, and that strings provided to this strategy must not contain ASCII NULs.

      Warning: bit vectors returned by this strategy are adaptors around the original string. If the string changes while the bit vector is being accessed, the results will be unpredictable.

    • rawByteArray

      public static TransformationStrategy<byte[]> rawByteArray()
      A trivial, high-performance, raw transformation from byte arrays to bit vectors that simply concatenates the bytes of the array.

      Warning: this transformation is not lexicographic.

      Warning: bit vectors returned by this strategy are adaptors around the original array. If the array changes while the bit vector is being accessed, the results will be unpredictable.

    • byteArray

      public static TransformationStrategy<byte[]> byteArray()
      A lexicographical transformation from byte arrays to bit vectors.

      Warning: bit vectors returned by this strategy are adaptors around the original array. If the array changes while the bit vector is being accessed, the results will be unpredictable.

    • prefixFreeByteArray

      public static TransformationStrategy<byte[]> prefixFreeByteArray()
      A lexicographical transformation from byte arrays to bit vectors that completes the representation with a zero to guarantee lexicographical ordering and prefix-freeness provided the byte arrays do not contain zeros.

      This transformation is mainly intended for byte arrays representing ASCII strings in compact form.

      Warning: bit vectors returned by this strategy are adaptors around the original array. If the array changes while the bit vector is being accessed, the results will be unpredictable.

    • wrap

      public static <T> Iterator<BitVector> wrap(Iterator<T> iterator, TransformationStrategy<? super T> transformationStrategy)
      Wraps a given iterator, returning an iterator that emits bit vectors.
      Parameters:
      iterator - an iterator.
      transformationStrategy - a strategy to transform the object returned by iterator.
      Returns:
      an iterator that emits the content of iterator passed through transformationStrategy.
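
The general shape of such a wrapper can be sketched with standard library types only. WrapIteratorSketch is a hypothetical name, and java.util.function.Function stands in for TransformationStrategy; this is not the library's implementation.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

class WrapIteratorSketch {
    /** Returns an iterator that applies transform to each element as it is consumed. */
    static <T, R> Iterator<R> wrap(Iterator<T> iterator, Function<? super T, ? extends R> transform) {
        return new Iterator<R>() {
            @Override public boolean hasNext() { return iterator.hasNext(); }
            @Override public R next() { return transform.apply(iterator.next()); }
        };
    }

    public static void main(String[] args) {
        Iterator<Integer> lengths = wrap(Arrays.asList("a", "bcd").iterator(), String::length);
        while (lengths.hasNext()) System.out.println(lengths.next());
    }
}
```

The lower-bounded wildcard on the function's input mirrors the `? super T` bound in the signature above: a strategy written for a supertype can be reused for iterators over any subtype.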
    • wrap

      public static <T> Iterable<BitVector> wrap(Iterable<T> iterable, TransformationStrategy<? super T> transformationStrategy)
      Wraps a given iterable, returning an iterable that contains bit vectors.
      Parameters:
      iterable - an iterable.
      transformationStrategy - a strategy to transform the object contained in iterable.
      Returns:
      an iterable that has the content of iterable passed through transformationStrategy.
    • wrap

      public static <T> List<BitVector> wrap(List<T> list, TransformationStrategy<? super T> transformationStrategy)
      Wraps a given list, returning a list that contains bit vectors.
      Parameters:
      list - a list.
      transformationStrategy - a strategy to transform the object contained in list.
      Returns:
      a list that has the content of list passed through transformationStrategy.
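
One way to build a list wrapper of this kind is a read-only view over the original list. The sketch below uses only the standard library; WrapListSketch is a hypothetical name, Function stands in for TransformationStrategy, and whether the library's own wrapper is a lazy view like this one is an assumption.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class WrapListSketch {
    /** Returns an unmodifiable view that transforms elements on each access. */
    static <T, R> List<R> wrap(List<T> list, Function<? super T, ? extends R> transform) {
        return new AbstractList<R>() {
            @Override public R get(int index) { return transform.apply(list.get(index)); }
            @Override public int size() { return list.size(); }
        };
    }

    public static void main(String[] args) {
        List<String> backing = new ArrayList<>();
        backing.add("a");
        List<Integer> lengths = wrap(backing, String::length);
        backing.add("bcd");               // the view reflects later changes to the backing list
        System.out.println(lengths.size());
        System.out.println(lengths.get(1));
    }
}
```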
    • prefixFree

      public static <T extends BitVector> TransformationStrategy<T> prefixFree()
      A transformation from bit vectors to bit vectors that guarantees that its results are prefix free.

      More in detail, we map 0 to 10, 1 to 11, and we add a 0 at the end of all strings.

      Warning: bit vectors returned by this strategy are adaptors around the original bit vector. If the original bit vector changes while the returned bit vector is being accessed, the results will be unpredictable.
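
The mapping just described (0 to 10, 1 to 11, then a final 0) can be written out as a short sketch on strings of '0' and '1'. PrefixFreeBitsSketch and encode are hypothetical names, and the code illustrates the encoding only, not the library's adaptor.

```java
class PrefixFreeBitsSketch {
    /** Maps 0 to 10 and 1 to 11, then appends a final 0. */
    static String encode(String bits) {
        StringBuilder sb = new StringBuilder(bits.length() * 2 + 1);
        for (int i = 0; i < bits.length(); i++) {
            char b = bits.charAt(i);
            if (b != '0' && b != '1') throw new IllegalArgumentException("Not a bit: " + b);
            sb.append('1').append(b); // every original bit is preceded by a 1
        }
        return sb.append('0').toString(); // a 0 in a marker position signals the end
    }

    public static void main(String[] args) {
        System.out.println(encode("0"));  // 100
        System.out.println(encode("01")); // 10110
        // "0" is a prefix of "01", but the encodings are not prefix-comparable.
        System.out.println(encode("01").startsWith(encode("0")));
    }
}
```

An input of n bits becomes 2n + 1 bits, which is the linear additional space mentioned in the class description.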

    • fixedLong

      public static TransformationStrategy<Long> fixedLong()
      A transformation from longs to bit vectors that returns a fixed-size Long.SIZE-bit vector. Note that the bit vectors have as first bit the most significant bit of the underlying long integer, and that the first bit of the representation is flipped, so lexicographical and numerical order coincide.
      Implementation Notes:
      The flipping of the most significant bit was implemented in 2.6.18 to match lexicographical and numerical order for negative numbers, too, and made it necessary to bump the serial version of the strategy.
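
The effect of flipping the first bit can be checked with a minimal sketch that models the bit vector as a string of '0' and '1'. FixedLongSketch and fixedLongBits are hypothetical names; only the flip-then-write-MSB-first idea is taken from the description above.

```java
class FixedLongSketch {
    /** Flips the sign bit, then writes all Long.SIZE bits most significant bit first. */
    static String fixedLongBits(long x) {
        long flipped = x ^ Long.MIN_VALUE; // Long.MIN_VALUE has only the top bit set
        StringBuilder sb = new StringBuilder(Long.SIZE);
        for (int i = Long.SIZE - 1; i >= 0; i--) sb.append((flipped >>> i) & 1);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Negative numbers now start with 0 and non-negative ones with 1,
        // so comparing the bit strings agrees with comparing the longs.
        System.out.println(fixedLongBits(-1).compareTo(fixedLongBits(0)) < 0);
        System.out.println(fixedLongBits(0).compareTo(fixedLongBits(1)) < 0);
    }
}
```

Without the flip, two's complement would place every negative number after every non-negative one, since negative values have their most significant bit set.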
    • rawFixedLong

      public static TransformationStrategy<Long> rawFixedLong()
      A trivial, high-performance, raw transformation from longs to bit vectors that returns a fixed-size Long.SIZE-bit vector.
      Implementation Notes:
      Making fixedLong() lexicographical for all numbers in 2.6.18 made it necessary to bump the serial version of this strategy, too.