Class PermutedFrontCodedStringList

  • All Implemented Interfaces:
    ObjectCollection<CharSequence>, ObjectIterable<CharSequence>, ObjectList<CharSequence>, Stack<CharSequence>, Serializable, Comparable<List<? extends CharSequence>>, Iterable<CharSequence>, Collection<CharSequence>, List<CharSequence>

    public class PermutedFrontCodedStringList
    extends AbstractObjectList<CharSequence>
    implements Serializable
    A FrontCodedStringList whose indices are permuted.

    It may happen that a list of strings compresses very well using front coding, but unfortunately alphabetical order is not the right order for the strings in the list. Instances of this class wrap an instance of FrontCodedStringList together with a permutation π: inquiries with index i will actually return the string with index πi.

    In case you start from a newline-delimited non-sorted list of UTF-8 strings, the simplest way to build an instance of this map is obtaining a front-coded string list and a permutation with a simple UN*X pipe (which also avoids storing the sorted strings):

     nl -v0 -nln | sort -k2 | tee >(cut -f1 >perm.txt) \
            | cut -f2 | java it.unimi.dsi.util.FrontCodedStringList tmp-lex.fcl
     
    The above command will read a list of strings from standard input, output a their sorted index list in perm.txt and create a tmp-lex.fcl front-coded string list containing the sorted list of strings.

    Important: you must be sure to be using the byte-by-byte collation order—in UN*X, be sure that LC_COLLATE=C. Failure to do so will result in an order-of-magnitude-slower sorting and worse compression.

    Now, in perm.txt you will find the permutation that you have to pass to this class (given that you will use the option -i). So the last step is just

     java it.unimi.dsi.util.PermutedFrontCodedStringList -i -t tmp-lex.fcl perm.txt your.fcl
     
    See Also:
    Serialized Form