Class Trie

  • Direct Known Subclasses:
    CharTrie, IntTrie

    public abstract class Trie
    extends Object

    A trie is a kind of compressed, serializable table of values associated with Unicode code points (0..0x10ffff).

    This class defines the basic structure of a trie and provides methods to retrieve the offsets to the actual data.

    Data will be the form of an array of basic types, char or int.

    The actual data format will have to be specified by the user in the inner static interface com.ibm.icu.impl.Trie.DataManipulate.

    This trie implementation is optimized for getting offset while walking forward through a UTF-16 string. Therefore, the simplest and fastest access macros are the fromLead() and fromOffsetTrail() methods. The fromBMP() method are a little more complicated; they get offsets even for lead surrogate codepoints, while the fromLead() method get special "folded" offsets for lead surrogate code units if there is relevant data associated with them. From such a folded offsets, an offset needs to be extracted to supply to the fromOffsetTrail() methods. To handle such supplementary codepoints, some offset information are kept in the data.

    Methods in com.ibm.icu.impl.Trie.DataManipulate are called to retrieve that offset from the folded value for the lead surrogate unit.

    For examples of use, see com.ibm.icu.impl.CharTrie or com.ibm.icu.impl.IntTrie.

    Since:
    release 2.1, Jan 01 2002
    See Also:
    CharTrie, IntTrie
    • Field Detail

      • LEAD_INDEX_OFFSET_

        protected static final int LEAD_INDEX_OFFSET_
        Lead surrogate code points' index displacement in the index array. 0x10000-0xd800=0x2800 0x2800 >> INDEX_STAGE_1_SHIFT_
        See Also:
        Constant Field Values
      • INDEX_STAGE_1_SHIFT_

        protected static final int INDEX_STAGE_1_SHIFT_
        Shift size for shifting right the input index. 1..9
        See Also:
        Constant Field Values
      • INDEX_STAGE_2_SHIFT_

        protected static final int INDEX_STAGE_2_SHIFT_
        Shift size for shifting left the index array values. Increases possible data size with 16-bit index values at the cost of compactability. This requires blocks of stage 2 data to be aligned by DATA_GRANULARITY. 0..INDEX_STAGE_1_SHIFT
        See Also:
        Constant Field Values
      • DATA_BLOCK_LENGTH

        protected static final int DATA_BLOCK_LENGTH
        Number of data values in a stage 2 (data array) block.
        See Also:
        Constant Field Values
      • INDEX_STAGE_3_MASK_

        protected static final int INDEX_STAGE_3_MASK_
        Mask for getting the lower bits from the input index. DATA_BLOCK_LENGTH - 1.
        See Also:
        Constant Field Values
      • SURROGATE_BLOCK_BITS

        protected static final int SURROGATE_BLOCK_BITS
        Number of bits of a trail surrogate that are used in index table lookups.
        See Also:
        Constant Field Values
      • SURROGATE_BLOCK_COUNT

        protected static final int SURROGATE_BLOCK_COUNT
        Number of index (stage 1) entries per lead surrogate. Same as number of index entries for 1024 trail surrogates, ==0x400>>INDEX_STAGE_1_SHIFT_
        See Also:
        Constant Field Values
      • BMP_INDEX_LENGTH

        protected static final int BMP_INDEX_LENGTH
        Length of the BMP portion of the index (stage 1) array.
        See Also:
        Constant Field Values
      • SURROGATE_MASK_

        protected static final int SURROGATE_MASK_
        Surrogate mask to use when shifting offset to retrieve supplementary values
        See Also:
        Constant Field Values
      • m_index_

        protected char[] m_index_
        Index or UTF16 characters
      • m_dataManipulate_

        protected Trie.DataManipulate m_dataManipulate_
        Internal TrieValue which handles the parsing of the data value. This class is to be implemented by the user
      • m_dataOffset_

        protected int m_dataOffset_
        Start index of the data portion of the trie. CharTrie combines index and data into a char array, so this is used to indicate the initial offset to the data portion. Note this index always points to the initial value.
      • m_dataLength_

        protected int m_dataLength_
        Length of the data array
      • HEADER_LENGTH_

        protected static final int HEADER_LENGTH_
        Size of Trie header in bytes
        See Also:
        Constant Field Values
      • HEADER_OPTIONS_LATIN1_IS_LINEAR_MASK_

        protected static final int HEADER_OPTIONS_LATIN1_IS_LINEAR_MASK_
        Latin 1 option mask
        See Also:
        Constant Field Values
      • HEADER_SIGNATURE_

        protected static final int HEADER_SIGNATURE_
        Constant number to authenticate the byte block
        See Also:
        Constant Field Values
      • HEADER_OPTIONS_INDEX_SHIFT_

        protected static final int HEADER_OPTIONS_INDEX_SHIFT_
        See Also:
        Constant Field Values
      • HEADER_OPTIONS_DATA_IS_32_BIT_

        protected static final int HEADER_OPTIONS_DATA_IS_32_BIT_
        See Also:
        Constant Field Values
    • Constructor Detail

      • Trie

        protected Trie​(ByteBuffer bytes,
                       Trie.DataManipulate dataManipulate)
        Trie constructor for CharTrie use.
        Parameters:
        bytes - data of an ICU data file, containing the trie
        dataManipulate - object containing the information to parse the trie data
      • Trie

        protected Trie​(char[] index,
                       int options,
                       Trie.DataManipulate dataManipulate)
        Trie constructor
        Parameters:
        index - array to be used for index
        options - used by the trie
        dataManipulate - object containing the information to parse the trie data
    • Method Detail

      • isLatin1Linear

        public final boolean isLatin1Linear()
        Determines if this trie has a linear latin 1 array
        Returns:
        true if this trie has a linear latin 1 array, false otherwise
      • equals

        public boolean equals​(Object other)
        Checks if the argument Trie has the same data as this Trie. Attributes are checked but not the index data.
        Overrides:
        equals in class Object
        Parameters:
        other - Trie to check
        Returns:
        true if the argument Trie has the same data as this Trie, false otherwise
        See Also:
        Object.hashCode()
      • hashCode

        public int hashCode()
        Description copied from class: Object
        Returns an integer hash code for this object. By contract, any two objects for which Object.equals(java.lang.Object) returns true must return the same hash code value. This means that subclasses of Object usually override both methods or neither method.

        Note that hash values must not change over time unless information used in equals comparisons also changes.

        See Writing a correct hashCode method if you intend implementing your own hashCode method.

        Overrides:
        hashCode in class Object
        Returns:
        this object's hash code.
        See Also:
        Object.equals(java.lang.Object)
      • getSerializedDataSize

        public int getSerializedDataSize()
        Gets the serialized data file size of the Trie. This is used during trie data reading for size checking purposes.
        Returns:
        size size of serialized trie data file in terms of the number of bytes
      • getSurrogateOffset

        protected abstract int getSurrogateOffset​(char lead,
                                                  char trail)
        Gets the offset to the data which the surrogate pair points to.
        Parameters:
        lead - lead surrogate
        trail - trailing surrogate
        Returns:
        offset to data
      • getValue

        protected abstract int getValue​(int index)
        Gets the value at the argument index
        Parameters:
        index - value at index will be retrieved
        Returns:
        32 bit value
      • getInitialValue

        protected abstract int getInitialValue()
        Gets the default initial value
        Returns:
        32 bit value
      • getRawOffset

        protected final int getRawOffset​(int offset,
                                         char ch)
        Gets the offset to the data which the index ch after variable offset points to. Note for locating a non-supplementary character data offset, calling

        getRawOffset(0, ch);

        will do. Otherwise if it is a supplementary character formed by surrogates lead and trail. Then we would have to call getRawOffset() with getFoldingIndexOffset(). See getSurrogateOffset().
        Parameters:
        offset - index offset which ch is to start from
        ch - index to be used after offset
        Returns:
        offset to the data
      • getBMPOffset

        protected final int getBMPOffset​(char ch)
        Gets the offset to data which the BMP character points to Treats a lead surrogate as a normal code point.
        Parameters:
        ch - BMP character
        Returns:
        offset to data
      • getLeadOffset

        protected final int getLeadOffset​(char ch)
        Gets the offset to the data which this lead surrogate character points to. Data at the returned offset may contain folding offset information for the next trailing surrogate character.
        Parameters:
        ch - lead surrogate character
        Returns:
        offset to data
      • getCodePointOffset

        protected final int getCodePointOffset​(int ch)
        Internal trie getter from a code point. Could be faster(?) but longer with if((c32)<=0xd7ff) { (result)=_TRIE_GET_RAW(trie, data, 0, c32); } Gets the offset to data which the codepoint points to
        Parameters:
        ch - codepoint
        Returns:
        offset to data
      • unserialize

        protected void unserialize​(ByteBuffer bytes)

        Parses the byte buffer and creates the trie index with it.

        The position of the input ByteBuffer must be right after the trie header.

        This is overwritten by the child classes.

        Parameters:
        bytes - buffer containing trie data
      • isIntTrie

        protected final boolean isIntTrie()
        Determines if this is a 32 bit trie
        Returns:
        true if options specifies this is a 32 bit trie
      • isCharTrie

        protected final boolean isCharTrie()
        Determines if this is a 16 bit trie
        Returns:
        true if this is a 16 bit trie