Class UkkonenSuffixTree


  • public class UkkonenSuffixTree
    extends Object

    A suffix tree is an efficient method for encoding the frequencies of motifs in a sequence. They are sometimes used to quickly screen for similar sequences. For instance, all motifs of length up to 2 in the sequence AAGT could be encoded as:

     root(4)
     |
     A(2)--------G(1)-----T(1)
     |           |
     A(1)--G(1)  T(1)
     

    It supports addition of elements both as String and SymbolList. They should not be mixed together. The strings are also terminated internally, so it is possible to go and add more than one string to the suffix tree.

    Some more work need to be done on how data should be generated from this class. If you need something that's not in there, please e-mail the list at biojava-dev@biojava.org and I'll add it in there. <\p>

    Version:
    1.3
    Author:
    Francois Pepin
    • Method Detail

      • addSequence

        public void addSequence​(String seq,
                                String name,
                                boolean doNotTerminate)
        Add a sequence into the tree. If there are more sequences, they should be separated by a terminationChar ($ by default). If none exist, it is assumed that there is only 1 continuous sequence to be added.
        Parameters:
        seq - the sequence/sequences to be added into the tree.
        doNotTerminate - whether we should terminate the sequence if it's non-terminated.
      • walkTo

        public UkkonenSuffixTree.SuffixNode walkTo​(UkkonenSuffixTree.SuffixNode starting,
                                                   String source,
                                                   int from,
                                                   int to)
        This method is used to walk down the tree, from a given node. The rule variable can be used to check where the walk stopped. Note that rule 3 means that the string used to walk down the tree does not match (which is a bit different from the construction where rule 3 implies that only the last character doesn't match.

        The String is encoded as a substring of a given source. This is done to avoid replicating the string. To send walk down the string x from the root, one would call walkTo(root,x,0,x.length()).

        Parameters:
        starting - the root of the subtree we're walking down form.
        source - a superstring that contains the string we're using to walking down. source.subtring(from,to) should give the string we're walking down from.
        from - the start position (inclusive) of the target string in the source.
        to - the end position (exclusive) of the target string in the node.
        Returns:
        a SuffixNode that the walk stopped at. If the walk stopped inside an edge. (check the rule variable to see where it stopped).
      • jumpTo

        public UkkonenSuffixTree.SuffixNode jumpTo​(UkkonenSuffixTree.SuffixNode starting,
                                                   CharSequence source,
                                                   int from,
                                                   int to)
        Just like walkTo, but faster when used during tree construction, as it assumes that a mismatch can only occurs with the last character of the target string.
        Parameters:
        starting - the root of the subtree we're walking down form.
        source - a superstring that contains the string we're using to walking down. source.subtring(from,to) should give the string we're walking down from.
        from - the start position (inclusive) of the target string in the source.
        to - the end position (exclusive) of the target string in the node.
        Returns:
        a SuffixNode that the walk stopped at. If the walk stopped inside an edge. (check the rule variable to see where it stopped).