Package org.biojava.utils.regex

This package performs regular expression searches of SymbolLists defined in arbitrary Alphabets.

The implementation relies on the java.util.regex package for the actual matching. Earlier work had already defined a SymbolListCharSequence class, which wraps a SymbolList so that java.util.regex can be applied to the resulting CharSequence. This package extends that approach in two ways.
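
The wrapping idea can be illustrated with a minimal sketch that uses only the JDK. TokenListCharSequence and firstMatch are hypothetical names invented for this example; this is not the BioJava SymbolListCharSequence class, and a plain list of characters stands in for a SymbolList.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a SymbolList-backed CharSequence (not the BioJava class).
final class TokenListCharSequence implements CharSequence {
    private final List<Character> tokens;

    TokenListCharSequence(List<Character> tokens) {
        this.tokens = tokens;
    }

    @Override
    public int length() {
        return tokens.size();
    }

    @Override
    public char charAt(int index) {
        return tokens.get(index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new TokenListCharSequence(tokens.subList(start, end));
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length());
        for (char c : tokens) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Returns the 0-based start of the first match, or -1 if there is none.
    static int firstMatch(String regex, List<Character> tokens) {
        Matcher m = Pattern.compile(regex).matcher(new TokenListCharSequence(tokens));
        return m.find() ? m.start() : -1;
    }
}
```

Because java.util.regex.Pattern.matcher() accepts any CharSequence, the regex engine reads the wrapped list directly through charAt() and no intermediate String has to be built.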

First, this package implements a SymbolTokenization for Alphabets that do not have one defined. It does so by arbitrarily mapping each AtomicSymbol in the Alphabet to a Unicode character in the private-use range. The String needed to define the regex can then be assembled by calling PatternFactory.charValue(), which returns the Unicode character value for a Symbol in the Alphabet.
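
As a rough illustration of the mapping, the following JDK-only sketch assigns private-use characters to symbols identified by name. The class PrivateUseTokens, its methods, and the symbol names are assumptions made for this example. It mirrors the role of PatternFactory.charValue() but is not the BioJava implementation, and the real assignment of characters may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one private-use character per symbol name.
final class PrivateUseTokens {
    // The BMP Private Use Area runs from U+E000 to U+F8FF (6400 code points).
    private static final char FIRST = '\uE000';
    private static final char LAST = '\uF8FF';

    private final Map<String, Character> table = new LinkedHashMap<>();

    PrivateUseTokens(List<String> symbolNames) {
        if (symbolNames.size() > LAST - FIRST + 1) {
            throw new IllegalArgumentException("too many symbols for the private-use range");
        }
        char next = FIRST;
        for (String name : symbolNames) {
            if (!table.containsKey(name)) {
                table.put(name, next++);
            }
        }
    }

    // Character standing for one symbol; used when assembling the regex String.
    char charValue(String name) {
        Character c = table.get(name);
        if (c == null) {
            throw new IllegalArgumentException("unknown symbol: " + name);
        }
        return c;
    }

    // Encodes a sequence of symbols as a String the regex engine can scan.
    String tokenize(List<String> names) {
        StringBuilder sb = new StringBuilder(names.size());
        for (String name : names) {
            sb.append(charValue(name));
        }
        return sb.toString();
    }
}
```

With such a table, a pattern meaning "one or more ser followed by tyr" would be assembled as charValue("ser") + "+" + charValue("tyr") and compiled with java.util.regex.Pattern.compile() in the usual way.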

Second, the structure of the package has been changed to resemble the classes in java.util.regex more closely, albeit adapted to SymbolLists. The two APIs are very similar.


Note that ambiguity symbols in the pattern String are expanded so that each of their component Symbols is accepted as an alternative at that position.
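
One way to picture this expansion is as a rewrite of each ambiguity code into a regex character class. The sketch below assumes lower-case IUPAC nucleotide codes and a pattern containing no backslash escapes or negated classes. AmbiguityExpander is a hypothetical name, and the package may perform the expansion differently internally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: rewrite IUPAC ambiguity codes as character classes.
final class AmbiguityExpander {
    private static final Map<Character, String> IUPAC = new HashMap<>();
    static {
        IUPAC.put('r', "ag");
        IUPAC.put('y', "ct");
        IUPAC.put('m', "ac");
        IUPAC.put('k', "gt");
        IUPAC.put('s', "cg");
        IUPAC.put('w', "at");
        IUPAC.put('b', "cgt");
        IUPAC.put('d', "agt");
        IUPAC.put('h', "act");
        IUPAC.put('v', "acg");
        IUPAC.put('n', "acgt");
    }

    // Replaces each ambiguity code with a class of its component symbols;
    // every other character (a, c, g, t and regex operators) is copied as-is.
    static String expand(String pattern) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            String members = IUPAC.get(c);
            if (members == null) {
                sb.append(c);
            } else {
                sb.append('[').append(members).append(']');
            }
        }
        return sb.toString();
    }
}
```

Under these assumptions, expand("gry") yields "g[ag][ct]", so the pattern accepts a purine in the second position and a pyrimidine in the third.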

Also, ambiguity and gap symbols in the target SymbolList are converted to the all-ambiguity symbol by the default tokenizers, so they can be expected not to match any specific symbol in the Pattern. This is usually the desired behaviour: a block of N in DNA, for example, should not be reported as matching every pattern.
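
The effect on the target side can be sketched in the same JDK-only style. The example assumes DNA written as characters, with 'n' standing in for the all-ambiguity symbol. TargetTokenizer and its methods are hypothetical and only imitate what the default tokenizers are described as doing.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: collapse ambiguity and gap symbols in the target to 'n'.
final class TargetTokenizer {
    // Keeps a, c, g and t; every other character (ambiguity codes, gaps) becomes 'n'.
    static String tokenizeTarget(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = Character.toLowerCase(raw.charAt(i));
            sb.append("acgt".indexOf(c) >= 0 ? c : 'n');
        }
        return sb.toString();
    }

    // True if the regex matches somewhere in the tokenized target.
    static boolean found(String regex, String rawTarget) {
        return Pattern.compile(regex).matcher(tokenizeTarget(rawTarget)).find();
    }
}
```

Since an expanded pattern only ever names the specific symbols a, c, g and t, the 'n' placeholder never satisfies it: found("[acgt]{3}", "acR-Ngt") is false because the longest run of specific symbols in the tokenized target has length two.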