/*
 *                    BioJava development code
 *
 * This code may be freely distributed and modified under the
 * terms of the GNU Lesser General Public Licence.  This should
 * be distributed with the code.  If you do not have a copy,
 * see:
 *
 *      http://www.gnu.org/copyleft/lesser.html
 *
 * Copyright for this code is held jointly by the individual
 * authors.  These should be listed in @author doc comments.
 *
 * For more information on the BioJava project and its aims,
 * or to join the biojava-l mailing list, visit the home page
 * at:
 *
 *      http://www.biojava.org/
 *
 */


package org.biojava.bio.symbol;

import org.biojava.bio.Annotatable;

/**
 * A single symbol.
 * <p>
 * This is the atomic unit of a SymbolList, or a sequence. It allows
 * for fine-grained fly-weighting, so that there can be one instance
 * of each symbol that is referenced multiple times.
 * <p>
 * Symbols from finite alphabets are identifiable using the == operator.
 * Symbols from infinite alphabets may have some specific API to test for
 * equality, but should really override the equals() method.
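 * <p>
 * A minimal sketch of the identity test, assuming the <code>DNATools</code>
 * helper class from <code>org.biojava.bio.seq</code> is available (the
 * variable names are illustrative only):
 * <pre>
 * Symbol first = DNATools.a();
 * Symbol second = DNATools.a();
 * // DNA is a finite alphabet, so both references should be the same instance
 * boolean same = (first == second);
 * </pre>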
 * <p>
 * Some symbols represent a single token in the sequence. For example, there is
 * a Symbol instance for adenine in DNA, and another one for cytosine.
 * Symbols can potentially represent sets of Symbols. For example, n represents
 * any DNA Symbol, and X any protein Symbol. Gap represents the knowledge that
 * there is no Symbol. In addition, some symbols represent ordered lists of
 * other Symbols. For example, the codon agt can be represented by a single
 * Symbol from the Alphabet DNAxDNAxDNA. Symbols can represent ambiguity over
 * these complex symbols. For example, you could construct a Symbol instance
 * that represents the codons atn. This matches the codons {ata, att, atg, atc}.
 * It is also possible to build a Symbol instance that represents all stop
 * codons {taa, tag, tga}, which cannot be represented in terms of a
 * single ambiguous n-tuple.
 * <p>
 * There are three Symbol interfaces. Symbol is the most generic. It has the
 * method getName so that the Symbol can be textually represented.
 * In addition, it defines getMatches, which returns an Alphabet over all the
 * AtomicSymbol instances that match the Symbol (N would return an Alphabet
 * containing {A, G, C, T}, and Gap would return {}).
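 * <p>
 * A minimal sketch of getMatches on an ambiguity symbol, assuming the
 * <code>DNATools</code> helper class from <code>org.biojava.bio.seq</code> is
 * available (the variable names are illustrative only):
 * <pre>
 * Symbol n = DNATools.n();
 * Alphabet matches = n.getMatches();
 * // n matches every atomic DNA symbol, so this is expected to be true
 * boolean matchesA = matches.contains(DNATools.a());
 * </pre>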
 * <p>
 * BasisSymbol instances can always be represented by an n-tuple of BasisSymbol
 * instances. BasisSymbol adds the method getSymbols so that you can retrieve
 * this list. For example, the tuple [ant] is a BasisSymbol, as it is uniquely
 * specified by the three BasisSymbol instances a, n and t. n is a BasisSymbol
 * instance as it is uniquely represented by itself.
 * <p>
 * AtomicSymbol instances specialize BasisSymbol by guaranteeing that getMatches
 * returns a set containing only that instance. That is, they are indivisible.
 * The DNA nucleotides are instances of AtomicSymbol, as are individual codons.
 * The stop codon {tag} will have a getMatches method that returns {tag},
 * a getBases method that also returns {tag} and a getSymbols method that returns
 * the List [t, a, g]. {tna} is a BasisSymbol but not an AtomicSymbol as it
 * matches four AtomicSymbol instances {taa, tga, tca, tta}. It follows that
 * each symbol in getSymbols for an AtomicSymbol instance will also be
 * an AtomicSymbol instance.
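 * <p>
 * A minimal sketch of the codon example, assuming the
 * <code>AlphabetManager</code> and <code>DNATools</code> helpers are available
 * and behave as their own documentation describes (the variable names are
 * illustrative only):
 * <pre>
 * // Build the DNAxDNAxDNA alphabet, then look up the codon tag.
 * Alphabet codons = AlphabetManager.getCrossProductAlphabet(
 *     Collections.nCopies(3, DNATools.getDNA()));
 * // getSymbol may throw IllegalSymbolException
 * Symbol tag = codons.getSymbol(
 *     Arrays.asList(new Symbol[] { DNATools.t(), DNATools.a(), DNATools.g() }));
 * // tag is atomic, so it is expected to match itself
 * boolean matchesItself = tag.getMatches().contains(tag);
 * // Viewed as a BasisSymbol, its component list should be [t, a, g]
 * List parts = ((BasisSymbol) tag).getSymbols();
 * </pre>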
 *
 * @author Matthew Pocock
 */
public interface Symbol extends Annotatable {
  /**
   * The long name for the symbol.
   *
   * @return  the long name
   */
  String getName();

  /**
   * The alphabet containing the symbols matched by this ambiguity symbol.
   * <p>
   * This alphabet contains all of, and only, the symbols matched by this
   * symbol. For example, the symbol representing the DNA
   * ambiguity code for W would contain the symbols for A and T from the DNA
   * alphabet.
   *
   * @return  the Alphabet of symbols matched by this
   *          symbol
   */
  Alphabet getMatches();
}