java.lang.Object
- org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
  - org.biojavax.bio.seq.io.RichSequenceFormat.HeaderlessFormat
    - org.biojavax.bio.seq.io.UniProtFormat

All Implemented Interfaces:

SequenceFormat, RichSequenceFormat
```
public class UniProtFormat
extends RichSequenceFormat.HeaderlessFormat
```
Format reader for UniProt files. This version of the UniProt format generates and writes RichSequence objects. It is loosely based on code from the old, deprecated org.biojava.bio.seq.io.EMBLLikeFormat object. Since 1.7, the parser also reads the International Protein Index (IPI) pseudo-UniProt format.

Since:

1.5

Author:

Richard Holland, Mark Schreiber, George Waldon

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`static class`	`UniProtFormat.Terms`	Implements some UniProt-specific terms.
- Nested classes/interfaces inherited from interface org.biojavax.bio.seq.io.RichSequenceFormat
  RichSequenceFormat.BasicFormat, RichSequenceFormat.HeaderlessFormat

Field Summary

Fields
Modifier and Type	Field	Description
`protected static String`	`ACCESSION_TAG`
`protected static String`	`AUTHORS_TAG`
`protected static String`	`COMMENT_TAG`
`protected static String`	`CONSORTIUM_TAG`
`protected static String`	`DATABASE_XREF_TAG`
`protected static String`	`DATE_TAG`
`protected static String`	`DEFINITION_TAG`
`protected static Pattern`	`dp_ipi`
`protected static Pattern`	`dp_uniprot`
`protected static String`	`END_SEQUENCE_TAG`
`protected static String`	`FEATURE_TAG`
`protected static Pattern`	`fp`
`protected static String`	`GENE_TAG`
`protected static Pattern`	`headerLine`
`protected static String`	`KEYWORDS_TAG`
`protected static String`	`LOCATION_TAG`
`protected static String`	`LOCUS_TAG`
`protected static Pattern`	`lp_ipi`
`protected static Pattern`	`lp_uniprot`
`protected static String`	`ORGANELLE_TAG`
`protected static String`	`ORGANISM_TAG`
`protected static String`	`PROTEIN_EXIST_TAG`
`protected static String`	`RC_LINE_TAG`
`protected static String`	`REFERENCE_TAG`
`protected static String`	`REFERENCE_XREF_TAG`
`protected static String`	`RP_LINE_TAG`
`protected static Pattern`	`rppat`
`protected static String`	`SOURCE_TAG`
`protected static String`	`START_SEQUENCE_TAG`
`protected static String`	`TAXON_TAG`
`protected static String`	`TITLE_TAG`
`static String`	`UNIPROT_FORMAT`	The name of this format

Constructor Summary

Constructors
Constructor	Description
`UniProtFormat()`

Method Summary

Modifier and Type	Method	Description
`boolean`	`canRead(BufferedInputStream stream)`	Check to see if a given stream is in our format.
`boolean`	`canRead(File file)`	Check to see if a given file is in our format.
`String`	`getDefaultFormat()`	`getDefaultFormat` returns the String identifier for the default sub-format written by a `SequenceFormat` implementation.
`SymbolTokenization`	`guessSymbolTokenization(BufferedInputStream stream)`	On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
`SymbolTokenization`	`guessSymbolTokenization(File file)`	On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
`boolean`	`readRichSequence(BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns)`	Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols.
`boolean`	`readSequence(BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener)`	Read a sequence and pass data on to a SeqIOListener.
`void`	`writeSequence(Sequence seq, PrintStream os)`	`writeSequence` writes a sequence to the specified PrintStream, using the default format.
`void`	`writeSequence(Sequence seq, String format, PrintStream os)`	`writeSequence` writes a sequence to the specified `PrintStream`, using the specified format.
`void`	`writeSequence(Sequence seq, Namespace ns)`	Writes a sequence out to the outputstream given by beginWriting() using the default format of the implementing class.

Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.HeaderlessFormat
beginWriting, finishWriting

Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
getElideComments, getElideFeatures, getElideReferences, getElideSymbols, getLineWidth, getPrintStream, setElideComments, setElideFeatures, setElideReferences, setElideSymbols, setLineWidth, setPrintStream

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - UNIPROT_FORMAT
```
public static final String UNIPROT_FORMAT
```
    The name of this format
    
    See Also:
    
    Constant Field Values
  - LOCUS_TAG
```
protected static final String LOCUS_TAG
```
    See Also:
    
    Constant Field Values
  - ACCESSION_TAG
```
protected static final String ACCESSION_TAG
```
    See Also:
    
    Constant Field Values
  - DEFINITION_TAG
```
protected static final String DEFINITION_TAG
```
    See Also:
    
    Constant Field Values
  - DATE_TAG
```
protected static final String DATE_TAG
```
    See Also:
    
    Constant Field Values
  - SOURCE_TAG
```
protected static final String SOURCE_TAG
```
    See Also:
    
    Constant Field Values
  - ORGANELLE_TAG
```
protected static final String ORGANELLE_TAG
```
    See Also:
    
    Constant Field Values
  - ORGANISM_TAG
```
protected static final String ORGANISM_TAG
```
    See Also:
    
    Constant Field Values
  - TAXON_TAG
```
protected static final String TAXON_TAG
```
    See Also:
    
    Constant Field Values
  - GENE_TAG
```
protected static final String GENE_TAG
```
    See Also:
    
    Constant Field Values
  - DATABASE_XREF_TAG
```
protected static final String DATABASE_XREF_TAG
```
    See Also:
    
    Constant Field Values
  - PROTEIN_EXIST_TAG
```
protected static final String PROTEIN_EXIST_TAG
```
    See Also:
    
    Constant Field Values
  - REFERENCE_TAG
```
protected static final String REFERENCE_TAG
```
    See Also:
    
    Constant Field Values
  - RP_LINE_TAG
```
protected static final String RP_LINE_TAG
```
    See Also:
    
    Constant Field Values
  - REFERENCE_XREF_TAG
```
protected static final String REFERENCE_XREF_TAG
```
    See Also:
    
    Constant Field Values
  - AUTHORS_TAG
```
protected static final String AUTHORS_TAG
```
    See Also:
    
    Constant Field Values
  - CONSORTIUM_TAG
```
protected static final String CONSORTIUM_TAG
```
    See Also:
    
    Constant Field Values
  - TITLE_TAG
```
protected static final String TITLE_TAG
```
    See Also:
    
    Constant Field Values
  - LOCATION_TAG
```
protected static final String LOCATION_TAG
```
    See Also:
    
    Constant Field Values
  - RC_LINE_TAG
```
protected static final String RC_LINE_TAG
```
    See Also:
    
    Constant Field Values
  - KEYWORDS_TAG
```
protected static final String KEYWORDS_TAG
```
    See Also:
    
    Constant Field Values
  - COMMENT_TAG
```
protected static final String COMMENT_TAG
```
    See Also:
    
    Constant Field Values
  - FEATURE_TAG
```
protected static final String FEATURE_TAG
```
    See Also:
    
    Constant Field Values
  - START_SEQUENCE_TAG
```
protected static final String START_SEQUENCE_TAG
```
    See Also:
    
    Constant Field Values
  - END_SEQUENCE_TAG
```
protected static final String END_SEQUENCE_TAG
```
    See Also:
    
    Constant Field Values
  - lp_uniprot
```
protected static final Pattern lp_uniprot
```
  - lp_ipi
```
protected static final Pattern lp_ipi
```
  - rppat
```
protected static final Pattern rppat
```
  - dp_uniprot
```
protected static final Pattern dp_uniprot
```
  - dp_ipi
```
protected static final Pattern dp_ipi
```
  - fp
```
protected static final Pattern fp
```
  - headerLine
```
protected static final Pattern headerLine
```
- Constructor Detail
  - UniProtFormat
```
public UniProtFormat()
```
- Method Detail
  - canRead
```
public boolean canRead(File file)
                throws IOException
```
    Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in UniProt format if the first line matches the UniProt format for the ID line.
    
    Specified by:
    
    canRead in interface RichSequenceFormat
    
    Overrides:
    
    canRead in class RichSequenceFormat.BasicFormat
    
    Parameters:
    
    file - the File to check.
    
    Returns:
    
    true if the file is readable by this format, false if not.
    
    Throws:
    
    IOException - in case the file is inaccessible.
  - guessSymbolTokenization
```
public SymbolTokenization guessSymbolTokenization(File file)
                                           throws IOException
```
    On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. This implementation always returns a protein tokenizer.
    
    Specified by:
    
    guessSymbolTokenization in interface RichSequenceFormat
    
    Overrides:
    
    guessSymbolTokenization in class RichSequenceFormat.BasicFormat
    
    Parameters:
    
    file - the File object to guess the format of.
    
    Returns:
    
    a SymbolTokenization to read the file with.
    
    Throws:
    
    IOException - if the file is unrecognisable or inaccessible.
  - canRead
```
public boolean canRead(BufferedInputStream stream)
                throws IOException
```
    Check to see if a given stream is in our format. A stream is in UniProt format if the first line matches the UniProt format for the ID line.
    
    Parameters:
    
    stream - the BufferedInputStream to check.
    
    Returns:
    
    true if the stream is readable by this format, false if not.
    
    Throws:
    
    IOException - in case the stream is inaccessible.
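    The first-line check described above can be sketched with the standard library alone. This is a minimal illustration, not the BioJava implementation: the class name `IdLineSniffer`, the read-ahead limit, and the deliberately loose `ID_LINE` pattern are all assumptions (the patterns the class really uses, such as `lp_uniprot` and `lp_ipi`, are not shown on this page). The sample entry in the usage note is illustrative text only.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Illustrative only: a stand-in for the kind of check canRead performs. */
final class IdLineSniffer {

    // Hypothetical pattern: "ID", three spaces, then an entry name and anything after it.
    private static final Pattern ID_LINE = Pattern.compile("ID {3}\\S+.*");

    // Assumed upper bound on how many bytes we look at before rewinding.
    private static final int READ_AHEAD_LIMIT = 2000;

    /** Peeks at the first line, then rewinds so a parser can still read the whole stream. */
    static boolean firstLineLooksLikeId(BufferedInputStream stream) throws IOException {
        stream.mark(READ_AHEAD_LIMIT);
        try {
            StringBuilder first = new StringBuilder();
            int bytesRead = 0;
            int b;
            // Read byte by byte so we never consume more than the mark allows.
            while (bytesRead < READ_AHEAD_LIMIT && (b = stream.read()) != -1 && b != '\n') {
                bytesRead++;
                if (b != '\r') {
                    first.append((char) b);
                }
            }
            return ID_LINE.matcher(first).matches();
        } finally {
            stream.reset();
        }
    }

    /** Convenience wrapper for in-memory text. */
    static boolean check(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        try (BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(bytes))) {
            return firstLineLooksLikeId(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    For example, `IdLineSniffer.check("ID   CYC_HUMAN   Reviewed;   105 AA.\n")` returns true under these assumptions, while text whose first line starts with `>` does not match. Reading one byte at a time keeps the look-ahead within the limit passed to `mark`, so `reset` stays valid even for large inputs.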
  - guessSymbolTokenization
```
public SymbolTokenization guessSymbolTokenization(BufferedInputStream stream)
                                           throws IOException
```
    On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the stream. For formats that accept multiple tokenizations, it's up to you how you do it. This implementation always returns a protein tokenizer.
    
    Parameters:
    
    stream - the BufferedInputStream object to guess the format of.
    
    Returns:
    
    a SymbolTokenization to read the stream with.
    
    Throws:
    
    IOException - if the stream is unrecognisable or inaccessible.
  - readSequence
```
public boolean readSequence(BufferedReader reader,
                            SymbolTokenization symParser,
                            SeqIOListener listener)
                     throws IllegalSymbolException,
                            IOException,
                            ParseException
```
    Read a sequence and pass data on to a SeqIOListener.
    
    Parameters:
    
    reader - The stream of data to parse.
    
    symParser - A SymbolTokenization defining a mapping from character data to Symbols.
    
    listener - A listener to notify when data is extracted from the stream.
    
    Returns:
    
    a boolean indicating whether or not the stream contains any more sequences.
    
    Throws:
    
    IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
    
    IOException - if an error occurs while reading from the stream.
    
    ParseException
  - readRichSequence
```
public boolean readRichSequence(BufferedReader reader,
                                SymbolTokenization symParser,
                                RichSeqIOListener rlistener,
                                Namespace ns)
                         throws IllegalSymbolException,
                                IOException,
                                ParseException
```
    Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. Events are passed to the listener, and the namespace used for sequences read is the one given. If the namespace is null, then the default namespace for the parser is used, which may depend on individual implementations of this interface.
    
    Parameters:
    
    reader - the input source
    
    symParser - the tokenizer which understands the sequence being read
    
    rlistener - the listener to send sequence events to
    
    ns - the namespace to read sequences into.
    
    Returns:
    
    true if there is more to read after this, false otherwise.
    
    Throws:
    
    IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
    
    IOException - if there was a read error.
    
    ParseException
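    The boolean return value lends itself to a do/while loop that keeps reading until the parser reports nothing is left. The sketch below imitates that contract using only the standard library; it is not the BioJava parser. `RecordLoopSketch`, `LineListener`, `readRecord`, and `accessions` are hypothetical names, and the only format facts assumed are that a UniProt flat-file line starts with a two-letter code followed by three spaces, and that a record ends with a `//` line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: mimics the "true if there is more to read" contract. */
final class RecordLoopSketch {

    /** Hypothetical, much simplified stand-in for a sequence listener. */
    interface LineListener {
        void line(String code, String data);
    }

    /** Reads one record up to its "//" terminator; returns true if more input follows. */
    static boolean readRecord(BufferedReader reader, LineListener listener) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("//")) {
                break; // end of this record
            }
            if (line.length() < 2) {
                continue; // skip blank or stray lines
            }
            String code = line.substring(0, 2); // two spaces for sequence data lines
            String data = line.length() > 5 ? line.substring(5) : "";
            listener.line(code, data);
        }
        // Skip trailing whitespace; report whether another record starts here.
        while (true) {
            reader.mark(1);
            int c = reader.read();
            if (c == -1) {
                return false;
            }
            if (!Character.isWhitespace(c)) {
                reader.reset(); // put the character back for the next call
                return true;
            }
        }
    }

    /** Usage: loop until readRecord reports no more input, collecting AC line data. */
    static List<String> accessions(String text) {
        List<String> found = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            boolean more;
            do {
                more = readRecord(reader, (code, data) -> {
                    if (code.equals("AC")) {
                        found.add(data);
                    }
                });
            } while (more);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }
}
```

    With the real class, the same loop shape would call readRichSequence with a SymbolTokenization, a RichSeqIOListener, and a Namespace (or null for the default) in place of the hypothetical `readRecord`.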
  - writeSequence
```
public void writeSequence(Sequence seq,
                          PrintStream os)
                   throws IOException
```
    writeSequence writes a sequence to the specified PrintStream, using the default format.
    
    Parameters:
    
    seq - the sequence to write out.
    
    os - the printstream to write to.
    
    Throws:
    
    IOException
  - writeSequence
```
public void writeSequence(Sequence seq,
                          String format,
                          PrintStream os)
                   throws IOException
```
    writeSequence writes a sequence to the specified PrintStream, using the specified format.
    
    Parameters:
    
    seq - a Sequence to write out.
    
    format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
    
    os - a PrintStream object.
    
    Throws:
    
    IOException - if an error occurs.
  - writeSequence
```
public void writeSequence(Sequence seq,
                          Namespace ns)
                   throws IOException
```
    Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class. In general, if a namespace is given, sequences are written with that namespace; otherwise they are written with the default namespace of the implementing class (which is usually the namespace of the sequence itself). If you pass this method a sequence which is not a RichSequence, it will attempt to convert it using RichSequence.Tools.enrich(). This does not guarantee a perfect conversion, so it is better to use RichSequences to start with. In this implementation the namespace is ignored, as UniProt has no concept of it.
    
    Parameters:
    
    seq - the sequence to write
    
    ns - the namespace to write it with
    
    Throws:
    
    IOException - in case it couldn't write something
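    The call order implied here (set a print stream, begin writing, write each sequence, finish writing) can be sketched with a small stand-in class. Everything below is hypothetical apart from the method names it borrows from this page (`setPrintStream`, `beginWriting`, `finishWriting`); it assumes, as the name HeaderlessFormat suggests, that beginning and finishing emit nothing, and it writes only skeleton lines rather than a complete UniProt record.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

/** Illustrative only: a stand-in showing the order of calls, not the BioJava writer. */
final class HeaderlessWriterSketch {

    private PrintStream out;

    void setPrintStream(PrintStream os) {
        this.out = os;
    }

    void beginWriting() {
        // Assumed headerless: nothing to emit before the first record.
    }

    void finishWriting() {
        // Assumed headerless: nothing to emit after the last record.
    }

    /** Writes a skeleton record: an ID line, one sequence data line, and the terminator. */
    void writeRecord(String entryName, String residues) {
        if (out == null) {
            throw new IllegalStateException("call setPrintStream first");
        }
        out.print("ID   " + entryName + "\n");
        out.print("     " + residues + "\n");
        out.print("//\n");
    }

    /** Usage: the full begin/write/finish sequence against an in-memory stream. */
    static String render(String entryName, String residues) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream ps = new PrintStream(buffer);
        HeaderlessWriterSketch writer = new HeaderlessWriterSketch();
        writer.setPrintStream(ps);
        writer.beginWriting();
        writer.writeRecord(entryName, residues);
        writer.finishWriting();
        ps.flush();
        return buffer.toString();
    }
}
```

    With the real class, the equivalent steps would be setPrintStream, beginWriting, writeSequence(seq, ns) for each sequence, and finishWriting, where ns may be null because the namespace is ignored.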
  - getDefaultFormat
```
public String getDefaultFormat()
```
    getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
    
    Returns:
    
    a String.
