Package org.biojavax.bio.seq.io
Class UniProtFormat
- java.lang.Object
  - org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
    - org.biojavax.bio.seq.io.RichSequenceFormat.HeaderlessFormat
      - org.biojavax.bio.seq.io.UniProtFormat
- All Implemented Interfaces:
- SequenceFormat, RichSequenceFormat
public class UniProtFormat extends RichSequenceFormat.HeaderlessFormat
Format reader for UniProt files. This version of the UniProt format generates and writes RichSequence objects. Loosely based on code from the old, deprecated org.biojava.bio.seq.io.EMBLLikeFormat class. Since 1.7, the parser also reads the International Protein Index (IPI) pseudo-UniProt format.
- Since:
- 1.5
- Author:
- Richard Holland, Mark Schreiber, George Waldon
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- static class UniProtFormat.Terms: Implements some UniProt-specific terms.
Nested classes/interfaces inherited from interface org.biojavax.bio.seq.io.RichSequenceFormat
RichSequenceFormat.BasicFormat, RichSequenceFormat.HeaderlessFormat
Field Summary
Fields (Modifier and Type, Field, Description):
- protected static String ACCESSION_TAG
- protected static String AUTHORS_TAG
- protected static String COMMENT_TAG
- protected static String CONSORTIUM_TAG
- protected static String DATABASE_XREF_TAG
- protected static String DATE_TAG
- protected static String DEFINITION_TAG
- protected static Pattern dp_ipi
- protected static Pattern dp_uniprot
- protected static String END_SEQUENCE_TAG
- protected static String FEATURE_TAG
- protected static Pattern fp
- protected static String GENE_TAG
- protected static Pattern headerLine
- protected static String KEYWORDS_TAG
- protected static String LOCATION_TAG
- protected static String LOCUS_TAG
- protected static Pattern lp_ipi
- protected static Pattern lp_uniprot
- protected static String ORGANELLE_TAG
- protected static String ORGANISM_TAG
- protected static String PROTEIN_EXIST_TAG
- protected static String RC_LINE_TAG
- protected static String REFERENCE_TAG
- protected static String REFERENCE_XREF_TAG
- protected static String RP_LINE_TAG
- protected static Pattern rppat
- protected static String SOURCE_TAG
- protected static String START_SEQUENCE_TAG
- protected static String TAXON_TAG
- protected static String TITLE_TAG
- static String UNIPROT_FORMAT: The name of this format.
Constructor Summary
Constructors (Constructor, Description):
- UniProtFormat()
Method Summary
Methods (Modifier and Type, Method, Description):
- boolean canRead(BufferedInputStream stream): Check to see if a given stream is in our format.
- boolean canRead(File file): Check to see if a given file is in our format.
- String getDefaultFormat(): getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
- SymbolTokenization guessSymbolTokenization(BufferedInputStream stream): On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
- SymbolTokenization guessSymbolTokenization(File file): On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
- boolean readRichSequence(BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns): Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols.
- boolean readSequence(BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener): Read a sequence and pass data on to a SeqIOListener.
- void writeSequence(Sequence seq, PrintStream os): writeSequence writes a sequence to the specified PrintStream, using the default format.
- void writeSequence(Sequence seq, String format, PrintStream os): writeSequence writes a sequence to the specified PrintStream, using the specified format.
- void writeSequence(Sequence seq, Namespace ns): Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class.
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.HeaderlessFormat
beginWriting, finishWriting
-
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
getElideComments, getElideFeatures, getElideReferences, getElideSymbols, getLineWidth, getPrintStream, setElideComments, setElideFeatures, setElideReferences, setElideSymbols, setLineWidth, setPrintStream
Field Detail
-
UNIPROT_FORMAT
public static final String UNIPROT_FORMAT
The name of this format.
- See Also:
- Constant Field Values
-
LOCUS_TAG
protected static final String LOCUS_TAG
- See Also:
- Constant Field Values
-
ACCESSION_TAG
protected static final String ACCESSION_TAG
- See Also:
- Constant Field Values
-
DEFINITION_TAG
protected static final String DEFINITION_TAG
- See Also:
- Constant Field Values
-
DATE_TAG
protected static final String DATE_TAG
- See Also:
- Constant Field Values
-
SOURCE_TAG
protected static final String SOURCE_TAG
- See Also:
- Constant Field Values
-
ORGANELLE_TAG
protected static final String ORGANELLE_TAG
- See Also:
- Constant Field Values
-
ORGANISM_TAG
protected static final String ORGANISM_TAG
- See Also:
- Constant Field Values
-
TAXON_TAG
protected static final String TAXON_TAG
- See Also:
- Constant Field Values
-
GENE_TAG
protected static final String GENE_TAG
- See Also:
- Constant Field Values
-
DATABASE_XREF_TAG
protected static final String DATABASE_XREF_TAG
- See Also:
- Constant Field Values
-
PROTEIN_EXIST_TAG
protected static final String PROTEIN_EXIST_TAG
- See Also:
- Constant Field Values
-
REFERENCE_TAG
protected static final String REFERENCE_TAG
- See Also:
- Constant Field Values
-
RP_LINE_TAG
protected static final String RP_LINE_TAG
- See Also:
- Constant Field Values
-
REFERENCE_XREF_TAG
protected static final String REFERENCE_XREF_TAG
- See Also:
- Constant Field Values
-
AUTHORS_TAG
protected static final String AUTHORS_TAG
- See Also:
- Constant Field Values
-
CONSORTIUM_TAG
protected static final String CONSORTIUM_TAG
- See Also:
- Constant Field Values
-
TITLE_TAG
protected static final String TITLE_TAG
- See Also:
- Constant Field Values
-
LOCATION_TAG
protected static final String LOCATION_TAG
- See Also:
- Constant Field Values
-
RC_LINE_TAG
protected static final String RC_LINE_TAG
- See Also:
- Constant Field Values
-
KEYWORDS_TAG
protected static final String KEYWORDS_TAG
- See Also:
- Constant Field Values
-
COMMENT_TAG
protected static final String COMMENT_TAG
- See Also:
- Constant Field Values
-
FEATURE_TAG
protected static final String FEATURE_TAG
- See Also:
- Constant Field Values
-
START_SEQUENCE_TAG
protected static final String START_SEQUENCE_TAG
- See Also:
- Constant Field Values
-
END_SEQUENCE_TAG
protected static final String END_SEQUENCE_TAG
- See Also:
- Constant Field Values
-
lp_uniprot
protected static final Pattern lp_uniprot
-
dp_uniprot
protected static final Pattern dp_uniprot
-
headerLine
protected static final Pattern headerLine
-
-
Constructor Detail
-
UniProtFormat
public UniProtFormat()
-
-
Method Detail
-
canRead
public boolean canRead(File file) throws IOException
Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in UniProt format if its first line matches the UniProt format for the ID line.
- Specified by:
- canRead in interface RichSequenceFormat
- Overrides:
- canRead in class RichSequenceFormat.BasicFormat
- Parameters:
- file - the File to check.
- Returns:
- true if the file is readable by this format, false if not.
- Throws:
- IOException - in case the file is inaccessible.
-
guessSymbolTokenization
public SymbolTokenization guessSymbolTokenization(File file) throws IOException
On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. This implementation always returns a protein tokenizer.
- Specified by:
- guessSymbolTokenization in interface RichSequenceFormat
- Overrides:
- guessSymbolTokenization in class RichSequenceFormat.BasicFormat
- Parameters:
- file - the File object to guess the format of.
- Returns:
- a SymbolTokenization to read the file with.
- Throws:
- IOException - if the file is unrecognisable or inaccessible.
-
canRead
public boolean canRead(BufferedInputStream stream) throws IOException
Check to see if a given stream is in our format. A stream is in UniProt format if its first line matches the UniProt format for the ID line.
- Parameters:
- stream - the BufferedInputStream to check.
- Returns:
- true if the stream is readable by this format, false if not.
- Throws:
- IOException - in case the stream is inaccessible.
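As a rough illustration of the first-line check described for canRead, the following sketch peeks at the first line of a BufferedInputStream and then rewinds it with mark/reset so a parser can still read the stream from the start. It is not the BioJava implementation: the class name IdLineSniffer, the 2000-byte look-ahead limit, and the simplified ID-line pattern are assumptions made for this example, and the actual pattern held in lp_uniprot may differ.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class IdLineSniffer {
    // Hypothetical, simplified pattern for a UniProt ID line such as
    // "ID   CYC_HUMAN   Reviewed;   105 AA." (not the real lp_uniprot field).
    static final Pattern ID_LINE =
            Pattern.compile("^ID\\s+\\S+\\s+\\S+;\\s+\\d+\\s+AA\\.\\s*$");

    // Assumed upper bound on how many bytes we inspect before giving up.
    static final int LOOK_AHEAD = 2000;

    /** Returns true if the first line looks like an ID line; the stream is rewound afterwards. */
    static boolean looksLikeUniProt(BufferedInputStream stream) throws IOException {
        stream.mark(LOOK_AHEAD);
        try {
            StringBuilder first = new StringBuilder();
            int bytesRead = 0;
            int b;
            // Read byte by byte so we never go past the mark's read limit.
            while (bytesRead < LOOK_AHEAD && (b = stream.read()) != -1 && b != '\n') {
                bytesRead++;
                if (b != '\r') {
                    first.append((char) b);
                }
            }
            return ID_LINE.matcher(first).matches();
        } finally {
            // Rewind so the caller can parse the stream from the beginning.
            stream.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "ID   CYC_HUMAN               Reviewed;         105 AA.\nAC   P99999;\n"
                .getBytes(StandardCharsets.US_ASCII);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(looksLikeUniProt(in));
    }
}
```

The rewind is the point of the sketch: a format check that consumed the first line would leave the stream unusable for the parser that follows. A stream without mark support could not be tested this way, which is presumably why the method takes a BufferedInputStream rather than a plain InputStream.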
-
guessSymbolTokenization
public SymbolTokenization guessSymbolTokenization(BufferedInputStream stream) throws IOException
On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the stream. For formats that accept multiple tokenizations, it's up to you how you do it. This implementation always returns a protein tokenizer.
- Parameters:
- stream - the BufferedInputStream object to guess the format of.
- Returns:
- a SymbolTokenization to read the stream with.
- Throws:
- IOException - if the stream is unrecognisable or inaccessible.
-
readSequence
public boolean readSequence(BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener) throws IllegalSymbolException, IOException, ParseException
Read a sequence and pass data on to a SeqIOListener.
- Parameters:
- reader - The stream of data to parse.
- symParser - A SymbolTokenization defining a mapping from character data to Symbols.
- listener - A listener to notify when data is extracted from the stream.
- Returns:
- a boolean indicating whether or not the stream contains any more sequences.
- Throws:
- IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
- IOException - if an error occurs while reading from the stream.
- ParseException
-
readRichSequence
public boolean readRichSequence(BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns) throws IllegalSymbolException, IOException, ParseException
Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. Events are passed to the listener, and the namespace used for sequences read is the one given. If the namespace is null, then the default namespace for the parser is used, which may depend on individual implementations of this interface.
- Parameters:
- reader - the input source
- symParser - the tokenizer which understands the sequence being read
- rlistener - the listener to send sequence events to
- ns - the namespace to read sequences into.
- Returns:
- true if there is more to read after this, false otherwise.
- Throws:
- IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
- IOException - if there was a read error.
- ParseException
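The boolean return value lends itself to a simple read loop: keep calling the reader until it reports that nothing is left. The sketch below mimics that contract with plain JDK classes rather than BioJava, so it only illustrates the calling pattern. The names EntryLoop, readEntry, and readAll are hypothetical, and the only format knowledge assumed is that UniProt flat-file entries end with a line beginning with "//".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class EntryLoop {
    /**
     * Collects the lines of one entry (up to the "//" terminator) into entry.
     * Returns true if more non-blank data follows, false otherwise.
     */
    static boolean readEntry(BufferedReader reader, List<String> entry) throws IOException {
        String line;
        while ((line = reader.readLine()) != null && !line.startsWith("//")) {
            entry.add(line);
        }
        // Peek ahead: skip whitespace, then report whether anything is left.
        while (true) {
            reader.mark(1);
            int c = reader.read();
            if (c == -1) {
                return false;
            }
            if (!Character.isWhitespace(c)) {
                reader.reset(); // put the first character of the next entry back
                return true;
            }
        }
    }

    /** Drives readEntry until it reports that the input is exhausted. */
    static List<List<String>> readAll(BufferedReader reader) throws IOException {
        List<List<String>> entries = new ArrayList<>();
        boolean more = true;
        while (more) {
            List<String> entry = new ArrayList<>();
            more = readEntry(reader, entry);
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        String text = "ID   A\nSQ   x\n//\nID   B\n//\n";
        List<List<String>> entries = readAll(new BufferedReader(new StringReader(text)));
        System.out.println(entries.size() + " entries read");
    }
}
```

With the real class, the same loop shape would call readRichSequence with a tokenizer, a listener, and a namespace in place of the hypothetical readEntry.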
-
writeSequence
public void writeSequence(Sequence seq, PrintStream os) throws IOException
writeSequence writes a sequence to the specified PrintStream, using the default format.
- Parameters:
- seq - the sequence to write out.
- os - the PrintStream to write to.
- Throws:
- IOException
-
writeSequence
public void writeSequence(Sequence seq, String format, PrintStream os) throws IOException
writeSequence writes a sequence to the specified PrintStream, using the specified format.
- Parameters:
- seq - a Sequence to write out.
- format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
- os - a PrintStream object.
- Throws:
- IOException - if an error occurs.
-
writeSequence
public void writeSequence(Sequence seq, Namespace ns) throws IOException
Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class. If a namespace is given, sequences will be written with that namespace; otherwise they will be written with the default namespace of the implementing class (which is usually the namespace of the sequence itself). If you pass this method a sequence which is not a RichSequence, it will attempt to convert it using RichSequence.Tools.enrich(). This does not guarantee a perfect conversion, so it is better to use RichSequences to start with. In this implementation the namespace is ignored, as UniProt has no concept of it.
- Parameters:
- seq - the sequence to write
- ns - the namespace to write it with
- Throws:
- IOException - in case it couldn't write something
-
getDefaultFormat
public String getDefaultFormat()
getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
- Returns:
- a String.
-
-