Package org.biojavax.bio.seq.io
Class UniProtXMLFormat
- java.lang.Object
-
- org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
-
- org.biojavax.bio.seq.io.UniProtXMLFormat
-
- All Implemented Interfaces:
SequenceFormat, RichSequenceFormat
public class UniProtXMLFormat extends RichSequenceFormat.BasicFormat
Format reader for UniProtXML files. This version of the UniProtXML format generates and writes RichSequence objects. Loosely based on code from the old, deprecated org.biojava.bio.seq.io.GenbankXmlFormat object. Understands http://www.ebi.uniprot.org/support/docs/uniprot.xsd
- Since:
- 1.5
- Author:
- Alan Li (code based on his work), Richard Holland
-
-
Nested Class Summary
Nested Classes
static class UniProtXMLFormat.Terms
Implements some UniProtXML-specific terms.
-
Nested classes/interfaces inherited from interface org.biojavax.bio.seq.io.RichSequenceFormat
RichSequenceFormat.BasicFormat, RichSequenceFormat.HeaderlessFormat
-
-
Field Summary
-
Constructor Summary
Constructors
UniProtXMLFormat()
-
Method Summary
void beginWriting()
Informs the writer that we want to start writing.
boolean canRead(BufferedInputStream stream)
Check to see if a given stream is in our format.
boolean canRead(File file)
Check to see if a given file is in our format.
void finishWriting()
Informs the writer that we are done writing.
String getDefaultFormat()
getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
SymbolTokenization guessSymbolTokenization(BufferedInputStream stream)
On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
SymbolTokenization guessSymbolTokenization(File file)
On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
boolean readRichSequence(BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns)
Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols.
boolean readSequence(BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener)
Read a sequence and pass data on to a SeqIOListener.
void writeSequence(Sequence seq, PrintStream os)
writeSequence writes a sequence to the specified PrintStream, using the default format.
void writeSequence(Sequence seq, String format, PrintStream os)
writeSequence writes a sequence to the specified PrintStream, using the specified format.
void writeSequence(Sequence seq, Namespace ns)
Writes a sequence out to the outputstream given by beginWriting() using the default format of the implementing class.
-
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
getElideComments, getElideFeatures, getElideReferences, getElideSymbols, getLineWidth, getPrintStream, setElideComments, setElideFeatures, setElideReferences, setElideSymbols, setLineWidth, setPrintStream
-
-
-
-
Field Detail
-
UNIPROTXML_FORMAT
public static final String UNIPROTXML_FORMAT
The name of this format
- See Also:
- Constant Field Values
-
ENTRY_GROUP_TAG
protected static final String ENTRY_GROUP_TAG
- See Also:
- Constant Field Values
-
ENTRY_TAG
protected static final String ENTRY_TAG
- See Also:
- Constant Field Values
-
ENTRY_VERSION_ATTR
protected static final String ENTRY_VERSION_ATTR
- See Also:
- Constant Field Values
-
ENTRY_NAMESPACE_ATTR
protected static final String ENTRY_NAMESPACE_ATTR
- See Also:
- Constant Field Values
-
ENTRY_CREATED_ATTR
protected static final String ENTRY_CREATED_ATTR
- See Also:
- Constant Field Values
-
ENTRY_UPDATED_ATTR
protected static final String ENTRY_UPDATED_ATTR
- See Also:
- Constant Field Values
-
COPYRIGHT_TAG
protected static final String COPYRIGHT_TAG
- See Also:
- Constant Field Values
-
ACCESSION_TAG
protected static final String ACCESSION_TAG
- See Also:
- Constant Field Values
-
NAME_TAG
protected static final String NAME_TAG
- See Also:
- Constant Field Values
-
TEXT_TAG
protected static final String TEXT_TAG
- See Also:
- Constant Field Values
-
REF_ATTR
protected static final String REF_ATTR
- See Also:
- Constant Field Values
-
TYPE_ATTR
protected static final String TYPE_ATTR
- See Also:
- Constant Field Values
-
KEY_ATTR
protected static final String KEY_ATTR
- See Also:
- Constant Field Values
-
ID_ATTR
protected static final String ID_ATTR
- See Also:
- Constant Field Values
-
EVIDENCE_ATTR
protected static final String EVIDENCE_ATTR
- See Also:
- Constant Field Values
-
VALUE_ATTR
protected static final String VALUE_ATTR
- See Also:
- Constant Field Values
-
STATUS_ATTR
protected static final String STATUS_ATTR
- See Also:
- Constant Field Values
-
NAME_ATTR
protected static final String NAME_ATTR
- See Also:
- Constant Field Values
-
PROTEIN_TAG
protected static final String PROTEIN_TAG
- See Also:
- Constant Field Values
-
PROTEIN_TYPE_ATTR
protected static final String PROTEIN_TYPE_ATTR
- See Also:
- Constant Field Values
-
DOMAIN_TAG
protected static final String DOMAIN_TAG
- See Also:
- Constant Field Values
-
COMPONENT_TAG
protected static final String COMPONENT_TAG
- See Also:
- Constant Field Values
-
GENE_TAG
protected static final String GENE_TAG
- See Also:
- Constant Field Values
-
ORGANISM_TAG
protected static final String ORGANISM_TAG
- See Also:
- Constant Field Values
-
DBXREF_TAG
protected static final String DBXREF_TAG
- See Also:
- Constant Field Values
-
PROPERTY_TAG
protected static final String PROPERTY_TAG
- See Also:
- Constant Field Values
-
LINEAGE_TAG
protected static final String LINEAGE_TAG
- See Also:
- Constant Field Values
-
TAXON_TAG
protected static final String TAXON_TAG
- See Also:
- Constant Field Values
-
GENELOCATION_TAG
protected static final String GENELOCATION_TAG
- See Also:
- Constant Field Values
-
GENELOCATION_NAME_TAG
protected static final String GENELOCATION_NAME_TAG
- See Also:
- Constant Field Values
-
REFERENCE_TAG
protected static final String REFERENCE_TAG
- See Also:
- Constant Field Values
-
CITATION_TAG
protected static final String CITATION_TAG
- See Also:
- Constant Field Values
-
TITLE_TAG
protected static final String TITLE_TAG
- See Also:
- Constant Field Values
-
EDITOR_LIST_TAG
protected static final String EDITOR_LIST_TAG
- See Also:
- Constant Field Values
-
AUTHOR_LIST_TAG
protected static final String AUTHOR_LIST_TAG
- See Also:
- Constant Field Values
-
PERSON_TAG
protected static final String PERSON_TAG
- See Also:
- Constant Field Values
-
CONSORTIUM_TAG
protected static final String CONSORTIUM_TAG
- See Also:
- Constant Field Values
-
LOCATOR_TAG
protected static final String LOCATOR_TAG
- See Also:
- Constant Field Values
-
RP_LINE_TAG
protected static final String RP_LINE_TAG
- See Also:
- Constant Field Values
-
RC_LINE_TAG
protected static final String RC_LINE_TAG
- See Also:
- Constant Field Values
-
RC_SPECIES_TAG
protected static final String RC_SPECIES_TAG
- See Also:
- Constant Field Values
-
RC_TISSUE_TAG
protected static final String RC_TISSUE_TAG
- See Also:
- Constant Field Values
-
RC_TRANSP_TAG
protected static final String RC_TRANSP_TAG
- See Also:
- Constant Field Values
-
RC_STRAIN_TAG
protected static final String RC_STRAIN_TAG
- See Also:
- Constant Field Values
-
RC_PLASMID_TAG
protected static final String RC_PLASMID_TAG
- See Also:
- Constant Field Values
-
COMMENT_TAG
protected static final String COMMENT_TAG
- See Also:
- Constant Field Values
-
COMMENT_MASS_ATTR
protected static final String COMMENT_MASS_ATTR
- See Also:
- Constant Field Values
-
COMMENT_ERROR_ATTR
protected static final String COMMENT_ERROR_ATTR
- See Also:
- Constant Field Values
-
COMMENT_METHOD_ATTR
protected static final String COMMENT_METHOD_ATTR
- See Also:
- Constant Field Values
-
COMMENT_LOCTYPE_ATTR
protected static final String COMMENT_LOCTYPE_ATTR
- See Also:
- Constant Field Values
-
COMMENT_ABSORPTION_TAG
protected static final String COMMENT_ABSORPTION_TAG
- See Also:
- Constant Field Values
-
COMMENT_ABS_MAX_TAG
protected static final String COMMENT_ABS_MAX_TAG
- See Also:
- Constant Field Values
-
COMMENT_KINETICS_TAG
protected static final String COMMENT_KINETICS_TAG
- See Also:
- Constant Field Values
-
COMMENT_KIN_KM_TAG
protected static final String COMMENT_KIN_KM_TAG
- See Also:
- Constant Field Values
-
COMMENT_KIN_VMAX_TAG
protected static final String COMMENT_KIN_VMAX_TAG
- See Also:
- Constant Field Values
-
COMMENT_PH_TAG
protected static final String COMMENT_PH_TAG
- See Also:
- Constant Field Values
-
COMMENT_REDOX_TAG
protected static final String COMMENT_REDOX_TAG
- See Also:
- Constant Field Values
-
COMMENT_TEMPERATURE_TAG
protected static final String COMMENT_TEMPERATURE_TAG
- See Also:
- Constant Field Values
-
COMMENT_LINK_TAG
protected static final String COMMENT_LINK_TAG
- See Also:
- Constant Field Values
-
COMMENT_LINK_URI_ATTR
protected static final String COMMENT_LINK_URI_ATTR
- See Also:
- Constant Field Values
-
COMMENT_EVENT_TAG
protected static final String COMMENT_EVENT_TAG
- See Also:
- Constant Field Values
-
COMMENT_ISOFORM_TAG
protected static final String COMMENT_ISOFORM_TAG
- See Also:
- Constant Field Values
-
COMMENT_INTERACTANT_TAG
protected static final String COMMENT_INTERACTANT_TAG
- See Also:
- Constant Field Values
-
COMMENT_INTERACT_INTACT_ATTR
protected static final String COMMENT_INTERACT_INTACT_ATTR
- See Also:
- Constant Field Values
-
COMMENT_INTERACT_LABEL_TAG
protected static final String COMMENT_INTERACT_LABEL_TAG
- See Also:
- Constant Field Values
-
COMMENT_ORGANISMS_TAG
protected static final String COMMENT_ORGANISMS_TAG
- See Also:
- Constant Field Values
-
COMMENT_EXPERIMENTS_TAG
protected static final String COMMENT_EXPERIMENTS_TAG
- See Also:
- Constant Field Values
-
NOTE_TAG
protected static final String NOTE_TAG
- See Also:
- Constant Field Values
-
KEYWORD_TAG
protected static final String KEYWORD_TAG
- See Also:
- Constant Field Values
-
PROTEIN_EXISTS_TAG
protected static final String PROTEIN_EXISTS_TAG
- See Also:
- Constant Field Values
-
ID_TAG
protected static final String ID_TAG
- See Also:
- Constant Field Values
-
FEATURE_TAG
protected static final String FEATURE_TAG
- See Also:
- Constant Field Values
-
FEATURE_DESC_ATTR
protected static final String FEATURE_DESC_ATTR
- See Also:
- Constant Field Values
-
FEATURE_ORIGINAL_TAG
protected static final String FEATURE_ORIGINAL_TAG
- See Also:
- Constant Field Values
-
FEATURE_VARIATION_TAG
protected static final String FEATURE_VARIATION_TAG
- See Also:
- Constant Field Values
-
EVIDENCE_TAG
protected static final String EVIDENCE_TAG
- See Also:
- Constant Field Values
-
EVIDENCE_CATEGORY_ATTR
protected static final String EVIDENCE_CATEGORY_ATTR
- See Also:
- Constant Field Values
-
EVIDENCE_ATTRIBUTE_ATTR
protected static final String EVIDENCE_ATTRIBUTE_ATTR
- See Also:
- Constant Field Values
-
EVIDENCE_DATE_ATTR
protected static final String EVIDENCE_DATE_ATTR
- See Also:
- Constant Field Values
-
LOCATION_TAG
protected static final String LOCATION_TAG
- See Also:
- Constant Field Values
-
LOCATION_SEQ_ATTR
protected static final String LOCATION_SEQ_ATTR
- See Also:
- Constant Field Values
-
LOCATION_BEGIN_TAG
protected static final String LOCATION_BEGIN_TAG
- See Also:
- Constant Field Values
-
LOCATION_END_TAG
protected static final String LOCATION_END_TAG
- See Also:
- Constant Field Values
-
LOCATION_POSITION_ATTR
protected static final String LOCATION_POSITION_ATTR
- See Also:
- Constant Field Values
-
LOCATION_POSITION_TAG
protected static final String LOCATION_POSITION_TAG
- See Also:
- Constant Field Values
-
SEQUENCE_TAG
protected static final String SEQUENCE_TAG
- See Also:
- Constant Field Values
-
SEQUENCE_VERSION_ATTR
protected static final String SEQUENCE_VERSION_ATTR
- See Also:
- Constant Field Values
-
SEQUENCE_LENGTH_ATTR
protected static final String SEQUENCE_LENGTH_ATTR
- See Also:
- Constant Field Values
-
SEQUENCE_MASS_ATTR
protected static final String SEQUENCE_MASS_ATTR
- See Also:
- Constant Field Values
-
SEQUENCE_CHECKSUM_ATTR
protected static final String SEQUENCE_CHECKSUM_ATTR
- See Also:
- Constant Field Values
-
SEQUENCE_MODIFIED_ATTR
protected static final String SEQUENCE_MODIFIED_ATTR
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
UniProtXMLFormat
public UniProtXMLFormat()
-
-
Method Detail
-
canRead
public boolean canRead(File file) throws IOException
Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in UniProtXML format if the second XML line contains the phrase "http://www.uniprot.org/support/docs/uniprot.xsd".
- Specified by:
canRead in interface RichSequenceFormat
- Overrides:
canRead in class RichSequenceFormat.BasicFormat
- Parameters:
file - the File to check.
- Returns:
- true if the file is readable by this format, false if not.
- Throws:
IOException - in case the file is inaccessible.
-
guessSymbolTokenization
public SymbolTokenization guessSymbolTokenization(File file) throws IOException
On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a protein tokenizer.
- Specified by:
guessSymbolTokenization in interface RichSequenceFormat
- Overrides:
guessSymbolTokenization in class RichSequenceFormat.BasicFormat
- Parameters:
file - the File object to guess the format of.
- Returns:
- a SymbolTokenization to read the file with.
- Throws:
IOException - if the file is unrecognisable or inaccessible.
-
canRead
public boolean canRead(BufferedInputStream stream) throws IOException
Check to see if a given stream is in our format. A stream is in UniProtXML format if the second XML line contains the phrase "http://www.uniprot.org/support/docs/uniprot.xsd".
- Parameters:
stream - the BufferedInputStream to check.
- Returns:
- true if the stream is readable by this format, false if not.
- Throws:
IOException - in case the stream is inaccessible.
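The second-line rule described for canRead can be sketched with the standard library alone. This is an illustration of the rule, not the BioJava implementation: the class UniProtXmlSniffer, the method looksLikeUniProtXml and the 2000-byte peek limit are all hypothetical names and values chosen for the example.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper illustrating the detection rule; not part of BioJava. */
final class UniProtXmlSniffer {
    // Schema reference quoted in the canRead description.
    static final String SCHEMA_HINT = "http://www.uniprot.org/support/docs/uniprot.xsd";
    // Assumption: the first two lines fit within this many bytes.
    private static final int PEEK_LIMIT = 2000;

    /** Returns true if the second line of the stream mentions the schema; rewinds the stream afterwards. */
    static boolean looksLikeUniProtXml(BufferedInputStream stream) throws IOException {
        stream.mark(PEEK_LIMIT);
        try {
            // Read raw bytes so that we never consume more than the mark allows.
            byte[] head = new byte[PEEK_LIMIT];
            int total = 0;
            int n;
            while (total < PEEK_LIMIT
                    && (n = stream.read(head, total, PEEK_LIMIT - total)) > 0) {
                total += n;
            }
            String text = new String(head, 0, total, StandardCharsets.UTF_8);
            String[] lines = text.split("\\r?\\n", 3);
            return lines.length >= 2 && lines[1].contains(SCHEMA_HINT);
        } finally {
            // Leave the stream positioned where the caller handed it over.
            stream.reset();
        }
    }
}
```

Because the stream is marked before reading and reset afterwards, a caller can test the stream and then pass the same stream to a reader. Reading through a BufferedReader instead would be simpler but may read past the mark limit, which is why the sketch peeks at raw bytes.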
-
guessSymbolTokenization
public SymbolTokenization guessSymbolTokenization(BufferedInputStream stream) throws IOException
On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the stream. For formats that accept multiple tokenizations, it's up to you how you do it. Always returns a protein tokenizer.
- Parameters:
stream - the BufferedInputStream object to guess the format of.
- Returns:
- a SymbolTokenization to read the stream with.
- Throws:
IOException - if the stream is unrecognisable or inaccessible.
-
readSequence
public boolean readSequence(BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener) throws IllegalSymbolException, IOException, ParseException
Read a sequence and pass data on to a SeqIOListener.
- Parameters:
reader - The stream of data to parse.
symParser - A SymbolParser defining a mapping from character data to Symbols.
listener - A listener to notify when data is extracted from the stream.
- Returns:
- a boolean indicating whether or not the stream contains any more sequences.
- Throws:
IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
IOException - if an error occurs while reading from the stream.
ParseException
-
readRichSequence
public boolean readRichSequence(BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns) throws IllegalSymbolException, IOException, ParseException
Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. Events are passed to the listener, and the namespace used for sequences read is the one given. If the namespace is null, then the default namespace for the parser is used, which may depend on individual implementations of this interface. In this implementation, if the namespace is null, then the namespace of the sequence in the file is used. If the namespace is null and so is the namespace of the sequence in the file, then the default namespace is used.
- Parameters:
reader - the input source
symParser - the tokenizer which understands the sequence being read
rlistener - the listener to send sequence events to
ns - the namespace to read sequences into.
- Returns:
- true if there is more to read after this, false otherwise.
- Throws:
IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
IOException - if there was a read error.
ParseException
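One reading of the namespace rule for readRichSequence is a three-step fallback: the namespace passed by the caller, then the namespace carried by the entry itself, then a default. The sketch below models namespaces as plain strings to stay self-contained. NamespaceFallback and resolve are hypothetical names, and the real method works with Namespace objects rather than strings.

```java
import java.util.Objects;

/** Hypothetical illustration of the namespace fallback order; not part of BioJava. */
final class NamespaceFallback {
    /**
     * Picks the namespace to use for a sequence being read.
     *
     * @param requested     namespace passed by the caller, may be null
     * @param embedded      namespace found in the entry being read, may be null
     * @param parserDefault namespace to fall back on, must not be null
     */
    static String resolve(String requested, String embedded, String parserDefault) {
        if (requested != null) {
            return requested;          // an explicit namespace always wins
        }
        if (embedded != null) {
            return embedded;           // otherwise use what the entry declares
        }
        return Objects.requireNonNull(parserDefault, "parserDefault");
    }
}
```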
-
beginWriting
public void beginWriting() throws IOException
Informs the writer that we want to start writing. This will do any initialisation required, such as writing the opening tags of an XML file that groups sequences together.
- Throws:
IOException - if writing fails.
-
finishWriting
public void finishWriting() throws IOException
Informs the writer that we are done writing. This will do any finalisation required, such as writing the closing tags of an XML file that groups sequences together.
- Throws:
IOException - if writing fails.
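The beginWriting()/finishWriting() pair brackets any number of sequence writes so that all entries share one enclosing XML element. A minimal stand-alone sketch of that contract follows. GroupedXmlWriter and writeEntry are hypothetical, and the tag names "uniprot" and "entry" are assumptions for illustration, since this page does not list the values of ENTRY_GROUP_TAG and ENTRY_TAG.

```java
import java.io.PrintStream;

/** Hypothetical writer showing the begin / write / finish lifecycle; not part of BioJava. */
final class GroupedXmlWriter {
    // Assumed tag names, for illustration only.
    private static final String GROUP_TAG = "uniprot";
    private static final String ENTRY_TAG = "entry";

    private final PrintStream out;
    private boolean open;

    GroupedXmlWriter(PrintStream out) {
        this.out = out;
    }

    /** Writes the XML declaration and the opening tag of the group element. */
    void beginWriting() {
        if (open) {
            throw new IllegalStateException("beginWriting() already called");
        }
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<" + GROUP_TAG + ">");
        open = true;
    }

    /** Writes one entry; assumes the accession contains no XML special characters. */
    void writeEntry(String accession) {
        if (!open) {
            throw new IllegalStateException("call beginWriting() first");
        }
        out.println("  <" + ENTRY_TAG + "><accession>" + accession
                + "</accession></" + ENTRY_TAG + ">");
    }

    /** Writes the closing tag of the group element and flushes the stream. */
    void finishWriting() {
        if (!open) {
            throw new IllegalStateException("nothing to finish");
        }
        out.println("</" + GROUP_TAG + ">");
        out.flush();
        open = false;
    }
}
```

Skipping either call would leave the output without its enclosing element or without its closing tag, which is why the sketch rejects writes made outside the begin/finish bracket.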
-
writeSequence
public void writeSequence(Sequence seq, PrintStream os) throws IOException
writeSequence writes a sequence to the specified PrintStream, using the default format.
- Parameters:
seq - the sequence to write out.
os - the printstream to write to.
- Throws:
IOException
-
writeSequence
public void writeSequence(Sequence seq, String format, PrintStream os) throws IOException
writeSequence writes a sequence to the specified PrintStream, using the specified format.
- Parameters:
seq - a Sequence to write out.
format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
os - a PrintStream object.
- Throws:
IOException - if an error occurs.
-
writeSequence
public void writeSequence(Sequence seq, Namespace ns) throws IOException
Writes a sequence out to the outputstream given by beginWriting() using the default format of the implementing class. If namespace is given, sequences will be written with that namespace, otherwise they will be written with the default namespace of the implementing class (which is usually the namespace of the sequence itself). If you pass this method a sequence which is not a RichSequence, it will attempt to convert it using RichSequence.Tools.enrich(). Obviously this is not going to guarantee a perfect conversion, so it's better if you just use RichSequences to start with! If namespace is null, then the sequence's own namespace is used.
- Parameters:
seq - the sequence to write
ns - the namespace to write it with
- Throws:
IOException - in case it couldn't write something
-
getDefaultFormat
public String getDefaultFormat()
getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
- Returns:
- a String.
-
-