BioJava:BioJavaXDocs

BioJavaX is not BioJava 2 is not BioJavaX.

BioJavaX is an extension to the existing BioJava project. Anything written with BioJava will work with BioJavaX, and vice versa.

org.biojavax is to org.biojava as javax is to java.

The BioJava2 project is a completely new project which intends to rewrite everything in BioJava from scratch, based around a new set of object designs and concepts. It is entirely incompatible with the existing BioJava project.

Therefore BioJavaX is not BioJava 2, and has nothing to do with it. Please don't get them confused!

What didn't change?

Existing interfaces.

Backwards-compatibility is always an issue when a major new version of a piece of software is released.

BioJavaX addresses this by keeping all the new classes and interfaces tucked away inside their own special package, org.biojavax. None of the existing interfaces were modified in any way, so any code which depends on them will not see any difference.

Apart from ongoing bugfixes, the way in which the existing classes work also has not changed.

The new interfaces introduced in BioJavaX extend those present in the existing BioJava packages. This allows new BioJavaX-derived objects to be passed to legacy code and still be understood.

Change listeners.

BioJava's change listener model is intact and unchanged. The new BioJavaX classes define a set of extra change types which they fire in addition to the ones generated by existing BioJava classes.

This means that existing change listeners can be attached to BioJavaX-derived objects and still receive all the information they would normally receive.

Event-based file parsing.

BioJavaX still uses event-based file parsing to read and write files, in exactly the same way as the old BioJava classes did.

However, you cannot use existing event listeners with the new BioJavaX file parsers. You must alter the listeners to implement the new org.biojavax.bio.seq.io.RichSeqIOListener interface instead.


What did change?

System requirements.

Java 1.4 is required for all BioJavaX packages.

Rich interfaces.

BioJavaX defines a new set of interfaces for working with sequence objects. These interfaces are closely modelled on the BioSQL 1.0 schema.

The new interfaces extend existing interfaces wherever possible, in order to allow backwards-compatibility with legacy code. These interfaces are known as rich interfaces, as they could be said to be 'enriched' versions of the interfaces that they extend.

Instances of implementing classes are known as rich objects, with legacy instances known as plain ones.

Here is a list of the new rich interfaces:

   ComparableOntology (extends Ontology)
   ComparableTerm (extends Term)
   ComparableTriple (extends Triple)
   RichSequenceIterator (extends SequenceIterator)
   RichSequence (extends Sequence)
   RichLocation (extends Location)
   RichFeature (extends StrandedFeature)
   RichFeatureHolder (extends FeatureHolder)
   RichAnnotatable (extends Annotatable)
   RichAnnotation (extends Annotation)
   BioSQLFeatureFilter (extends FeatureFilter)
   RichSequenceDB (extends SequenceDB)

Wherever possible in BioJavaX, conversions are attempted if a method expecting a rich object receives a plain one. You can perform these conversions yourself by using the nested Tools class of the appropriate rich interface. For example, to convert an old Sequence object into a new RichSequence object, you can do this:

Sequence s = ...; // get an old Sequence object from somewhere
RichSequence rs = RichSequence.Tools.enrich(s);

The conversion process does its best, but it is not perfect. Much of the way information is stored in the new BioJavaX object model is fundamentally incompatible with the old object model. So it's always best to deal with RichSequence objects from the outset and to avoid instantiating older Sequence objects as far as possible.
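The following self-contained sketch illustrates the idea behind rich interfaces and an enrich-style conversion. Every name in it (PlainSeq, RichSeq, EnrichSketch) is a hypothetical stand-in written for this illustration, not a BioJava or BioJavaX class, and the way it fills in missing values (default namespace name "lcl", version 0) is an assumption rather than a description of what RichSequence.Tools.enrich() actually does.

```java
/** Hypothetical stand-in for a legacy interface such as Sequence. */
interface PlainSeq {
    String getName();
    String seqString();
}

/** Hypothetical stand-in for a rich interface: it extends the plain one. */
interface RichSeq extends PlainSeq {
    String getNamespace();
    int getVersion();
}

class EnrichSketch {
    /** Legacy code only knows about PlainSeq, yet it accepts rich objects too. */
    static int legacyLength(PlainSeq s) {
        return s.seqString().length();
    }

    /** Builds a plain object for demonstration purposes. */
    static PlainSeq plain(final String name, final String seq) {
        return new PlainSeq() {
            public String getName() { return name; }
            public String seqString() { return seq; }
        };
    }

    /** Returns rich input unchanged; wraps plain input using assumed defaults. */
    static RichSeq enrich(final PlainSeq s) {
        if (s instanceof RichSeq) {
            return (RichSeq) s; // already rich, nothing to convert
        }
        return new RichSeq() {
            public String getName() { return s.getName(); }
            public String seqString() { return s.seqString(); }
            public String getNamespace() { return "lcl"; } // assumed default
            public int getVersion() { return 0; }           // assumed default
        };
    }

    public static void main(String[] args) {
        RichSeq rich = enrich(plain("seqA", "ACGT"));
        // A rich object can be handed to code written against the plain interface.
        System.out.println(rich.getName() + " has length " + legacyLength(rich));
    }
}
```

The wrapper can only invent defaults for the fields the plain object never had, which is one way to see why converted objects carry less information than objects that were rich from the start.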

Other new interfaces define new concepts, or replace old interfaces entirely due to a fundamental clash in the way they see the world. Here is a list:

   NCBITaxon
   BioEntry
   RichObjectBuilder
   RichSequenceHandler
   Comment
   CrossRef
   CrossReferenceResolver
   DocRef
   DocRefAuthor
   Namespace
   Note
   RankedCrossRef
   RankedCrossRefable
   RankedDocRef
   BioEntryRelationship
   Position
   PositionResolver
   RichFeatureRelationship
   BioEntryDB


BioSQL persistence.

BioJavaX introduces a whole new way of working with BioSQL databases.

Instead of attempting to re-invent the wheel with yet another new object-relational mapping system, BioJavaX uses the services of Hibernate to do all the dirty work for it. In fact, there is not a single SQL statement anywhere in the BioJavaX code.

The use of Hibernate allows users to have as much or as little control as they like over transactions and query optimisation. The Hibernate query language, HQL, is simple to learn and easy to use.

You can find out more about the Hibernate project at their website: www.hibernate.org/

Better file parsers.

The old BioJava file parsers worked, in the sense that they loaded all the information into memory, but they made little attempt to understand the contents of the files, and they often failed when trying to convert between formats.

The new parsers supplied with BioJavaX put a lot of effort into trying to fit data from the myriad of file formats out there into a form representable by BioSQL, and hence by the new BioJavaX object model. Of course this isn't always possible, but they do a much better job than the old ones.

By parsing data into a fixed object model instead of storing everything as annotations (as was the case, for instance, with the old SwissProt parsers), conversion between file formats becomes much easier.

The new file parsers also allow you to skip uninteresting parts of the file altogether, greatly speeding up simple tasks such as counting the number of sequences in a file.
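To see why skipping parts of a file helps with a task like counting sequences, here is a minimal plain-Java sketch. It does not use the BioJavaX parsers at all: it assumes FASTA-style input in which each record starts with a ">" header line, and the class name FastaCountSketch is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class FastaCountSketch {
    /** Counts records by looking only at header lines; sequence lines are never interpreted. */
    static int countRecords(Reader in) {
        int count = 0;
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(">")) {
                    count++; // a new record starts here
                }
                // every other line is skipped without building any objects
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String fasta = ">seq1\nACGT\nACGT\n>seq2\nGGCC\n";
        System.out.println(countRecords(new StringReader(fasta)));
    }
}
```

No symbols are tokenized and no feature or annotation objects are created, which is where most of the time goes in a full parse.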

NCBI Taxonomy loader.

A parser is provided for loading the NCBI Taxonomy database into a set of BioJavaX NCBITaxon objects. This parser reads the nodes.dmp and names.dmp files supplied by NCBI and constructs the appropriate hierarchy of objects. If you are using BioSQL, it can persist this hierarchy to the database as it goes.
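As a rough illustration of what building such a hierarchy involves, the sketch below reads the first two columns (taxon ID and parent taxon ID) from text laid out like the NCBI nodes dump, where columns are separated by a tab, a vertical bar and a tab. It is not the BioJavaX loader: the class TaxonNodesSketch is hypothetical, the sample rows use made-up IDs, and the real file has many more columns.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TaxonNodesSketch {
    /** Maps each taxon ID to its parent taxon ID. */
    static Map<Integer, Integer> parseParents(String nodesDump) {
        Map<Integer, Integer> parents = new HashMap<>();
        for (String line : nodesDump.split("\n")) {
            String[] fields = line.split("\\|");
            if (fields.length < 2) {
                continue; // blank or malformed line
            }
            int taxId = Integer.parseInt(fields[0].trim());
            int parentId = Integer.parseInt(fields[1].trim());
            parents.put(taxId, parentId);
        }
        return parents;
    }

    /** Walks from a taxon up to the root, which is recorded as its own parent. */
    static List<Integer> lineage(Map<Integer, Integer> parents, int taxId) {
        List<Integer> path = new ArrayList<>();
        Integer current = taxId;
        while (current != null && !path.contains(current)) {
            path.add(current);
            current = parents.get(current);
        }
        return path;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only.
        String sample = "1\t|\t1\t|\tno rank\t|\n"
                      + "10\t|\t1\t|\tgenus\t|\n"
                      + "100\t|\t10\t|\tspecies\t|\n";
        System.out.println(lineage(parseParents(sample), 100));
    }
}
```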

Namespaces.

All sequences in BioJavaX must belong to a namespace.

Singletons.

BioJavaX tries to use singletons as far as possible. This is:

  • to reduce memory usage.
  • to prevent problems with duplicate keys when persisting to BioSQL.

The singletons are kept in an LRU cache managed by a RichObjectFactory. See the chapter on this subject later in this book.

Genetic algorithms.

BioJavaX introduces a new package for working with genetic algorithms.

Future plans.

BioPerl and BioPerl-DB compatibility.

We tried our best to store sequence data into BioSQL in the same way as BioPerl-DB does. We also tried to parse files in such a way that data from files would end up in the same place in BioSQL as if it had been parsed using the BioPerl file parsers then persisted using BioPerl-DB.

However, we may not have been entirely successful, particularly with regard to the naming conventions of annotations and feature qualifiers, and the use of the document and publication cross-reference tables. Likewise, our definition of fuzzy locations may differ.

So, we intend in the future to try and consolidate our efforts with those of the BioPerl and BioPerl-DB projects, along with any of the other Bio* projects who provide BioSQL persistence functionality, so that we can all read and write data to and from BioSQL in the same way.

The goal is to be able to read a file with any of the Bio* projects, persist it to the database, then read it back from the database using any of the other Bio* projects and write it out to file. The input and output files should be logically identical (give or take some minor layout or formatting issues).

Help is needed!

Efficient parsing.

The event-based parser model works great, but our implementations of actual file parsing code may leave a lot to be desired in terms of efficient use of memory or minimising the number of uses of markers in the input stream.

If you are an IO, parsing, or code optimisation guru, you would be most welcome to come have a look and speed things up a bit.

More file formats supported.

We've provided parsers (and writers) for all the major formats we thought would be necessary. But there are only two of us, and it takes a while to trawl through the documentation for each format and try to shoehorn it all into the BioSQL model, even before the actual coding begins.

If there's a format you like and use daily and you think would be of use to others, but you can't find it in BioJavaX, then please do write a parser for it and contribute it to the project.

Persistence to non-BioSQL databases.

Basically, right now, you can't. We have only provided Hibernate mappings for BioSQL.

There is no reason though why you can't write a new set of Hibernate XML mapping files that map the BioJavaX objects into tables in some other database format. Because of the way Hibernate works, you wouldn't have to change any of the BioJavaX code at all, only the mapping files that tell Hibernate how to translate between objects and tables.

If you do, and you think someone else could benefit from your work, please consider contributing them to the BioJava project for everyone to enjoy.

Java 1.5 and Generics.

Much discussion has occurred recently about upgrading BioJava to use features only available since version 1.5 of Java (also known as Java 5). Mostly we are considering the use of generics.

A lot of this started after some Java 1.5 features accidentally slipped into the biojava-live CVS branch one day and suddenly nobody using older JVMs could compile it any more. These were quickly removed, and it was agreed to wait a while before a decision was made about the ultimate use of such features.

Java 1.5 offers a lot of features that would be very useful in BioJava, and has the potential to greatly reduce the size of the project's codebase. However, 1.5 compilers and runtime environments are not available for some platforms yet, and in other situations companies are reluctant to upgrade when they have already settled on 1.4 as their tested and accepted Java environment.

So, we won't do it yet, but we would definitely like to change in future.

Singletons and the RichObjectFactory.

Using RichObjectFactory.

BioJavaX revolves around the use of singleton instances. This is important to keep memory usage down, and becomes even more important when working with BioSQL databases via Hibernate to prevent duplicate records in tables. Singletons are generated in a singleton factory.

RichObjectFactory is a caching singleton factory. If you request lots of instances of the same class, the oldest ones are forgotten, and you will get a new instance the next time you ask for one of them. This is to prevent memory blowouts. The default size of this LRU cache is 20 instances of each class.

Singletons are only important when dealing with certain classes:

     SimpleNamespace
     SimpleComparableOntology
     SimpleNCBITaxon
     SimpleCrossRef
     SimpleDocRef

In all other cases, you don't need to worry about singletons. In fact, the singleton factory may complain if you try to ask it to make a singleton of any class not listed above.

To generate a new instance of any of the above, you must use the RichObjectFactory. This tool checks an LRU cache to see if you have requested an identical instance recently. If you have, it returns that instance (a singleton). If you haven't, then it creates the instance, adds it to the LRU cache, then returns it.
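The check-then-create behaviour described above can be sketched in a few lines of plain Java. This is not the RichObjectFactory implementation: LruSingletonSketch is a hypothetical class, it keys instances by a single value rather than by class and constructor parameters, and it uses current Java (generics and lambdas) rather than the Java 1.4 that BioJavaX targets.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical caching singleton factory backed by an LRU map. */
class LruSingletonSketch<K, V> {
    private final Map<K, V> cache;
    private final Function<K, V> builder;

    LruSingletonSketch(final int maxSize, Function<K, V> builder) {
        this.builder = builder;
        // accessOrder=true keeps the least recently used entry first.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize; // drop the oldest entry once full
            }
        };
    }

    /** Returns the cached instance if present, otherwise builds and caches one. */
    V getObject(K key) {
        V existing = cache.get(key);
        if (existing != null) {
            return existing;
        }
        V created = builder.apply(key);
        cache.put(key, created);
        return created;
    }

    int cachedCount() {
        return cache.size();
    }

    public static void main(String[] args) {
        LruSingletonSketch<String, Object> factory =
                new LruSingletonSketch<>(2, name -> new Object());
        Object first = factory.getObject("a");
        System.out.println(first == factory.getObject("a")); // same instance while cached
        factory.getObject("b");
        factory.getObject("c"); // the cache holds 2 entries, so "a" is evicted
        System.out.println(first == factory.getObject("a")); // a fresh instance this time
    }
}
```

The second comparison shows the caveat of an in-memory LRU cache: once an entry has been evicted, the next request silently produces a different object.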

The parameters you supply to the RichObjectFactory are a class name, and an array of parameters which you would normally have passed directly to that class' constructor. Here is a list of the parameters required, and an example, for each of the classes accepted by the current factory:

Table 5.1. RichObjectFactory singleton examples.

Object: SimpleNamespace
Parameters: [name (String)]
Example:
Namespace ns = (Namespace)RichObjectFactory.getObject(SimpleNamespace.class,new Object[]{"myNamespace"});

Object: SimpleComparableOntology
Parameters: [name (String)]
Example:
ComparableOntology ont = (ComparableOntology)RichObjectFactory.getObject(SimpleComparableOntology.class,new Object[]{"myOntology"});

Object: SimpleNCBITaxon
Parameters: [taxID (Integer)]
Example:
Integer taxID = new Integer(12345);
NCBITaxon tax = (NCBITaxon)RichObjectFactory.getObject(SimpleNCBITaxon.class,new Object[]{taxID});

Object: SimpleCrossRef
Parameters: [databaseName (String), accession (String), version (Integer)]
Example:
Integer version = new Integer(0);
CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]{"PUBMED","56789",version});

Object: SimpleDocRef
Parameters: [authors (List of DocRefAuthor), location (String)]
Example:
DocRefAuthor author = new SimpleDocRefAuthor("Bloggs,J.");
List authors = new ArrayList();
authors.add(author);
DocRef dr = (DocRef)RichObjectFactory.getObject(SimpleDocRef.class,new Object[]{authors,"Journal of Voodoo Virology, 2005, 23:55-57"});

Where the singletons come from.

The actual instances of the classes requested are generated using a RichObjectBuilder. The default RichObjectBuilder, SimpleRichObjectBuilder, uses introspection to call the constructors on the classes and create new instances. You do not need to do anything to set this up.

If you do decide to write your own RichObjectBuilder for whatever reason, you can set it to be used by RichObjectFactory like this:

RichObjectBuilder builder = ...; // create your own one here
RichObjectFactory.setRichObjectBuilder(builder); // make the factory use it from now on 

If you change the default RichObjectBuilder to a different one, you must do so at the very beginning of your program before any call to the RichObjectFactory has been made. This is because when the builder is changed, existing singletons or default instances are not removed. If you do not follow this guideline, you will end up with a mix of objects in the cache created by two different builders, which could lead to unpredictable behaviour.

Hibernate singletons.

When working with Hibernate, you must connect BioJavaX to Hibernate by calling RichObjectFactory.connectToBioSQL(session) and passing it your session object. When using this, instances are looked up in the underlying BioSQL database first to see if they exist. If they do, they are loaded and returned. If not, they are created, then returned.

The instances returned by RichObjectFactory when connected to Hibernate are guaranteed true singletons and will never be duplicated even if you fill up the LRU cache several times between requests.

You can replicate the behaviour of RichObjectFactory.connectToBioSQL(session) by instantiating BioSQLRichObjectBuilder and BioSQLCrossReferenceResolver objects and passing these to the appropriate methods in RichObjectFactory.

See the section on BioSQL and Hibernate later in this document for more details.

Managing the LRU cache.

By default, the LRU cache keeps the 20 most recently requested instances of any given class in memory. If more than 20 objects are requested, the oldest ones are removed from the cache before the new ones are added. This keeps memory usage at a minimum.

If you are experiencing problems with duplicate instances when you expected singletons, or believe that a larger or smaller cache may help the performance of your application, then you can change the size of the LRU cache. There are two ways of doing this.

Changes to the LRU cache size are not instantaneous. The size of the cache only changes physically next time an instance is requested from it. Even then, only the cache of instances of the class requested will actually change.

Global LRU cache size.

Changing the global LRU cache size applies the new size to every class. The next time any of those classes is accessed via the RichObjectFactory, the LRU cache for that class will adjust to the new size.

RichObjectFactory.setLRUCacheSize(50); // increases the global LRU cache size to 50 instances per class 

Class-specific LRU cache size.

Changing the LRU cache size for a specific class will only affect that class. Your class-specific settings will be lost if you later change the global LRU cache size.

RichObjectFactory.setLRUCacheSize(SimpleNamespace.class, 50); // increases the LRU cache for SimpleNamespace instances to 50
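The interaction between the two settings can be pictured with the following plain-Java sketch. CacheSizeSettingsSketch is a hypothetical class that only models the bookkeeping described in this section (a global default of 20, per-class overrides, and overrides being discarded by a later global change); it says nothing about how RichObjectFactory stores these values internally.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of global and class-specific cache size settings. */
class CacheSizeSettingsSketch {
    private int globalSize = 20; // default size stated in this section
    private final Map<Class<?>, Integer> perClass = new HashMap<>();

    /** A global change replaces every class-specific setting. */
    void setLRUCacheSize(int size) {
        globalSize = size;
        perClass.clear();
    }

    /** A class-specific change affects that class only. */
    void setLRUCacheSize(Class<?> clazz, int size) {
        perClass.put(clazz, size);
    }

    /** The size that would apply the next time this class is requested. */
    int sizeFor(Class<?> clazz) {
        return perClass.getOrDefault(clazz, globalSize);
    }

    public static void main(String[] args) {
        CacheSizeSettingsSketch settings = new CacheSizeSettingsSketch();
        settings.setLRUCacheSize(String.class, 50);
        System.out.println(settings.sizeFor(String.class));  // class-specific value
        System.out.println(settings.sizeFor(Integer.class)); // global value
        settings.setLRUCacheSize(30);
        System.out.println(settings.sizeFor(String.class));  // override has been lost
    }
}
```

If you need both a global and a class-specific size, set the global size first and the class-specific size afterwards.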

Convenience methods

A number of convenience methods are provided by the RichObjectFactory to allow easy access to some useful default singletons:

RichObjectFactory convenience methods.

Name of method
    Use

void setDefaultNamespaceName(String name);
    Sets the name of the default namespace. This namespace is used when loading files which have no namespace information of their own, and when no namespace has been passed to the file loading routines. It can also be used when creating temporary RichSequence or BioEntry objects, as the namespace parameter is compulsory on these objects.

Namespace getDefaultNamespace();
    Returns the default namespace singleton instance (delegates to getObject()).

void setDefaultOntologyName(String name);
    Sets the name of the default ontology. When parsing files, new terms are often created. If the file format does not have an ontology of its own, then it will use the default ontology to store these terms. Terms commonly used throughout BioJavaX, including those common to all file formats, are also stored in the default ontology.

ComparableOntology getDefaultOntology();
    Returns the default ontology singleton instance (delegates to getObject()).

void setDefaultPositionResolver(PositionResolver pr);
    When converting fuzzy locations into actual physical locations, a PositionResolver instance is used. The default one is AveragePositionResolver, which averages out the range of fuzziness to provide a value somewhere in the middle. You can override this setting using this function. All locations that are resolved without explicitly specifying a PositionResolver to use will then use this resolver to do the work.

PositionResolver getDefaultPositionResolver();
    Returns the default position resolver.

void setDefaultCrossReferenceResolver(CrossReferenceResolver cr);
    CrossRef instances are links to other databases. When a CrossRef is used in a RichLocation instance, it means that to obtain the symbols (sequence) for that location, it must first retrieve the remote sequence object. The CrossReferenceResolver object specified using this method is used to carry this out. The default implementation of this interface is DummyCrossReferenceResolver, which always returns infinitely ambiguous symbol lists and cannot look up any remote sequence objects. If you are using Hibernate, use BioSQLCrossReferenceResolver instead (or call RichObjectFactory.connectToBioSQL(session)), as it is able to actually look up the sequences (if they exist in your database).

CrossReferenceResolver getDefaultCrossReferenceResolver();
    Returns the default cross reference resolver.

void setDefaultRichSequenceHandler(RichSequenceHandler rh);
    Calls to RichSequence methods which reference sequence data will delegate to this handler to carry the requests out. The default implementation is a DummyRichSequenceHandler, which just uses the internal SymbolList of the RichSequence to look up the data. When this is set to a BioSQLRichSequenceHandler, the handler will go to the database to look up the information instead of keeping an in-memory copy of it.

RichSequenceHandler getDefaultRichSequenceHandler();
    Returns the default rich sequence handler.

void connectToBioSQL(Object session);
    Instantiates BioSQLCrossReferenceResolver, BioSQLRichObjectBuilder and BioSQLRichSequenceHandler using the Hibernate session object provided, and sets these objects as the default instances. After this call, the factory will try to look up all object requests in the underlying database first.
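To make the idea of averaging out a fuzzy range concrete, here is a minimal sketch of a midpoint calculation. AveragePositionSketch is a hypothetical class, and both its signature and its rounding (integer midpoint, rounded down) are assumptions; the exact rules used by AveragePositionResolver are not specified in this document.

```java
/** Hypothetical midpoint resolver for a fuzzy range of coordinates. */
class AveragePositionSketch {
    /** Resolves the fuzzy range [start, end] to a single coordinate in the middle. */
    static int resolve(int start, int end) {
        // Written this way to avoid overflow for very large coordinates.
        return start + (end - start) / 2;
    }

    public static void main(String[] args) {
        // A position known only to lie somewhere between 10 and 20.
        System.out.println(resolve(10, 20));
    }
}
```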

Default settings.

The default namespace name is lcl.

The default ontology name is biojavax.

The default LRU cache size is 20.

The default position resolver is AveragePositionResolver.

The default cross reference resolver is DummyCrossReferenceResolver.

The default rich sequence handler is DummyRichSequenceHandler.


Working with sequences.

Creating sequences.

BioJavaX has a two-tier definition of sequence data.

BioEntry objects correspond to the bioentry table in BioSQL. They do not have any sequence information, and neither do they have any features. They can, however, be annotated, commented, and put into relationships with each other. They can also have cross-references to publications and other databases associated with them.

RichSequence objects extend BioEntry objects by adding in sequence data and a feature table.

So, when to use them?

  • BioEntry objects are most useful when performing simple operations such as counting sequences, checking taxonomy data, looking up accessions, or finding out things like which objects refer to a particular PUBMED entry.
  • RichSequence objects are useful only when you need access to the sequence data itself, or to the sequence feature table.
  • RichSequence objects must be used whenever you wish to pass objects to legacy code that is expecting Sequence objects, as only RichSequence objects implement the Sequence interface. BioEntry objects do not.

Throughout the rest of this document, both BioEntry and RichSequence objects will be referred to interchangeably as sequence objects.

To create a BioEntry object, you need to have at least the following information:

  • a Namespace instance to associate the sequence with (use RichObjectFactory.getDefaultNamespace() for an easy way out)
  • a name for the sequence
  • an accession for the sequence
  • a version for the sequence (use 0 if you don't want to bother with versions)

To create a RichSequence object, you need to have all the above plus:

  • a SymbolList containing the sequence data
  • a version for the sequence data (this is separate from the version of the sequence object)

Multiple accessions

If you wish to assign multiple accessions to a sequence, you must do so using the special term provided, like this:

ComparableTerm accTerm = RichSequence.Terms.getAdditionalAccessionTerm();
Note accession1 = new SimpleNote(accTerm,"A12345",1); // this note has an arbitrary rank of 1
Note accession2 = new SimpleNote(accTerm,"Z56789",2); // this note has an arbitrary rank of 2
...
RichSequence rs = ...; // get a rich sequence from somewhere
rs.getNoteSet().add(accession1); // annotate the rich sequence with the first additional accession
rs.getNoteSet().add(accession2); // annotate the rich sequence with the second additional accession
...
// you can annotate bioentry objects in exactly the same way
BioEntry be = ...; // get a bioentry from somewhere
be.getNoteSet().add(accession1); 
be.getNoteSet().add(accession2);

See later in this document for more information on how to annotate and comment on sequences.

Circular sequences

BioJavaX can flag sequences as being circular, using the setCircular() and getCircular() methods on RichSequence instances. However, as this information is not part of BioSQL, it will be lost when the sequence is persisted to a BioSQL database. Use with care.

Note that only circular sequences can have features with circular locations associated with them.

Relationships between sequences.

Relating two sequences

Two sequences can be related to each other by using a BioEntryRelationship object to construct the link.

Relationships are optionally ranked. If you don't want to rank the relationship, use null in the constructor.

The following code snippet defines a new term "contains" in the default ontology, then creates a relationship that states that sequence A (the parent) contains sequence B (the child):

ComparableTerm contains = RichObjectFactory.getDefaultOntology().getOrCreateTerm("contains");
...
RichSequence parent = ...; // get sequence A from somewhere 
RichSequence child = ...; // get sequence B from somewhere
BioEntryRelationship relationship = new SimpleBioEntryRelationship(parent,child,contains,null);
parent.addRelationship(relationship); // add the relationship to the parent
...
parent.removeRelationship(relationship); // you can always take it away again later 

Querying the relationship

Sequences are only aware of relationships in which they are the parent sequence. A child sequence cannot find out which parent sequences it is related to.

The following code snippet prints out all the relationships a sequence has with child sequences:

RichSequence rs = ...; // get a rich sequence from somewhere
for (Iterator i = rs.getRelationships().iterator(); i.hasNext(); ) {
     BioEntryRelationship br = (BioEntryRelationship)i.next();
     BioEntry parent = br.getObject(); // parent == rs
     BioEntry child = br.getSubject(); 
     ComparableTerm relationship = br.getTerm();
     // print out the relationship (eg. "A contains B");
     System.out.println(parent.getName()+" "+relationship.getName()+" "+child.getName());
}

Reading and writing files.

Tools for reading/writing files

BioJavaX provides a replacement set of tools for working with files. This is necessary because the new file parsers must work with the new RichSeqIOListener in order to preserve all the information from the file correctly.

The tools can all be found in RichSequence.IOTools, a class nested inside the RichSequence interface. For each file format there are a number of utility methods in this class for reading a variety of sequence types, and writing them out again. See later sections of this chapter for details on individual formats.

Here is an example of using the RichSequence.IOTools methods. The example reads a file in Genbank format containing some DNA sequences, then prints them out to standard out (the screen) in EMBL format:

// an input GenBank file
BufferedReader br = new BufferedReader(new FileReader("myGenbank.gbk"));  
// a namespace to override that in the file
Namespace ns = RichObjectFactory.getDefaultNamespace();                   
// we are reading DNA sequences
RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(br,ns);   
while (seqs.hasNext()) {
    RichSequence rs = seqs.nextRichSequence();
    // write it in EMBL format to standard out
    RichSequence.IOTools.writeEMBL(System.out, rs, ns);                   
}

If you wish to output a number of sequences in one of the XML formats, you have to pass a RichSequenceIterator over your collection of sequences in order for the XML format to group them together into a single file with the correct headers:

// an input GenBank file
BufferedReader br = new BufferedReader(new FileReader("myGenbank.gbk"));  
// a namespace to override that in the file
Namespace ns = RichObjectFactory.getDefaultNamespace();                   
// we are reading DNA sequences
RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(br,ns);   
// write the whole lot in EMBLxml format to standard out
RichSequence.IOTools.writeEMBLxml(System.out, seqs, ns);

If you don't know what format your input file is in, but know it could be one of a fixed set of acceptable formats, then you can use BioJavaX's format-guessing routine to attempt to read it:

// Not sure if your input is EMBL or Genbank? Load them both here.
Class.forName("org.biojavax.bio.seq.io.EMBLFormat");
Class.forName("org.biojavax.bio.seq.io.GenbankFormat");
 
// Now let BioJavaX guess which format you actually should use (using the default namespace)
Namespace ns = RichObjectFactory.getDefaultNamespace();         
RichSequenceIterator seqs = RichSequence.IOTools.readFile(new File("myfile.seq"),ns);

For those who like to do things the hard way, reading and writing by directly using the RichStreamReader and RichStreamWriter interfaces is described below.

Reading using RichStreamReader

File reading is based around the concept of a RichStreamReader. This object returns a RichSequenceIterator which iterates over every sequence in the file on demand.

To construct a RichStreamReader, you will need five things.

  1. a BufferedReader instance which is connected to the file you wish to parse;
  2. a RichSequenceFormat instance which understands the format of the file (eg. FastaFormat, GenbankFormat, etc.);
  3. a SymbolTokenization which understands how to translate the sequence data in the file into a BioJava SymbolList;
  4. a RichSequenceBuilderFactory instance which generates instances of RichSequenceBuilder;
  5. a Namespace instance to associate the sequences with.

The RichSequenceBuilderFactory is best set to one of the predefined constants in the RichSequenceBuilderFactory interface. These constants are defined as:

Table 8.1. RichSequenceBuilderFactory predefined constants.

Name of constant
    What it will do

RichSequenceBuilderFactory.FACTORY
    Does not attempt any compression on sequence data.

RichSequenceBuilderFactory.PACKED
    Will compress all sequence data using PackedSymbolLists.

RichSequenceBuilderFactory.THRESHOLD
    Will compress sequence data using a PackedSymbolList only when the sequence exceeds 5000 bases in length. Otherwise, data is not compressed.
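The trade-off behind these constants can be illustrated with a small bit-packing sketch. It is not how PackedSymbolList is implemented: PackedDnaSketch is a hypothetical class, it assumes a 2-bit code for each unambiguous base (A, C, G, T) with four bases per byte, and it only borrows the 5000-base threshold from the table above.

```java
/** Hypothetical 2-bit packing of unambiguous DNA, four bases per byte. */
class PackedDnaSketch {
    static final int THRESHOLD = 5000; // threshold length taken from the table above
    private static final String BASES = "ACGT";

    /** Mirrors the THRESHOLD rule: only sequences longer than the threshold are packed. */
    static boolean shouldPack(int length) {
        return length > THRESHOLD;
    }

    static byte[] pack(String dna) {
        byte[] out = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            int code = BASES.indexOf(Character.toUpperCase(dna.charAt(i)));
            if (code < 0) {
                throw new IllegalArgumentException("Not an unambiguous base: " + dna.charAt(i));
            }
            out[i / 4] |= (byte) (code << ((i % 4) * 2));
        }
        return out;
    }

    static String unpack(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
            sb.append(BASES.charAt(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("ACGTTGCA");
        System.out.println(packed.length + " bytes hold " + unpack(packed, 8));
    }
}
```

Packing saves memory but adds a little work to every read, which is why a length threshold is a reasonable middle ground for files with a mix of short and long sequences.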

If you set the namespace to null, then the namespace used will depend on the format you are reading. For formats which specify namespaces, the namespace from the file will be used. For formats which do not specify namespaces, the default namespace provided by RichObjectFactory.getDefaultNamespace() will be used.
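The rule amounts to a three-step fallback, sketched below using plain strings in place of Namespace objects. NamespaceChoiceSketch and its method are hypothetical and not part of BioJavaX; the only value taken from this document is the default namespace name, lcl.

```java
/** Hypothetical illustration of how the namespace for a parsed sequence is chosen. */
class NamespaceChoiceSketch {
    static final String DEFAULT_NAMESPACE = "lcl"; // default namespace name in BioJavaX

    /**
     * @param suppliedByCaller namespace passed to the reader, or null
     * @param foundInFile      namespace specified by the file format, or null
     */
    static String choose(String suppliedByCaller, String foundInFile) {
        if (suppliedByCaller != null) {
            return suppliedByCaller; // an explicit namespace overrides the file
        }
        if (foundInFile != null) {
            return foundInFile;      // otherwise use what the file says
        }
        return DEFAULT_NAMESPACE;    // otherwise fall back to the default
    }

    public static void main(String[] args) {
        System.out.println(choose(null, null));
    }
}
```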

The SymbolTokenization should be obtained from the Alphabet that represents the sequence data you are expecting from the file. If you are reading DNA sequences, you should use DNATools.getDNA().getTokenization("token"). Other alphabets with tools classes will have similar methods.

For an alphabet which does not have a tools class, you can do this:

Alphabet a = ...; // get an alphabet instance from somewhere
SymbolTokenization st = a.getTokenization("token");

Writing using RichStreamWriter

File output is done using RichStreamWriter. This requires:

  1. An OutputStream to write sequences to.
  2. A RichSequenceFormat instance which understands the format to write (eg. FastaFormat).
  3. A Namespace to use for the sequences.
  4. A RichSequenceIterator that provides the sequences to write.

The namespace should only be specified when the file format includes namespace information and you wish to override the information associated with the actual sequences. If you do not wish to do this, just set it to null, and the namespace from each individual sequence will be used instead.

The RichSequenceIterator is an iterator over a set of sequences, exactly the same as the one returned by the RichStreamReader. It is therefore possible to plug a RichStreamReader directly into a RichStreamWriter and convert data from one file format to another with no intermediate steps.

If you only have one sequence to write, you can wrap it in a temporary RichSequenceIterator by using a call like this:

RichSequence rs = ...; // get sequence from somewhere
RichSequenceIterator it = new SingleRichSeqIterator(rs); // wrap it in an iterator 

Example

The following is an example that will read some DNA sequences from a GenBank file and write them out to standard output (screen) as FASTA using the methods outlined above:

// sequences will be DNA sequences
SymbolTokenization dna = DNATools.getDNA().getTokenization("token");        
// read Genbank
RichSequenceFormat genbank = new GenbankFormat();                           
// write FASTA
RichSequenceFormat fasta = new FastaFormat();                               
// compress only longer sequences
RichSequenceBuilderFactory factory = RichSequenceBuilderFactory.THRESHOLD;  
// read/write everything using the 'bloggs' namespace
Namespace bloggsNS = (Namespace)RichObjectFactory.getObject(
                        SimpleNamespace.class,
                        new Object[]{"bloggs"}
                     );
 
// read seqs from "mygenbank.file"
BufferedReader input = new BufferedReader(new FileReader("mygenbank.file"));
// write seqs to STDOUT
OutputStream output = System.out;                                           
 
RichStreamReader seqsIn = new RichStreamReader(input,genbank,dna,factory,bloggsNS);
RichStreamWriter seqsOut = new RichStreamWriter(output,fasta);
// one-step Genbank to Fasta conversion!
seqsOut.writeStream(seqsIn,bloggsNS);

Line widths and eliding information

When working at this level, extra methods can be used when direct access to the RichSequenceFormat object is available. These methods are:

Table 8.2. RichSequenceFormat extra options.

Name of method What it will do
get/setLineWidth() Sets the line width for output. Any lines longer than this will be wrapped. The default for most formats is 80.
get/setElideSymbols() When set to true, this will skip the sequence data (i.e. the addSymbols() method of the RichSeqIOListener will never be called).
get/setElideFeatures() When set to true, this will skip the feature tables in the file.
get/setElideComments() When set to true, this will skip all comments in the file.
get/setElideReferences() When set to true, this will skip all publication cross-references in the file.

Finer control is available when you go even deeper and write your own RichSeqIOListener objects. See later in this document for information on that subject.
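
To make the effect of a line width setting concrete, here is a minimal standalone sketch of wrapping sequence text at a fixed width. It does not use BioJavaX: the class LineWrapSketch and its wrap() method are hypothetical names chosen for illustration, and the real formats may wrap differently (for example by adding indentation or position counts).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJavaX.
final class LineWrapSketch {
    // Splits 'text' into consecutive chunks of at most 'width' characters.
    static List<String> wrap(String text, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < text.length(); i += width) {
            lines.add(text.substring(i, Math.min(text.length(), i + width)));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Print a short sequence wrapped at 8 characters per line.
        for (String line : wrap("ACGTACGTACGTACGTACGT", 8)) {
            System.out.println(line);
        }
    }
}
```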

How parsed data becomes a sequence.

All fields read from a file, regardless of the format, are passed to an instance of RichSequenceBuilder. In the case of the tools provided in RichSequence.IOTools, or any RichStreamReader using one of the RichSequenceBuilderFactory constants or SimpleRichSequenceBuilderFactory, this is an instance of SimpleRichSequenceBuilder.

SimpleRichSequenceBuilder constructs sequences as follows:

Table 8.3. SimpleRichSequenceBuilder sequence construction.

Name of method What it will do
startSequence Resets all the values in the builder to their defaults, ready to parse a whole new sequence.
addSequenceProperty Assumes that both the key and the value of the property are strings. It uses the key to look up a term with the same name (case-sensitive) in the ontology provided by RichObjectFactory.getDefaultOntology(). If it finds no such term, it creates one. It then adds an annotation to the sequence with that term as the key, using the value provided. The first annotation receives the rank of 0, the second 1, and so on. The annotations are attached to the sequence using setNoteSet() and the accumulated set of notes.
setVersion Value is passed to the resulting sequence's setVersion method.
setURI Not implemented, throws an exception.
setSeqVersion Only accepts a single call per sequence. Value is parsed into a double and passed to the resulting sequence's setSeqVersion method. If the value is null, then 0.0 is used.
setAccession Value is passed directly to the sequence's setAccession method. Multiple calls will replace the accession, not add extra ones. The accession cannot be null.
setDescription Only accepts a single call per sequence. Value is passed directly to the resulting sequence's setDescription method.
setDivision Only accepts a single call per sequence. Value is passed directly to the resulting sequence's setDivision method. The division cannot be null.
setIdentifier Only accepts a single call per sequence. Value is passed directly to the resulting sequence's setIdentifier method.
setName Only accepts a single call per sequence. Value is passed directly to the resulting sequence's setName method.
setNamespace Only accepts a single call per sequence. Value is passed directly to the resulting sequence's setNamespace method. The namespace cannot be null.
setComment Adds the text supplied (which must not be null) as a comment to the sequence using addComment(). Multiple calls will result in multiple comments being added. The first comment is ranked 1, the second comment ranked 2, and so on.
setTaxon Value is passed to the sequence's setTaxon method. It must not be null. If this method is called repeatedly, only the first call will be accepted. Subsequent calls will result in warnings being printed to standard error. These extra calls will not cause the builder to fail. The value from the initial call will be the one that is used.
startFeature Tells the builder to start a new feature on this sequence. If the current feature has not yet been ended, then this feature will be a sub-feature of the current feature and associated with it via a RichFeatureRelationship, where the current feature is the parent and this new feature is the child. The relationship will be defined with the term "contains" from RichObjectFactory.getDefaultOntology(). Each feature will be attached to the resulting sequence by calling setParent() on the feature once the sequence has been created.
getCurrentFeature Returns the current feature, if one has been started. If there is no current feature (e.g. it has already ended, or one was never started) then an exception is thrown.
addFeatureProperty Assumes that both the key and the value of the property are strings. It uses the key to look up a term with the same name (case-sensitive) in the ontology provided by RichObjectFactory.getDefaultOntology(). If it finds no such term, it creates one. It then adds an annotation to the current feature with that term as the key, using the value provided. The first annotation receives the rank of 0, the second 1, and so on. The annotations are attached to the feature using getAnnotation().addNote().
endFeature Ends the current feature. If there is no current feature, an exception is thrown.
setRankedDocRef Adds the given RankedDocRef to the set of publication cross-references which the sequence being built refers to. The value cannot be null. If the same value is provided multiple times, it will only be saved once. Each value is stored by calling addRankedDocRef() on the resulting sequence.
setRankedCrossRef Adds the given RankedCrossRef to the set of database cross-references which the sequence being built refers to. The value cannot be null. If the same value is provided multiple times, it will only be saved once. Each value is stored by calling addRankedCrossRef() on the resulting sequence.
setRelationship Adds the given BioEntryRelationship to the set of relationships in which the sequence being built is the parent. The relationship cannot be null. If the same relationship is provided multiple times, it will only be saved once. Each relationship is stored by calling addRelationship() on the resulting sequence.
setCircular You can call this as many times as you like. Each call will override the value provided by the previous call. The value is passed to the sequence's setCircular method.
addSymbols Adds symbols to this sequence. You can call it multiple times to set symbols at different locations in the sequence. If any of the symbols found are not in the alphabet accepted by this builder, or if the locations provided to place the symbols at are unacceptable, an exception is thrown. The resulting SymbolList will be the basis upon which the final RichSequence object is built.
endSequence Tells the builder that we have provided all the information we know. If at this point the name, namespace, or accession have not been provided, or if any of them are null, an exception is thrown.
makeSequence Constructs a RichSequence object from the information provided, following the rules laid out in this table, and returns it. The RichSequence object does not actually exist until this method has been called.
makeRichSequence Wrapper for makeSequence.

If you want fine-grained control over every aspect of a file whilst it is being parsed, you must write your own implementation of the RichSeqIOListener interface (which RichSequenceBuilder extends). This is detailed later in this document.
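
A few of the rules in Table 8.3 can be modelled in a small standalone sketch: sequence annotations are ranked from 0, comments are ranked from 1, repeated setAccession calls replace the earlier value, and endSequence fails if the name, namespace, or accession is missing. ToyBuilderSketch is a hypothetical toy class written for this illustration. It stores plain strings and is not how SimpleRichSequenceBuilder is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: mimics a few rules from Table 8.3, not the real builder.
final class ToyBuilderSketch {
    private String name;
    private String namespace;
    private String accession;
    private final List<String> notes = new ArrayList<>();
    private final List<String> comments = new ArrayList<>();

    // Reset everything, ready for a new sequence.
    void startSequence() {
        name = null;
        namespace = null;
        accession = null;
        notes.clear();
        comments.clear();
    }

    void setName(String value) { name = value; }

    void setNamespace(String value) { namespace = value; }

    // Multiple calls replace the accession rather than adding extra ones.
    void setAccession(String value) {
        if (value == null) {
            throw new IllegalArgumentException("accession cannot be null");
        }
        accession = value;
    }

    // The first annotation receives rank 0, the second rank 1, and so on.
    void addSequenceProperty(String key, String value) {
        int rank = notes.size();
        notes.add(rank + ":" + key + "=" + value);
    }

    // The first comment is ranked 1, the second 2, and so on.
    void setComment(String text) {
        if (text == null) {
            throw new IllegalArgumentException("comment cannot be null");
        }
        int rank = comments.size() + 1;
        comments.add(rank + ":" + text);
    }

    // Name, namespace and accession must all be present by now.
    void endSequence() {
        if (name == null || namespace == null || accession == null) {
            throw new IllegalStateException("name, namespace and accession are required");
        }
    }

    String accession() { return accession; }
    List<String> notes() { return notes; }
    List<String> comments() { return comments; }
}
```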

FASTA

FastaFormat reads and writes FASTA files, and is able to parse the description line in detail.

Reading

The description line formats understood are as follows:

>gi|<identifier>|<namespace>|<accession>.<version>|<name> <description>
>gi|<identifier>|<namespace>|<accession>|<name> <description>
><namespace>|<accession>.<version>|<name> <description>
><namespace>|<accession>|<name> <description>
><name> <description>

The description is optional in all cases. The version defaults to 0 if not provided.

If a non-null Namespace is provided, then the namespace in the file is ignored.

If a null Namespace is provided, then the namespace from the file is used. If no namespace is specified in the file, then RichObjectFactory.getDefaultNamespace() is used.

The fields are passed into the RichSeqIOListener as follows:

Table 8.4. FastaFormat input field destinations.

FASTA Info type Method used to set info
identifier setIdentifier()
namespace setNamespace()
accession setAccession()
version setVersion()
name setName()
description setDescription()
<sequence data> addSymbols()
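
The description line formats and the namespace rules above can be sketched with regular expressions. This is a standalone illustration and not the FastaFormat source: the class FastaHeaderSketch, its Header holder, and the DEFAULT_NAMESPACE placeholder (standing in for whatever RichObjectFactory.getDefaultNamespace() would supply) are all assumptions. For the plain ><name> form the sketch leaves the identifier and accession unset, since the text above does not say what the parser does with them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the five description line layouts listed above.
final class FastaHeaderSketch {
    // Placeholder value, assumed for this sketch only.
    static final String DEFAULT_NAMESPACE = "default";

    static final class Header {
        String identifier, namespace, accession, name, description;
        int version; // stays 0 when the line carries no version
    }

    // >gi|<identifier>|<namespace>|<accession>[.<version>]|<name> [<description>]
    private static final Pattern GI = Pattern.compile(
        "^>gi\\|([^|]+)\\|([^|]+)\\|([^|.\\s]+)(?:\\.(\\d+))?\\|([^|\\s]+)(?:\\s+(.*))?$");
    // ><namespace>|<accession>[.<version>]|<name> [<description>]
    private static final Pattern NS = Pattern.compile(
        "^>([^|\\s]+)\\|([^|.\\s]+)(?:\\.(\\d+))?\\|([^|\\s]+)(?:\\s+(.*))?$");
    // ><name> [<description>]
    private static final Pattern PLAIN = Pattern.compile("^>(\\S+)(?:\\s+(.*))?$");

    static Header parse(String line, String overrideNamespace) {
        Header h = new Header();
        Matcher m;
        if ((m = GI.matcher(line)).matches()) {
            h.identifier = m.group(1);
            fill(h, m, 2);
        } else if ((m = NS.matcher(line)).matches()) {
            fill(h, m, 1);
        } else if ((m = PLAIN.matcher(line)).matches()) {
            h.name = m.group(1);
            h.description = m.group(2);
        } else {
            throw new IllegalArgumentException("Not a FASTA description line: " + line);
        }
        if (overrideNamespace != null) {
            h.namespace = overrideNamespace;   // a supplied namespace wins over the file
        } else if (h.namespace == null) {
            h.namespace = DEFAULT_NAMESPACE;   // nothing in the file either
        }
        return h;
    }

    // Copies namespace, accession, version, name and description, starting at group 'first'.
    private static void fill(Header h, Matcher m, int first) {
        h.namespace = m.group(first);
        h.accession = m.group(first + 1);
        String v = m.group(first + 2);
        h.version = (v == null) ? 0 : Integer.parseInt(v);
        h.name = m.group(first + 3);
        h.description = m.group(first + 4);
    }
}
```

Trying the most specific pattern first keeps a gi line from being read as a namespace line whose namespace happens to be "gi".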

Writing

Description lines are always output in one of two forms:

>gi|<identifier>|<namespace>|<accession>.<version>|<name> <description>
><namespace>|<accession>.<version>|<name> <description>

If the accession number and the name are identical, the <name> is omitted.

The first form is used if the identifier of the sequence object is not null, otherwise the second form is used. In both cases, the description is only output if it is not null.

The fields are read from the RichSequence object as follows:

Table 8.5. FastaFormat output field sources.

FASTA Info type Method used to get info
identifier getIdentifier()
namespace getNamespace()
accession getAccession()
version getVersion()
name getName()
description getDescription()
<sequence data> Sequence is read directly as it is a SymbolList.
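
The output rules for the description line can be sketched as a small standalone function. FastaHeaderWriterSketch and describe() are hypothetical names, and the plain String and int parameters stand in for the values returned by the getters in Table 8.5. One detail is an assumption: when the name is omitted because it equals the accession, this sketch keeps the trailing '|' separator, which the text above does not specify.

```java
// Hypothetical helper, not the FastaFormat implementation.
final class FastaHeaderWriterSketch {
    static String describe(String identifier, String namespace, String accession,
                           int version, String name, String description) {
        StringBuilder sb = new StringBuilder(">");
        // First form (with gi|<identifier>|) only when an identifier is present.
        if (identifier != null) {
            sb.append("gi|").append(identifier).append('|');
        }
        sb.append(namespace).append('|')
          .append(accession).append('.').append(version).append('|');
        // The name is left out when it duplicates the accession.
        if (!accession.equals(name)) {
            sb.append(name);
        }
        // The description is only written when it is not null.
        if (description != null) {
            sb.append(' ').append(description);
        }
        return sb.toString();
    }
}
```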

GenBank

GenbankFormat reads and writes GenBank files, and understands almost all permutations of the location descriptors found in the feature tables.

Reading

The fields are passed into the RichSeqIOListener as follows:

Table 8.6. GenbankFormat input field destinations.

GenBank Field How is it processed?
LOCUS setName(), addSequenceProperty(Terms.getStrandedTerm()), setCircular(), addSequenceProperty(Terms.getMolTypeTerm()), addSequenceProperty(Terms.getDateUpdatedTerm()), and setDivision().
DEFINITION setDescription()
ACCESSION The first one is passed to setAccession(). Subsequent entries are passed to addSequenceProperty(Terms.getAdditionalAccessionTerm()).
VERSION The section before the full stop "." is passed to setAccession(). If it differs from the first accession on the ACCESSION line, then the first accession on the ACCESSION line becomes an additional accession, whilst the accession from the VERSION line becomes the primary accession. The section after the full stop is passed to setVersion(). The GI number is passed to setIdentifier().
KEYWORDS The line is split up into individual keywords, each of which is passed to addSequenceProperty(Terms.getKeywordTerm()).
SOURCE Ignored.
ORGANISM Ignored.
REFERENCE The coordinates of the reference end up as start and end coordinates of a SimpleRankedDocRef object which is attached to the sequence by calling setRankedDocRef().
AUTHORS The value is parsed into a set of DocRefAuthor objects using DocRefAuthor.Tools. The resulting set becomes part of the DocRef object which is wrapped using a SimpleRankedDocRef and attached to the sequence.
TITLE The title is passed to the current DocRef object using setTitle().
JOURNAL The journal is passed to the current DocRef object using setLocation().
PUBMED A RankedCrossRef object is created pointing to Terms.PUBMED_KEY as the database, and using this value as the accession with a version of 0. It is attached to the sequence using setRankedCrossRef(). If no MEDLINE line is found, this is also associated with the current reference by using setCrossRef() on the DocRef object.
MEDLINE Behaves similarly to PUBMED, but with a database name of Terms.MEDLINE_KEY. It takes precedence over PUBMED and will always be used for the DocRef cross-reference.
REMARK Added to the current reference by calling setRemark() on the DocRef object.
COMMENT setComment()
FEATURES Each feature is started by calling startFeature(). The source is Terms.getGenBankTerm() whereas the type is obtained from RichObjectFactory.getDefaultOntology().getOrCreateTerm() using the feature name. Qualifiers are added by using addFeatureProperty() with the term key created by RichObjectFactory.getDefaultOntology().getOrCreateTerm() using the qualifier name. There are two special cases of qualifier: db_xref and organism. Neither ends up being stored as a qualifier. A database cross-reference is created for db_xref qualifiers and added to the feature using addRankedCrossRef(), except when the feature type is source and the database name (before the colon) is taxon, in which case the taxon ID is used in conjunction with the organism qualifier to determine the NCBITaxon for this sequence, and passed to the sequence using setTaxon(). Location strings are run through GenbankLocationParser to generate RichLocation instances to attach to the feature.
BASE Ignored.
ORIGIN The sequence is read and passed to addSymbols().
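
The interplay between the ACCESSION and VERSION lines described in Table 8.6 can be sketched as follows. This is a standalone illustration with hypothetical names (VersionLineSketch, resolve()). It takes the accessions in file order plus the text that follows the VERSION keyword, and it assumes the first token always has the form <accession>.<version>. What the real parser does with a VERSION value lacking a full stop is not covered here, so the sketch simply rejects it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the ACCESSION / VERSION rules, not GenbankFormat itself.
final class VersionLineSketch {
    String accession;      // primary accession
    int version;
    String identifier;     // the GI number, if any
    final List<String> additionalAccessions = new ArrayList<>();

    static VersionLineSketch resolve(List<String> accessions, String versionValue) {
        VersionLineSketch r = new VersionLineSketch();
        // ACCESSION line: first entry is primary, the rest are additional.
        r.accession = accessions.get(0);
        r.additionalAccessions.addAll(accessions.subList(1, accessions.size()));

        String[] tokens = versionValue.trim().split("\\s+");
        int dot = tokens[0].lastIndexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("expected <accession>.<version>: " + versionValue);
        }
        String versionAccession = tokens[0].substring(0, dot);
        r.version = Integer.parseInt(tokens[0].substring(dot + 1));

        // If VERSION disagrees, it becomes primary and the old primary is demoted.
        // This sketch does not remove duplicates from the additional list.
        if (!versionAccession.equals(r.accession)) {
            r.additionalAccessions.add(0, r.accession);
            r.accession = versionAccession;
        }
        // The GI number, when present, becomes the identifier.
        for (String token : tokens) {
            if (token.startsWith("GI:")) {
                r.identifier = token.substring(3);
            }
        }
        return r;
    }
}
```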


Writing

The fields are read from the RichSequence object as follows:

Table 8.7. GenbankFormat output field sources.

GenBank Field How is it output?
LOCUS getName(), length(), getNoteSet(Terms.getStrandedTerm()), getNoteSet(Terms.getMolTypeTerm()), getCircular(), getDivision(), and getNoteSet(Terms.getDateUpdatedTerm())
DEFINITION getDescription()
ACCESSION getAccession(), and getNoteSet(Terms.getAdditionalAccessionTerm()).
VERSION getAccession(), getIdentifier() and getVersion()
KEYWORDS getNoteSet(Terms.getKeywordTerm()).
SOURCE getTaxon().getDisplayName()
ORGANISM getTaxon().getDisplayName(), chopped before the first bracket, and getTaxon().getNameHierarchy()
REFERENCE Each reference is obtained from getRankedDocRefs(). The coordinates of the reference are from the reference's getStart() and getEnd() methods.
AUTHORS The author string is from the reference's getAuthors() method.
TITLE The title is from the reference's getTitle().
JOURNAL The journal information is from the reference's getLocation().
PUBMED / MEDLINE The cross reference returned by getCrossRef() on the reference provides the database name and accession used here.
REMARK getRemark() on the current reference object.
COMMENT All the comments returned by getComments() are joined together, separated by newlines.
FEATURES Each feature is output in turn by iterating through getFeatureSet(). For the source feature, the db_xref and organism fields are added to the output by calling getTaxon().getNCBITaxID() and getTaxon().getDisplayName() on the sequence (the latter is chopped before the first bracket if necessary). For all features, extra db_xref qualifiers are output for each cross-reference returned by calling getRankedCrossRefs() on the feature. The other qualifiers for the features are the contents of the feature's annotation, provided by getNoteSet() on the feature. GenbankLocationParser is used to convert the feature's getLocation() output into the correct text format.
BASE Calculated from the sequence data.
ORIGIN The sequence is read directly as it is a SymbolList.

EMBL

EMBLFormat reads and writes EMBL files, and understands almost all permutations of the location descriptors found in the feature tables.

In version 87 of EMBL, the format for the ID line changed. The parser will understand files with both 87 and pre-87 ID lines, but by default will write out files using the new 87 ID line format. If you wish to write files using the pre-87 ID line format, you must call the writeSequence() method directly and specify the EMBL_PRE87_FORMAT format.

Reading

The fields are passed into the RichSeqIOListener as follows:

Table 8.8. EMBLFormat input field destinations.

EMBL Field How is it processed?
ID setName(), addSequenceProperty(Terms.getMolTypeTerm()), setDivision(), setCircular(), addSequenceProperty(Terms.getGenomicTerm()), addSequenceProperty(Terms.getDataClassTerm()) (87 only)
AC First accession goes to setAccession(), all others to addSequenceProperty(Terms.getAdditionalAccessionTerm()).
SV If the accession (before the full stop ".") is different from the first accession on the AC line, then this accession becomes the primary accession, and the first accession on the AC line becomes an additional accession. Everything after the full stop goes to setVersion(). If the version line is unparseable, it is stored using addSequenceProperty(Terms.getVersionLine()) instead.
DE setDescription()
DT For creation date: addSequenceProperty(Terms.getDateCreatedTerm()) and addSequenceProperty(Terms.getRelCreatedTerm()). For last updated date: addSequenceProperty(Terms.getDateUpdatedTerm()) and addSequenceProperty(Terms.getRelUpdatedTerm()).
DR Each record is split into a database name, primary accession, and additional accessions. A CrossRef object is constructed from these first two pieces, and annotated with additional accessions using Terms.getAdditionalAccessionTerm(). The whole thing is then given a rank and sent to setRankedCrossRef().
OS Ignored.
OC Ignored.
OG addSequenceProperty(Terms.getOrganelleTerm())
RN The number of the reference becomes the rank of the RankedDocRef object later.
RP The values on this line become the start and end of the RankedDocRef object later.
RX Each of these is parsed and the database name and primary accession are used to construct a CrossRef object. All CrossRef objects are ranked and added to the sequence using setRankedCrossRef(), and one of them will be added to the current reference using setCrossRef(). The one that is chosen will be MEDLINE, or PUBMED if MEDLINE is not present, or DOI if PUBMED is not present either.
RA Parsed using DocRefAuthor.Tools.parse() and becomes the set of authors for the DocRef object.
RG Parsed using DocRefAuthor.Tools.parse(), and each consortium is flagged using the setConsortium() method before being added to the set of authors for the DocRef object.
RT The title for setTitle() on the DocRef object.
RL The location for the setLocation() method on the DocRef object.
RC Used for setRemark() on the DocRef object.
KW Each keyword is sent individually to addSequenceProperty(Terms.getKeywordTerm())
CC setComment()
FH Ignored.
FT As per GenbankFormat. Please see the section on GenBank parsing.
CO Causes an exception as contigs are not supported.
AH Causes an exception as TPAs are not supported.
SQ Sequence data is passed to addSymbols().
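
The rule for picking which RX cross-reference is attached to the current reference (MEDLINE first, then PUBMED, then DOI) amounts to a simple priority lookup. The sketch below is standalone: DocRefCrossRefChoiceSketch and choose() are hypothetical names, and the upper-case database names are assumed spellings rather than the actual values of constants such as Terms.MEDLINE_KEY.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the MEDLINE > PUBMED > DOI preference.
final class DocRefCrossRefChoiceSketch {
    // Database names in order of preference (assumed spellings).
    private static final List<String> PRIORITY = Arrays.asList("MEDLINE", "PUBMED", "DOI");

    // 'crossRefs' maps a database name to the accession parsed from an RX line.
    // Returns the database whose cross-reference would be attached to the
    // reference, or null if none of the preferred databases is present.
    static String choose(Map<String, String> crossRefs) {
        for (String database : PRIORITY) {
            if (crossRefs.containsKey(database)) {
                return database;
            }
        }
        return null;
    }
}
```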


Writing

The fields are read from the RichSequence object as follows:

Table 8.9. EMBLFormat output field sources.

EMBL Field How is it output?
ID getName(), getNoteSet(Terms.getMolTypeTerm()), getDivision(), getCircular(), getNoteSet(Terms.getGenomicTerm()), getNoteSet(Terms.getDataClassTerm()) (87 only)
AC getAccession(), and getNoteSet(Terms.getAdditionalAccessionTerm()).
SV getAccession() and getVersion(), or getNoteSet(Terms.getVersionLine()) if present.
DE getDescription()
DT For creation date: getNoteSet(Terms.getDateCreatedTerm()) and getNoteSet(Terms.getRelCreatedTerm()). For last updated date: getNoteSet(Terms.getDateUpdatedTerm()) and getNoteSet(Terms.getRelUpdatedTerm()). If date created is null, then the update date is duplicated and used here as well.
DR getRankedCrossRefs(), using getNoteSet(Terms.getAdditionalAccessionTerm()) to generate additional accessions.
OS getTaxon().getDisplayName()
OC getTaxon().getDisplayName(), chopped before the first bracket, and getTaxon().getNameHierarchy().
OG getNoteSet(Terms.getOrganelleTerm())
RN Each reference returned by getRankedDocRefs() is iterated over. The rank of the RankedDocRef object is output here.
RP The start and end coordinates of the RankedDocRef object.
RX The getCrossRef() output from the DocRef object.
RA The getAuthors() output from the DocRef object, with the consortiums removed.
RG The getAuthors() output from the DocRef object, with all except consortiums removed.
RT The getTitle() from the DocRef.
RL The getLocation() from the DocRef.
RC The getRemark() from the DocRef.
KW getNoteSet(Terms.getKeywordTerm()).
CC One comment section per entry in getComments().
FH No fields necessary here.
FT As per GenbankFormat. Please see the section on GenBank writing.
CO Never generated.
AH Never generated.
SQ Sequence counts are generated, then sequence is read directly as it is a SymbolList.

UniProt

UniProtFormat reads and writes UniProt files.

Reading

The fields are passed into the RichSeqIOListener as follows:

Table 8.10. UniProtFormat input field destinations.

UniProt Field How is it processed?
ID setName(), addSequenceProperty(Terms.getMolTypeTerm()), addSequenceProperty(Terms.getDataClassTerm()), setDivision()
AC First accession goes to setAccession(), all others to addSequenceProperty(Terms.getAdditionalAccessionTerm()).
DE setDescription()
DT For creation date: addSequenceProperty(Terms.getDateCreatedTerm()) and addSequenceProperty(Terms.getRelCreatedTerm()). For last sequence updated date: addSequenceProperty(Terms.getDateUpdatedTerm()) and addSequenceProperty(Terms.getRelUpdatedTerm()). For last annotation updated date: addSequenceProperty(Terms.getDateAnnotatedTerm()) and addSequenceProperty(Terms.getRelAnnotatedTerm()).
DR Each record is split into a database name, primary accession, and additional accessions. A CrossRef object is constructed from these first two pieces, and annotated with additional accessions using Terms.getAdditionalAccessionTerm(). The whole thing is then given a rank and sent to setRankedCrossRef().
OS First named species is used as the scientific name to construct an NCBITaxon object, along with the tax ID from the OX line, and passed to setTaxon(). The second name, if present, is the common name. Subsequent names are synonyms.
OC Ignored.
OX See details for the OS line.
OG addSequenceProperty(Terms.getOrganelleTerm())
GN Gene names are passed to addSequenceProperty(Terms.getGeneNameTerm()). Gene synonyms are passed to addSequenceProperty(Terms.getGeneSynonymTerm()). Ordered locus names are passed to addSequenceProperty(Terms.getOrderedLocusNameTerm()). ORF names are passed to addSequenceProperty(Terms.getORFNameTerm()). The values have a number and a colon prefixed, where the number refers to the sequence order of the current gene.
RN The number of the reference becomes the rank of the RankedDocRef object later.
RP The whole value is passed to setRemark(). If it contains the words 'SEQUENCE OF', then the sequence position is parsed out and becomes the start and end of the RankedDocRef object later.
RX Each of these is parsed and the database name and primary accession are used to construct a CrossRef object. All CrossRef objects are ranked and added to the sequence using setRankedCrossRef(), and one of them will be added to the current reference using setCrossRef(). The one that is chosen will be MEDLINE, or PUBMED if MEDLINE is not present, or DOI if PUBMED is not present either.
RA Parsed using DocRefAuthor.Tools.parse() and becomes the set of authors for the DocRef object.
RG Parsed using DocRefAuthor.Tools.parse(), and each consortium is flagged using the setConsortium() method before being added to the set of authors for the DocRef object.
RT The title for setTitle() on the DocRef object.
RL The location for the setLocation() method on the DocRef object.
RC Comments are key-value pairs. Species comments are passed to addSequenceProperty(Terms.getSpeciesTerm()). Strain comments are passed to addSequenceProperty(Terms.getStrainTerm()). Tissue comments are passed to addSequenceProperty(Terms.getTissueTerm()). Transposon comments are passed to addSequenceProperty(Terms.getTransposonTerm()). Plasmid comments are passed to addSequenceProperty(Terms.getPlasmidTerm()). The values have a number and a colon prefixed, where the number refers to the rank of the current RankedDocRef.
KW Each keyword is sent individually to addSequenceProperty(Terms.getKeywordTerm())
CC If the comment is parseable using UniProtCommentParser then the value is passed to setComment(). Otherwise, it is assumed to be the copyright message that comes with UniProt records, and is passed to addSequenceProperty(Terms.getCopyrightTerm()).
FT Each feature encountered triggers a call to startFeature(), and calls endFeature() on completion. The location is parsed out using UniProtLocationParser. The source term is Terms.getUniProtTerm(), whereas the type term is a term from RichObjectFactory.getDefaultOntology().getOrCreateTerm() equivalent to the name of the feature. The feature description is stored using addFeatureProperty(Terms.getFeatureDescTerm()). Subsequent lines beginning with '/' are added as qualifiers. The only qualifier with a predefined term is 'FTId', which is represented by Terms.getFTIdTerm(). All others encountered have terms generated from RichObjectFactory.getDefaultOntology().getOrCreateTerm() with names equivalent to the name of the qualifier. Qualifiers are added using addFeatureProperty(). UniProt uses its own unique set of feature names. No attempt is made to translate other feature names to/from this set.
SQ Sequence data is passed to addSymbols().
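
The handling of the OS line (first name scientific, second common, the rest synonyms) can be sketched by splitting the line into its leading name and the names in brackets. This standalone sketch uses hypothetical names (OsLineSketch, names()) and assumes a simple layout such as "Solanum melongena (Eggplant) (Aubergine)." with no nested brackets. Real OS lines can be more complicated, and this is not the UniProtFormat parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical splitter for simple OS lines.
final class OsLineSketch {
    private static final Pattern BRACKETED = Pattern.compile("\\(([^()]*)\\)");

    // Index 0 is the scientific name, index 1 (if present) the common name,
    // and any further entries are synonyms.
    static List<String> names(String osValue) {
        String s = osValue.trim();
        if (s.endsWith(".")) {
            s = s.substring(0, s.length() - 1);   // drop the terminating full stop
        }
        List<String> result = new ArrayList<>();
        int open = s.indexOf(" (");
        if (open < 0) {
            result.add(s);                        // scientific name only
            return result;
        }
        result.add(s.substring(0, open));
        Matcher m = BRACKETED.matcher(s.substring(open));
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }
}
```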


Writing

The fields are read from the RichSequence object as follows:

Table 8.11. UniProtFormat output field sources.

UniProt Field How is it output?
ID getName(), getNoteSet(Terms.getMolTypeTerm()), getNoteSet(Terms.getDataClassTerm()), getDivision()
AC getAccession(), and getNoteSet(Terms.getAdditionalAccessionTerm()).
DE getDescription()
DT For creation date: getNoteSet(Terms.getDateCreatedTerm()) and getNoteSet(Terms.getRelCreatedTerm()). For last updated date: getNoteSet(Terms.getDateUpdatedTerm()) and getNoteSet(Terms.getRelUpdatedTerm()). For last annotation date: getNoteSet(Terms.getDateAnnotatedTerm()) and getNoteSet(Terms.getRelAnnotatedTerm()). If date created or date annotated is null, then the update date is duplicated and used here as well.
DR getRankedCrossRefs(), using getNoteSet(Terms.getAdditionalAccessionTerm()) to generate additional accessions.
OS getTaxon().getDisplayName() followed by all synonyms from getNames(NCBITaxon.SYNONYM) in brackets.
OC getTaxon().getNameHierarchy().
OG getNoteSet(Terms.getOrganelleTerm())
OX getTaxon().getNCBITaxID()
GN Gene names are written from getNoteSet(Terms.getGeneNameTerm()). Gene synonyms are written from getNoteSet(Terms.getGeneSynonymTerm()). Ordered locus names are written from getNoteSet(Terms.getOrderedLocusNameTerm()). ORF names are written from getNoteSet(Terms.getORFNameTerm()). As the values have a number and a colon prefixed, where the number refers to the sequence order of the current gene, these values are used to keep the correct names grouped together. This prefix is not included in the output.
RN Each reference returned by getRankedDocRefs() is iterated over. The rank of the RankedDocRef object is output here.
RP The getRemark() from the DocRef.
RX The getCrossRef() output from the DocRef object.
RA The getAuthors() output from the DocRef object, with the consortiums removed.
RG The getAuthors() output from the DocRef object, with all except consortiums removed.
RT The getTitle() from the DocRef.
RL The getLocation() from the DocRef.
RC Comments are key-value pairs. Species comments are from getNoteSet(Terms.getSpeciesTerm()). Strain comments are from getNoteSet(Terms.getStrainTerm()). Tissue comments are from getNoteSet(Terms.getTissueTerm()). Transposon comments are from getNoteSet(Terms.getTransposonTerm()). Plasmid comments are from getNoteSet(Terms.getPlasmidTerm()). As the values have a number and a colon prefixed, where the number refers to the rank of the current RankedDocRef, this is used to match the appropriate comments with each reference. This prefix is not included in the output.
KW getNoteSet(Terms.getKeywordTerm()).
CC One comment section per entry in getComments().
FT Each feature is written out using UniProtLocationParser to construct the location string from the feature's getLocation() output, with the feature name being the getType() of the feature and the description being getNoteSet(Terms.getFeatureDescTerm()) on the feature. The FTId, if present in the feature from getNoteSet(Terms.getFTIdTerm()), is written out underneath. No other qualifiers are written out. UniProt uses its own unique set of feature names. No attempt is made to translate other feature names to/from this set.
SQ Sequence counts are generated, then sequence is read directly as it is a SymbolList.
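
The numbered prefix used for GN and RC values (a number and a colon in front of each value) is what allows related values to be regrouped on output. The standalone sketch below shows that regrouping on plain strings. PrefixedNoteSketch and groupByPrefix() are hypothetical names, and the example values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper: regroups "<number>:<value>" strings by their number.
final class PrefixedNoteSketch {
    // Returns the values keyed by prefix number, in ascending order,
    // with the "<number>:" prefix stripped from each value.
    static SortedMap<Integer, List<String>> groupByPrefix(List<String> values) {
        SortedMap<Integer, List<String>> groups = new TreeMap<>();
        for (String value : values) {
            int colon = value.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("missing '<number>:' prefix: " + value);
            }
            int number = Integer.parseInt(value.substring(0, colon));
            groups.computeIfAbsent(number, k -> new ArrayList<>())
                  .add(value.substring(colon + 1));
        }
        return groups;
    }
}
```

Only the first colon is treated as the separator, so values that themselves contain colons are kept intact.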

INSDSeq (XML)

For parsing files that conform to http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt.

INSDSeqFormat is similar to the GenBank flat-file format in the way it organises information. Data will end up in the same places and using the same annotation terms. There are no additional annotation terms involved which are not also present in the GenBank flat-file format.

EMBLxml (XML)

For parsing files that conform to http://www.ebi.ac.uk/embl/Documentation/DTD/EMBL_dtd.txt.

EMBLxmlFormat is very similar to the EMBL flat-file format. Data will be parsed in much the same way and end up in the same locations. There are no additional annotation terms involved which are not also present in the EMBL flat-file format.

The only major difference between EMBL flat-file and EMBL XML is the location tags. In XML, they are highly structured. The parser gets round this complexity by constructing Genbank-style location strings out of the XML hierarchies. These strings are then passed to GenbankLocationParser for parsing into RichLocation objects. On output, the location tags are constructed directly from the RichLocation objects.

UniProtXML (XML)

For parsing files that conform to http://www.ebi.uniprot.org/support/docs/uniprot.xsd.

UniProtXMLFormat is very complex. The parser attempts to treat it in the same way as normal UniProt data, and information will end up in the same locations.

Throughout the format, evidence attributes (not tags) are ignored. There is simply no way to fit them into the BioJavaX object model.

Like the UniProt flat-file format, locations are passed through the UniProtLocationParser. Fuzziness may not be interpreted correctly, because files frequently do not supply enough information to construct the minimum requirements of a Position object. You may see exceptions being thrown on files which attempt to specify fuzziness without relation to a specific base or range of bases.

Comments are parsed and converted into flat-file UniProt comments using the UniProtCommentParser, and converted back again when outputting in this format. This allows for greater interoperability between the two formats, and also allows the UniProt XML comment data to be stored in the plain-text format expected by databases such as BioSQL. Some comments have been renamed in UniProt XML as opposed to the flat-file format. These comments will be parsed and converted to use the flat-file naming convention inside BioJavaX, but when they are output again, they will go back to their correct UniProt XML names. This is to increase interoperability between the two UniProt formats.

UniProt XML uses its own unique set of feature names, different even from the flat-file UniProt format. No attempt is made to translate other feature names to/from this set.

The UniProt XML format has no concept of a sequence description. However, it does have a protein tag which describes the structure of the sequence. This is parsed into a single protein description string and used as the value for setDescription(). Each part of the protein description is enclosed in square brackets and prefixed by the word 'Contains' for domains, and 'Includes' for components. Attempting to write a sequence whose description does not conform to this standard may produce unexpected output.

Keywords in UniProt XML have identifier numbers associated with them. A special ontology, Terms.getUniprotKWOnto(), is used to store these keywords and their identifiers as they are encountered over time. If a keyword is encountered with an unknown identifier during output, then the word 'UNKNOWN' is output in place of the identifier.

The secondary/tertiary/additional accessions for database cross-references in UniProt XML have hard-coded names which depend on the position of the accession and the name of the database. If the database name does not match one of the known ones, or an unexpected accession is found, then the name used will be Terms.getAdditionalAccessionTerm().

A number of additional annotation terms are used by UniProt XML. These are:

Table 8.12. Additional UniProtXMLFormat annotation terms.

Terms Usage
Terms.getProteinTypeTerm() Used to store the type attribute from the protein tag.
Terms.getEvidenceCategoryTerm() Used to store the category attribute of the evidence tag.
Terms.getEvidenceTypeTerm() Used to store the type attribute of the evidence tag.
Terms.getEvidenceDateTerm() Used to store the date attribute of the evidence tag.
Terms.getEvidenceAttrTerm() Used to store the attribute attribute of the evidence tag.
Terms.getFeatureRefTerm() Used to store the ref attribute of the feature tag.
Terms.getFeatureOriginalTerm() Used to store the value of the original sub-tag of the feature tag.
Terms.getFeatureVariationTerm() Used to store the value of the variation sub-tag of the feature tag.
Terms.getFeatureStatusTerm() Used to store the status attribute of the feature tag.
Terms.getLocationSequenceTerm() Used to store the seq attribute of the location sub-tag of the feature tag.

New formats

If you want to add a new format, the best thing to do is to extend RichSequenceFormat.BasicFormat and go from there. In order to make your class work with the automatic format-guesser (RichSequence.IOTools.readFile()) you'll need to implement canRead() and guessSymbolTokenization(), and add a static initializer block to your class, similar to this:

public class MyFormat extends RichSequenceFormat.BasicFormat {
    static {
        RichSequence.IOTools.registerFormat(MyFormat.class);
    }
 
    // implement the rest of the class here ...
}
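One general Java point is worth keeping in mind with this pattern: a static initializer runs only when the JVM initializes the class, so a format class that has never been loaded cannot have registered itself. The sketch below illustrates this with plain Java. The registry and the MyFormat class in it are hypothetical stand-ins, and it makes no claim about how RichSequence.IOTools tracks formats internally.

```java
import java.util.ArrayList;
import java.util.List;

class StaticRegistrationSketch {
    // Hypothetical registry; stands in for whatever the real format-guesser keeps.
    static final List<Class<?>> REGISTRY = new ArrayList<>();

    // Hypothetical format class that registers itself from a static block.
    static class MyFormat {
        static {
            REGISTRY.add(MyFormat.class);
        }
    }

    static boolean isRegistered() {
        // Using the class literal alone does not run the static initializer.
        return REGISTRY.contains(MyFormat.class);
    }

    static boolean loadAndCheck() {
        try {
            // Class.forName(String) initializes the class, which runs its static block.
            Class.forName(MyFormat.class.getName());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return isRegistered();
    }

    public static void main(String[] args) {
        System.out.println("before load: " + isRegistered());
        System.out.println("after load:  " + loadAndCheck());
    }
}
```

If a custom format is not being picked up, making sure its class has been loaded (for example by instantiating it once) before calling the format-guesser is a reasonable first thing to check.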

NCBI Taxonomy data

The NCBI taxonomy loader operates outside the standard file parsing framework, as it is not dealing with a single file and does not generate sequence objects. Instead, it provides separate methods for reading the nodes.dmp and names.dmp files line by line, each returning the corresponding NCBITaxon object for the line just read. An example that loads the taxonomy data follows:

NCBITaxonomyLoader l = new SimpleNCBITaxonomyLoader();
BufferedReader nodes = new BufferedReader(new FileReader("nodes.dmp"));
BufferedReader names = new BufferedReader(new FileReader("names.dmp"));
        
NCBITaxon t;
while ((t=l.readNode(nodes))!=null);  // read all the nodes first
while ((t=l.readName(names))!=null);  // then read all the names 
 
// if your LRU cache is big enough, it'll now hold fully-populated instances 
// of all the taxon objects. Not much use unless you're using a database! 

Note that this is most effective when using BioJavaX with Hibernate to persist data to the database. You do not need to do anything apart from wrap the above code in a transaction, and it will be persisted for you.

Note that you may have trouble with duplicate NCBITaxon objects or names going missing if you have an LRU cache in RichObjectFactory that is too small. This issue is avoided altogether when using the BioSQLRichObjectFactory.
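To see why a small cache causes duplicates, consider the following sketch of an LRU cache built on java.util.LinkedHashMap. It is not the RichObjectFactory code: the class LruCacheSketch and its nested Taxon type are hypothetical, and the sketch only assumes that a cache miss leads to a new object being created.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCacheSketch {
    // Stand-in for a taxon object; object identity is what matters here.
    static final class Taxon {
        final int ncbiId;
        Taxon(int ncbiId) { this.ncbiId = ncbiId; }
    }

    private final Map<Integer, Taxon> cache;

    LruCacheSketch(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used
        this.cache = new LinkedHashMap<Integer, Taxon>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Taxon> eldest) {
                return size() > capacity; // evict once the cache is over capacity
            }
        };
    }

    // Return the cached instance, or create a new one if it was never seen or was evicted.
    Taxon get(int ncbiId) {
        Taxon t = cache.get(ncbiId);
        if (t == null) {
            t = new Taxon(ncbiId);
            cache.put(ncbiId, t);
        }
        return t;
    }

    public static void main(String[] args) {
        LruCacheSketch small = new LruCacheSketch(2);
        Taxon first = small.get(1);   // created while "reading nodes"
        small.get(2);
        small.get(3);                 // cache is full, taxon 1 is evicted
        Taxon again = small.get(1);   // looked up again while "reading names"
        System.out.println("same instance: " + (first == again));
    }
}
```

With a capacity of 2 the second lookup of taxon 1 yields a different object from the first, so anything attached to the second object (such as a name) never reaches the first one. With a capacity large enough to hold every taxon, both lookups return the same instance.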


When File Parsers Go Wrong

Sometimes you'll come across a file that is not strictly in the correct format, or you may even uncover a bug in one of the parsers. We always appreciate feedback in these cases, including the input file in question and a full stack trace. However, sometimes you may want to find the problem yourself, or even attempt to fix it! So we have produced the DebuggingRichSeqIOListener for this purpose.

The DebuggingRichSeqIOListener is a class that acts both as a BufferedInputStream, so it can be passed to a RichSequenceFormat for reading data, and as a RichSeqIOListener, so that it can be passed to the same RichSequenceFormat to listen to the sequence generation events. It dumps all input out to STDOUT as it reads it, and notifies every sequence generation event to STDOUT as it is received. This way you can see exactly at which points in the file the events are being generated, the data the format was working on at the time the event was generated, and if an exception happens, it will appear immediately after the section of the file that was in error.

The idea is that you do something like this (the example debugs the parsing of a FASTA file):

Namespace ns = RichObjectFactory.getDefaultNamespace();
InputStream is = new FileInputStream("myFastaFile.fasta");
FastaFormat format = new FastaFormat();
 
DebuggingRichSeqIOListener debug = new DebuggingRichSeqIOListener(is);
BufferedReader br = new BufferedReader(new InputStreamReader(debug));
 
SymbolTokenization symParser = format.guessSymbolTokenization(debug);
 
format.readRichSequence(br, symParser, debug, ns);

Note that you will often get bits of file repeated in the output, as the format runs backwards and forwards through the file between markers it has set. This is perfectly normal although it may look a little strange.
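The repetition is a direct consequence of echoing a stream that supports mark() and reset(). The following is a minimal sketch of that effect and not the DebuggingRichSeqIOListener source: the class EchoingStreamSketch is hypothetical, it collects the echo in a StringBuilder rather than printing it, and it instruments only the single-byte read() method.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class EchoingStreamSketch extends BufferedInputStream {
    // Every byte handed to the caller is also appended here.
    final StringBuilder echoed = new StringBuilder();

    EchoingStreamSketch(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            echoed.append((char) b);
        }
        return b;
    }

    // Peek at the first five bytes, rewind, then read everything.
    static String demo() {
        byte[] data = ">seq1\nACGT\n".getBytes(StandardCharsets.US_ASCII);
        try (EchoingStreamSketch in = new EchoingStreamSketch(new ByteArrayInputStream(data))) {
            in.mark(16);                 // set a marker at the start
            for (int i = 0; i < 5; i++) {
                in.read();               // look ahead, as a format guesser might
            }
            in.reset();                  // run back to the marker
            while (in.read() >= 0) {
                // consume the rest of the input
            }
            return in.echoed.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Because the five bytes read before reset() are read a second time afterwards, the echo begins with ">seq1>seq1": the header appears twice even though it occurs only once in the input.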

When reporting problems with file parsing, it would be very useful if you could run the above code on your chosen input file and chosen RichSequenceFormat, and send us a copy of the output along with the stack trace and input file.

Creative file parsing with RichSeqIOListener.

Using RichSeqIOListeners directly

In order to do creative file parsing, you need to start using very low level BioJava APIs. This involves setting up a RichSeqIOListener and allowing it to communicate directly with the RichSequenceFormat instances that parse files. You have to choose whether you want just to listen to data as it is read from the file, or whether you want to use these events to construct a RichSequence object.

Listening to events only

You need to write a class which implements RichSeqIOListener. The easiest way to do this is to extend RichSeqIOAdapter, which is a very simple implementation which ignores everything and returns dummy empty features whenever getCurrentFeature() is called.

You can then use your class like this (see the earlier section on RichStreamReader for how to construct the various other objects required):

BufferedReader input = ...;       // your input file
Namespace ns = ...;               // the namespace to read sequences into 
SymbolTokenization st = ...;      // the tokenization used to parse sequence data
RichSequenceFormat format = ...;  // the format that will parse the file
 
RichSeqIOListener listener = ...; // your custom listener object
 
boolean moreSeqsAvailable = true; // assume there is at least one sequence in the file
while (moreSeqsAvailable) {
     moreSeqsAvailable = format.readRichSequence(input, st, listener, ns);
     // your listener will have received all the information for the current sequence by this stage
}
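The shape of this approach (an adapter that ignores everything, a subclass that overrides only the events of interest, and a read loop driven by a boolean return value) can be tried out without BioJavaX on the classpath. In the sketch below every type is a hypothetical stand-in: MiniListener is far smaller than RichSeqIOListener, MiniAdapter plays the role of RichSeqIOAdapter, and readOne() is a toy FASTA-like reader rather than a real RichSequenceFormat.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ListenerLoopSketch {
    // Hypothetical, much reduced stand-in for RichSeqIOListener.
    interface MiniListener {
        void startSequence();
        void setName(String name);
        void addSymbols(String symbols);
        void endSequence();
    }

    // Stand-in for RichSeqIOAdapter: ignores every event.
    static class MiniAdapter implements MiniListener {
        public void startSequence() {}
        public void setName(String name) {}
        public void addSymbols(String symbols) {}
        public void endSequence() {}
    }

    // Custom listener: overrides only the event it cares about.
    static class NameCollector extends MiniAdapter {
        final List<String> names = new ArrayList<>();
        @Override
        public void setName(String name) {
            names.add(name);
        }
    }

    // Toy reader: consumes one record, fires events, returns true if more records remain.
    // Assumes lines shorter than 1024 characters so that reset() stays valid.
    static boolean readOne(BufferedReader in, MiniListener listener) throws IOException {
        String header = in.readLine();
        if (header == null || !header.startsWith(">")) {
            throw new IOException("expected a '>' header line");
        }
        listener.startSequence();
        listener.setName(header.substring(1).trim());
        boolean more = false;
        while (true) {
            in.mark(1024);
            String line = in.readLine();
            if (line == null) {
                break;                  // end of input
            }
            if (line.startsWith(">")) {
                in.reset();             // leave the next header for the next call
                more = true;
                break;
            }
            listener.addSymbols(line.trim());
        }
        listener.endSequence();
        return more;
    }

    // Same loop shape as above: keep reading while the reader reports more sequences.
    static List<String> collectNames(String fasta) {
        try (BufferedReader in = new BufferedReader(new StringReader(fasta))) {
            NameCollector listener = new NameCollector();
            boolean moreSeqsAvailable = true;
            while (moreSeqsAvailable) {
                moreSeqsAvailable = readOne(in, listener);
            }
            return listener.names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collectNames(">a\nACGT\n>b\nTTGA\n"));
    }
}
```

The listener never sees a sequence object. It only receives events, which is why this style suits tasks such as collecting names or counting records without building every sequence in memory.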

Constructing sequences from events

You need to write a class which implements both RichSeqIOListener and RichSequenceBuilder. Again you could just extend RichSeqIOAdapter, and implement the extra methods required by RichSequenceBuilder to make it fully functional. You will obviously need to store information passed to your insta