BioJava:Tutorial:Sequence IO basics
By Thomas Down
This chapter covers the BioJava support for handling biological sequence
data available in the form of files. It covers the API provided by the
package org.biojava.bio.seq.io. For a complete overview of the API
provided in this package, consult the JavaDoc API documentation (latest
BioJava 1.8).
NOTE: this chapter has been updated for BioJava release 1.2.
Getting started with sequence I/O
The BioJava sequence I/O code is designed to be flexible and easy to
adapt for a wide variety of purposes. However, if you don’t need this
flexibility, there are some convenience methods which set up the parsers
for reading a variety of common formats. All these methods take a Java
BufferedReader object, and return an iterator which allows you to scan
through the sequences in a file. For example:
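As a rough illustration of the "BufferedReader in, iterator out" pattern, here is a minimal plain-Java sketch that scans FASTA entries from a reader. It does not use BioJava: the `FastaScanSketch` class, its `Entry` type, and the `readFasta` method are hypothetical stand-ins written for this example. With BioJava 1.x on the classpath, the corresponding convenience call is `SeqIOTools.readFastaDNA(br)`, which returns a `SequenceIterator`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Plain-Java stand-in for the pattern; none of these names are BioJava API.
// The BioJava 1.x equivalent would be:
//   SequenceIterator stream = SeqIOTools.readFastaDNA(br);
final class FastaScanSketch {

    /** Hypothetical record type: one description line plus its residues. */
    static final class Entry {
        final String description;
        final String residues;

        Entry(String description, String residues) {
            this.description = description;
            this.residues = residues;
        }
    }

    /** Returns an iterator that reads one FASTA entry at a time from the reader. */
    static Iterator<Entry> readFasta(final BufferedReader br) {
        return new Iterator<Entry>() {
            private String header;   // '>' line of the entry to return next
            private boolean primed;  // true once the first header has been looked for

            private String readLine() {
                try {
                    return br.readLine();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override
            public boolean hasNext() {
                if (!primed) {
                    primed = true;
                    String line = readLine();
                    while (line != null && !line.startsWith(">")) {
                        line = readLine(); // skip anything before the first entry
                    }
                    header = line;
                }
                return header != null;
            }

            @Override
            public Entry next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String description = header.substring(1).trim();
                StringBuilder residues = new StringBuilder();
                String line = readLine();
                while (line != null && !line.startsWith(">")) {
                    residues.append(line.trim());
                    line = readLine();
                }
                header = line; // next header, or null at end of stream
                return new Entry(description, residues.toString());
            }
        };
    }

    public static void main(String[] args) {
        BufferedReader br = new BufferedReader(
                new StringReader(">seq1 test\nACGT\nGG\n>seq2\nTTAA\n"));
        Iterator<Entry> stream = readFasta(br);
        while (stream.hasNext()) {
            Entry e = stream.next();
            System.out.println(e.description + ": " + e.residues.length() + " residues");
        }
    }
}
```

The caller never sees the parsing state: it only asks whether another sequence is available and then takes it, which is the same shape as the loop over a `SequenceIterator`.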
For a full list of formats supported in this way, check the JavaDoc
documentation for the SeqIOTools class.
Sequence input goals
A typical biological sequence file contains three things:
- Global information about the sequence (ID, species, etc.)
- Annotations to specific regions of the sequence
- Actual sequence data
Actual file formats need not provide all of these. For instance, FASTA files contain almost pure sequence data – the only other information is a single description line for each sequence. At the other extreme, GFF files are simply a list of features, with no sequence data in the file.
A BioJava Sequence object contains the same kinds of information as a
sequence file. The primary aim of the input architecture is to
take a stream containing sequence file data, and return one or more
Sequence objects. In addition, there are two other goals:
- Decoupled Sequence creation: BioJava represents sequence data using the Sequence interface. We allow multiple implementations of this interface, which might be optimized for quite specific purposes. Some implementations will be purely in-memory objects, while others might be persistent objects reflecting data in some kind of database. We want to allow you to create any kind of Sequence object from a given data stream.
- Pluggable filters: Not all users will wish to exactly reflect the contents of a sequence file as a Sequence object. Sometimes it is useful to select specific pieces of data from a file, or to change it into some other format. For instance, BioJava has a hierarchical model for features attached to a sequence, whereas many file formats (for instance, EMBL) do not. You might wish to rebuild some kind of feature hierarchy from an EMBL flatfile during the parsing process.
SequenceBuilders
The sequence input framework is based around the SequenceBuilder
interface (this is actually a sub-interface of SeqIOListener, but for
these purposes you will usually be using the SequenceBuilder
interface). The role of a sequence builder is to accumulate information
discovered while parsing a sequence file, and ultimately to construct a
Sequence object.
There are two kinds of SequenceBuilder implementations:
- Builders: These actually construct new Sequence objects. Generally, there will just be one Builder implementation for each Sequence implementation. The basic BioJava library provides one Builder implementation, SimpleSequenceBuilder, which constructs simple in-memory representations for any kind of sequence data.
- Filters: These don’t construct Sequence objects themselves, but are chained to another sequence builder. When they are notified of data, they perform some processing, then pass the information on to the next sequence builder in the chain.
Whenever a sequence builder is required, you can either simply provide a ‘Builder’ implementation, or you can create a chain consisting of one or more ‘Filters’, leading ultimately to a ‘Builder’.
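The Builder/Filter split can be sketched in plain Java as follows. This is a simplified illustration, not the BioJava API: `MiniBuilder`, `MiniSequence`, `SimpleMiniBuilder`, `DescriptionSplitter`, and the `description_line` property key are all hypothetical stand-ins, and the real SeqIOListener and SequenceBuilder interfaces define more callbacks than the two shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins for the builder/filter roles.
final class BuilderChainSketch {

    /** Receives parse events, then produces a result object. */
    interface MiniBuilder {
        void addProperty(String key, String value);
        void addSymbols(String symbols);
        MiniSequence make();
    }

    /** Simplified stand-in for a Sequence: symbols plus global properties. */
    static final class MiniSequence {
        final String symbols;
        final Map<String, String> properties;

        MiniSequence(String symbols, Map<String, String> properties) {
            this.symbols = symbols;
            this.properties = new LinkedHashMap<>(properties);
        }
    }

    /** 'Builder': accumulates events and constructs the final object. */
    static final class SimpleMiniBuilder implements MiniBuilder {
        private final StringBuilder symbols = new StringBuilder();
        private final Map<String, String> properties = new LinkedHashMap<>();

        @Override
        public void addProperty(String key, String value) {
            properties.put(key, value);
        }

        @Override
        public void addSymbols(String s) {
            symbols.append(s);
        }

        @Override
        public MiniSequence make() {
            return new MiniSequence(symbols.toString(), properties);
        }
    }

    /** 'Filter': rewrites one kind of event and forwards everything to its delegate. */
    static final class DescriptionSplitter implements MiniBuilder {
        private final MiniBuilder delegate;

        DescriptionSplitter(MiniBuilder delegate) {
            this.delegate = delegate;
        }

        @Override
        public void addProperty(String key, String value) {
            if ("description_line".equals(key)) {
                // Split "ID free text" into separate id and description properties.
                String[] parts = value.trim().split("\\s+", 2);
                delegate.addProperty("id", parts[0]);
                if (parts.length > 1) {
                    delegate.addProperty("description", parts[1]);
                }
            } else {
                delegate.addProperty(key, value);
            }
        }

        @Override
        public void addSymbols(String s) {
            delegate.addSymbols(s); // passed through unchanged
        }

        @Override
        public MiniSequence make() {
            return delegate.make(); // only the Builder at the end constructs anything
        }
    }

    public static void main(String[] args) {
        // A chain of one Filter leading to a Builder.
        MiniBuilder chain = new DescriptionSplitter(new SimpleMiniBuilder());
        chain.addProperty("description_line", "AB000001 Homo sapiens example");
        chain.addSymbols("acgt");
        chain.addSymbols("ttga");
        MiniSequence seq = chain.make();
        System.out.println(seq.properties + " " + seq.symbols);
    }
}
```

Because the filter implements the same interface as the builder, the parser does not need to know whether it is talking to a single Builder or to a chain.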
A SequenceBuilder object should only be used once. If multiple
sequences are being read from a stream, a new SequenceBuilder (or
chain) should be constructed for each one. For convenience, we provide a
SequenceBuilderFactory interface, whose sole purpose is to encapsulate
the construction of SequenceBuilder objects. Each SequenceBuilder
implementation should provide a suitable factory implementation as well.
For ‘Builder’ implementations, it is usually possible to provide a
‘singleton’ factory object. For SimpleSequenceBuilder this is the
static field SimpleSequenceBuilder.FACTORY. For filters, the factory
must be parameterized with another SequenceBuilderFactory so that a
complete chain can be constructed. For instance:
SequenceBuilderFactory mySBF =
new EmblProcessor.Factory(SimpleSequenceBuilder.FACTORY);
Authors of new SequenceBuilder implementations are encouraged to
consider this naming style when implementing SequenceBuilderFactory.
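The naming style described here (a static FACTORY field for Builders, a nested Factory class for Filters) can be sketched in plain Java. The types below are hypothetical stand-ins rather than BioJava classes: `MiniBuilderFactory.makeBuilder()` plays the role of the factory method, and `UpperCaseFilter` is an invented filter used only to make the chain visible.

```java
import java.util.Locale;

// Hypothetical stand-ins illustrating the factory naming convention.
final class FactoryChainSketch {

    interface MiniBuilder {
        void addSymbols(String s);
        String make();
    }

    /** Encapsulates construction of a fresh builder (or chain) per sequence. */
    interface MiniBuilderFactory {
        MiniBuilder makeBuilder();
    }

    /** 'Builder' with a stateless singleton factory in a static field. */
    static final class SimpleMiniBuilder implements MiniBuilder {
        static final MiniBuilderFactory FACTORY = new MiniBuilderFactory() {
            @Override
            public MiniBuilder makeBuilder() {
                return new SimpleMiniBuilder();
            }
        };

        private final StringBuilder symbols = new StringBuilder();

        @Override
        public void addSymbols(String s) {
            symbols.append(s);
        }

        @Override
        public String make() {
            return symbols.toString();
        }
    }

    /** 'Filter' that upper-cases symbols before passing them on. */
    static final class UpperCaseFilter implements MiniBuilder {

        /** Nested factory, parameterized with the factory for the rest of the chain. */
        static final class Factory implements MiniBuilderFactory {
            private final MiniBuilderFactory next;

            Factory(MiniBuilderFactory next) {
                this.next = next;
            }

            @Override
            public MiniBuilder makeBuilder() {
                return new UpperCaseFilter(next.makeBuilder());
            }
        }

        private final MiniBuilder delegate;

        UpperCaseFilter(MiniBuilder delegate) {
            this.delegate = delegate;
        }

        @Override
        public void addSymbols(String s) {
            delegate.addSymbols(s.toUpperCase(Locale.ROOT));
        }

        @Override
        public String make() {
            return delegate.make();
        }
    }

    public static void main(String[] args) {
        MiniBuilderFactory factory = new UpperCaseFilter.Factory(SimpleMiniBuilder.FACTORY);
        for (String record : new String[] {"acgt", "ttga"}) {
            MiniBuilder builder = factory.makeBuilder(); // fresh chain for each sequence
            builder.addSymbols(record);
            System.out.println(builder.make());
        }
    }
}
```

Each call to the outer factory builds a complete new chain, which is what makes the "use a builder only once" rule easy to follow when reading many sequences from one stream.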
Putting it together: StreamReader
The simplest way to use the BioJava sequence input code is to construct
a StreamReader. The constructor takes four parameters:
- A normal Java BufferedReader object, encapsulating the stream of data to parse.
- A SequenceFormat object, which is responsible for actually parsing sequence data from the stream.
- A SymbolTokenization object, which represents a mapping from textual characters to BioJava Symbol objects.
- A SequenceBuilderFactory object to support construction of Sequence objects.
A StreamReader object might be constructed as follows:
// DNA alphabet and its tokenization (maps text characters to Symbols)
Alphabet dna = DNATools.getDNA();
SymbolTokenization dnaParser = dna.getTokenization("token");
BufferedReader br = new BufferedReader(new FileReader(fileName));
// Filter chain: parse the FASTA description line, then build in-memory sequences
SequenceBuilderFactory sbf = new FastaDescriptionLineParser.Factory(SimpleSequenceBuilder.FACTORY);
StreamReader stream = new StreamReader(br, new FastaFormat(), dnaParser, sbf);
(This is just a snippet from the example program in chapter 1, and you may like to refer back for more information.)
The StreamReader class implements the SequenceIterator interface, so
you can easily iterate over all sequences in a stream:
while (stream.hasNext()) {
    Sequence seq = stream.nextSequence();
    // Perform some processing on seq
}
Another application: IndexedSequenceDB
As biology enters the post-genomic era, it is common to need to work
with databases of sequence data far too large to fit in available
memory. One way to handle large amounts of sequence is to use a
dedicated database system: either a specialized solution such as
ACeDB or a set of tables in a standard database
application, as used by the Ensembl project.
If, however, you don’t wish to use one of these solutions, BioJava
offers a simple and efficient sequence database implementation backed by
one or more sequence files on disk. These files can be in any format, so
long as a suitable SequenceFormat class exists.
As a simple example of an IndexedSequenceDB in use, the following
servlet retrieves sequences from a large database, and sends them on to
the client in FASTA format. The database could be created using the
CreateIndex and AddFiles programs included in the BioJava demos
directory.
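To illustrate the general idea of a sequence database backed by files on disk, here is a minimal plain-Java sketch that indexes a FASTA file by byte offset and then fetches single records without loading the whole file. It is neither the IndexedSequenceDB implementation nor the servlet: `FastaOffsetIndexSketch`, `buildIndex`, and `fetch` are hypothetical names, and the sketch assumes an ASCII file whose record IDs are the first word of each description line.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a file-backed index; not BioJava code.
final class FastaOffsetIndexSketch {

    /** Scans the file once, recording the byte offset of each '>' line by ID. */
    static Map<String, Long> buildIndex(Path fasta) throws IOException {
        Map<String, Long> index = new LinkedHashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(fasta.toFile(), "r")) {
            long offset = raf.getFilePointer();
            String line;
            // RandomAccessFile.readLine is unbuffered; acceptable for a sketch.
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    String id = line.substring(1).trim().split("\\s+", 2)[0];
                    index.put(id, offset);
                }
                offset = raf.getFilePointer(); // start of the next line
            }
        }
        return index;
    }

    /** Seeks to the recorded offset and returns that record in FASTA format. */
    static String fetch(Path fasta, Map<String, Long> index, String id) throws IOException {
        Long offset = index.get(id);
        if (offset == null) {
            throw new IllegalArgumentException("Unknown ID: " + id);
        }
        StringBuilder record = new StringBuilder();
        try (RandomAccessFile raf = new RandomAccessFile(fasta.toFile(), "r")) {
            raf.seek(offset);
            record.append(raf.readLine()).append('\n'); // description line
            String line;
            while ((line = raf.readLine()) != null && !line.startsWith(">")) {
                record.append(line).append('\n');
            }
        }
        return record.toString();
    }

    public static void main(String[] args) throws IOException {
        Path fasta = Files.createTempFile("sketch", ".fa");
        try {
            Files.write(fasta, ">a first\nACGT\nGG\n>b\nTT\n".getBytes(StandardCharsets.US_ASCII));
            Map<String, Long> index = buildIndex(fasta);
            System.out.print(fetch(fasta, index, "b"));
        } finally {
            Files.deleteIfExists(fasta);
        }
    }
}
```

Only the index has to be held in memory, so the cost of a lookup depends on the size of one record rather than on the size of the whole file.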