BioJava:Tutorial:Sequences and Features

Tutorial

By Thomas Down

Chapter 1 of this tutorial covered the SymbolList interface, BioJava’s basic representation of biological sequence data. This chapter examines the Sequence interface. This adds extra functionality to SymbolList, providing a convenient way to handle annotated sequences from biological database. This chapter concentrates on classes and interfaces defined in the package org.biojava.bio.seq. For full descriptions of all the API used here, please consult the JavaDoc API documentation (latest biojava 1.8).

A tour of a Sequence

Sequence is a sub-interface of SymbolList. Thus, all the standard methods for accessing sequence data in a symbol list can equally be applied to a sequence, and sequences can be passed to any analysis methods which normally expect to receive a symbol list. The Sequence interface adds two types of additional data to a symbol list:

  • Global annotations, such as names, database identifiers, and literature references
  • Location-specific annotations (so called features)

Two pieces of global annotation information are considered to be sufficiently important that they have dedicated accessor methods. The name of the sequence is a simple string description of the sequence: normally the name or accession number of the sequence in the database from which it is retrieved. The getURN method, on the other hand, should return a more structured identifier for the sequence, represented as a Uniform Resource Identifier (URI) e.g.:

  • urn:sequence/embl:AL121903
  • file:///home/thomas/genome.fasta|rpoN
  • <nowiki>http://adzel.casseiopeia.org/seqs/myseqs.fasta|seq0001</nowiki>
  • acedb://humace.sanger.ac.uk/DNA/AL121903

URNs are a special class of URIs which represent global names for ‘well known’ resources. Note that, despite the method name, it may not be appropriate to give an actual URN for sequences. However, for sequences from databases such as EMBL, where many sites have local installations, use of URNs is encouraged.

The exact use of the name and URN properties is currently dependent to some extent on how the sequence was loaded. As BioJava enters more common use, more formal definitions of these properties will emerge.

Other annotations

In additions to the two ‘identifier’ properties of the sequence, it may have other annotation data associated with it. BioJava contains an Annotation interface, which represents a set of key-value pairs, a little like a Java Map (indeed, Annotation has an asMap method).

Sequence seq = getSequence();
Annotation seqAn = seq.getAnnotation();
for (Iterator i = seqAn.keys().iterator(); i.hasNext(); ) {
    Object key = i.next();
    Object value = seqAn.getProperty(key);
    System.out.println(key.toString() + ": " + value.toString());
}

Annotation objects aren’t just used in sequences - many other BioJava objects, including Features, can also have annotations associated with them.

Currently, there are no specific conventions for the kind of data which might be found in an annotation. In general, the keys should be strings (although there is no requirement that this be the case). But the values may be any Java object. More guidelines for the contents of Annotation objects may be introduced as BioJava develops.

Features and FeatureHolders

A feature represents a region of a sequence with some defined properties attached. Typically, features might represent structures such as genes and repeat elements on chromosomes, or alpha helices in proteins. As a Java interface, Feature has the following basic properties:

  • A location within the sequence, represented by a Location object. This has a defined start and end (equal in the case of point locations), and may or may not be contiguous.
  • A type (for instance, “gene” or “helix”).
  • A source (often the name of the program which discovered the feature.
  • An Annotation object, which can contain any other data.

In addition, all features have a place in a ‘tree’ of features, attached to a sequence. Features cannot be created independently of a sequence.

If a large class of features exists which have important properties over and above those represented in the Feature interface, a sub-interface of Feature may be defined. Currently, there is only one such sub-interface in the BioJava core: StrandedFeature. This is used for features in duplex DNA which have a defined directionality. For instance, genes would normally be represented with StrandedFeature, while some kinds of regulatory region might be plain features.

Sets of features are stored in objects implementing the FeatureHolder interface. Sequence is a sub-interface of FeatureHolder. Feature itself also extends FeatureHolder, giving the possibility of representing ‘nested’ features. For instance, a feature representing a large genetic regulatory region might contain sub-features annotating individual transcription factor binding sites. The recursive method below will print a simple text representation of a tree of features:

public void printFeatures(FeatureHolder fh, PrintWriter pw, String prefix)
{
    for (Iterator i = fh.features(); i.hasNext(); ) {
        Feature f = (Feature) i.next();
    pw.print(prefix);
    pw.print(f.getType());
    pw.print(" at ");
    pw.print(f.getLocation().toString());
    pw.println();
    printFeatures(f, pw, prefix + "    ");
    }
}

All Feature implementations include two methods which indicate how it fits into a feature tree. getParent returns the FeatureHolder object (Sequence or Feature) which is the feature’s immediate parent, while getSequence returns the Sequence object which is the root of the tree. Feature objects are always associated with a specific sequence, and always have exactly one parent FeatureHolder.

Creating new features

It is expected that there will never be any publicly visible implementations of Feature or its sub-interfaces. Instead, features should be produced using the createFeature method of a FeatureHolder object. This ensures that there are no ‘orphan’ features, not properly attached to a parent sequence. It also gives Sequence implementors the chance to control the attachment of features to their sequence class. Some sequences may only accept certain kinds of features. Other implementations, especially those intimately coupled with database storage mechanisms, may wish to use their own special implementations of the Feature interface.

The createFeature method has the following signature:

public Feature createFeature(Feature.Template template);

there is no requirement that a particular FeatureHolder object should include a working implementation of this method. If it is not possible to create a new child feature, UnsupportedOperationException will be thrown. In particular, this method is only implemented by Sequence and Feature objects. When FeatureHolder instances are used to return arbitrary ‘bags’ of features, they will never support this method.

Feature.Template is a concrete nested class of the Feature interface. It just contains public fields corresponding to each property of Feature. A feature could be attached to a sequence as follows:

Feature.Template template = new Feature.Template();
template.type = "TestFeature";
template.source = "Test";
template.location = new RangeLocation(100, 200);
template.annotation = Annotation.EMPTY_ANNOTATION;
mySequence.createFeature(template);

Every sub-interface of Feature should have a nested class, also named Template, which extends Feature.Template and adds any extra fields needed to construct that specialized kind of feature.