BioJava:Cookbook:Annotations:Filter
How do I filter sequences based on their species?
The species field of a GenBank SwissProt or EMBL file ends up as an Annotation entry. Essentially all you need to do is get the species property from a sequences Annotation and check to see if it is what you want.
The species property name depends on the source: for EMBL or SwissProt it is “OS” for GenBank it is “Organism”.
The following program will read in Sequences from a file and filter them according to their species. The same general recipe with a little modification could be used for any Annotation property.
```java import java.io.*;
import org.biojava.bio.*; import org.biojava.bio.seq.*; import org.biojava.bio.seq.db.*; import org.biojava.bio.seq.io.*;
public class FilterEMBLBySpecies {
public static void main(String[] args) {
try {
//read an EMBL file specified in args[0]
BufferedReader br = new BufferedReader(new FileReader(args[0]));
SequenceIterator iter = SeqIOTools.readEmbl(br);
//the species name to search for (specified by args[1]);
String species = args[1];
//A sequenceDB to store the filtered Seqs
SequenceDB db = new HashSequenceDB();
//As each sequence is read
while(iter.hasNext()){
Sequence seq = iter.nextSequence();
Annotation anno = seq.getAnnotation();
//check the annotation for Embl organism field "OS"
if(anno.containsProperty("OS")){
String property = (String)anno.getProperty("OS");
//check the value of the property, could also do this with a regular expression
if(property.startsWith(species)){
db.addSequence(seq);
}
}
}
//write the sequences as FASTA
SeqIOTools.writeFasta(System.out, db);
}
catch (Exception ex) {
ex.printStackTrace();
}
}
} ```