BioJava:CookBook:genome:Overview

The biojava3-genome library leverages the sequence relationships in biojava3-core to read gtf, gff2 and gff3 files and to write gff3 files. The gtf, gff2 and gff3 file formats are well defined, but what gets written in a file is very flexible. We currently support reading gff files generated by the open source gene prediction applications GeneID, GeneMark and GlimmerHMM. Each prediction algorithm uses a different ontology to describe coding sequences, exons, and start or stop codons, which makes it difficult to write a general-purpose gff parser that can create biologically meaningful objects. If your application simply loads a gff file and draws a colored glyph, you don’t need to worry about the ontology used. Otherwise, it is easier to support the popular gene prediction algorithms by writing a parser that is aware of each gene prediction application's ontology.
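To make the ontology problem concrete, here is a minimal sketch in plain Java that does not use the biojava3-genome parsers. It splits one record into the nine tab-separated columns shared by gtf and gff files, then maps an application-specific feature name onto a common term. The class name, method names and vocabulary tables are hypothetical and only illustrate the idea; the feature names a given predictor actually writes should be checked against its own output.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: not the biojava3-genome parser. All names here are hypothetical.
final class GffLineSketch {

    // Illustrative per-application vocabularies (assumed, not taken from the library).
    private static final Map<String, Map<String, String>> VOCAB = new HashMap<>();
    static {
        Map<String, String> geneId = new HashMap<>();
        geneId.put("First", "CDS");
        geneId.put("Internal", "CDS");
        geneId.put("Terminal", "CDS");
        geneId.put("Single", "CDS");
        VOCAB.put("geneid", geneId);

        Map<String, String> geneMark = new HashMap<>();
        geneMark.put("CDS", "CDS");
        geneMark.put("start_codon", "start_codon");
        geneMark.put("stop_codon", "stop_codon");
        VOCAB.put("genemark", geneMark);
    }

    /** Splits one gff/gtf record into its nine tab-separated columns. */
    static String[] splitColumns(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 9) {
            throw new IllegalArgumentException("expected 9 columns, got " + cols.length);
        }
        return cols;
    }

    /** Maps an application-specific feature name to a common term, or returns it unchanged. */
    static String normalizeFeature(String application, String feature) {
        Map<String, String> vocab = VOCAB.get(application.toLowerCase());
        if (vocab == null) {
            return feature; // unknown application: leave the name as written
        }
        String mapped = vocab.get(feature);
        return mapped != null ? mapped : feature;
    }

    public static void main(String[] args) {
        String line = "scaffold1\tgeneid\tInternal\t100\t250\t0.87\t+\t0\tgene_1";
        String[] cols = splitColumns(line);
        // Column 3 (index 2) holds the feature name.
        System.out.println(cols[2] + " -> " + normalizeFeature("geneid", cols[2]));
    }
}
```

Keeping one vocabulary table per application means supporting a new predictor only requires a new table, not a change to the column parsing.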

The following code example takes a 454 scaffold fasta file that GeneMark used to predict genes, together with the gtf file GeneMark produced, and returns a collection of ChromosomeSequences. Each ChromosomeSequence maps to a named entry in the fasta file and contains N gene sequences. The gene sequences can be on the + or - strand, with frame shifts and multiple transcripts.

Passing the collection of ChromosomeSequences to GeneFeatureHelper.getProteinSequences returns all protein sequences. You can then write the protein sequences to a fasta file.


`LinkedHashMap<String, ChromosomeSequence> chromosomeSequenceList = GeneFeatureHelper.loadFastaAddGeneFeaturesFromGeneMarkGTF(new File("454Scaffolds.fna"), new File("genemark_hmm.gtf"));`  
`LinkedHashMap<String, ProteinSequence> proteinSequenceList = GeneFeatureHelper.getProteinSequences(chromosomeSequenceList.values());`  
`FastaWriterHelper.writeProteinSequence(new File("genemark_proteins.faa"), proteinSequenceList.values());`

You can also output the gene sequences to a fasta file in which the coding regions are upper case and the non-coding regions are lower case.


`LinkedHashMap<String, GeneSequence> geneSequenceHashMap = GeneFeatureHelper.getGeneSequences(chromosomeSequenceList.values());`  
`Collection<GeneSequence> geneSequences = geneSequenceHashMap.values();`  
`FastaWriterHelper.writeGeneSequence(new File("genemark_genes.fna"), geneSequences, true);`
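The upper/lower case convention can be illustrated with a small standalone sketch. This is not how FastaWriterHelper is implemented; the class name, method name and the 1-based inclusive range representation are assumptions made for the example.

```java
// Sketch only: shows the casing convention, not the library's implementation.
final class CodingCaseSketch {

    /**
     * Returns the sequence in lower case with the given coding ranges in upper case.
     * Each range is a {start, end} pair, 1-based and inclusive (an assumption of this sketch).
     */
    static String markCoding(String sequence, int[][] codingRanges) {
        char[] out = sequence.toLowerCase().toCharArray();
        for (int[] range : codingRanges) {
            for (int pos = range[0]; pos <= range[1]; pos++) {
                out[pos - 1] = Character.toUpperCase(out[pos - 1]);
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // Two coding regions: positions 3-5 and 9-10.
        System.out.println(markCoding("acgtacgtacgt", new int[][] {{3, 5}, {9, 10}}));
        // prints "acGTAcgtACgt"
    }
}
```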

You can easily write out a gff3 view of a ChromosomeSequence with the following code.


`FileOutputStream fo = new FileOutputStream("genemark.gff3");`  
`GFF3Writer gff3Writer = new GFF3Writer();`  
`gff3Writer.write(fo, chromosomeSequenceList);`  
`fo.close();`

The ChromosomeSequence becomes the middle layer that represents the essence of what is mapped in a gtf, gff2 or gff3 file. This makes it fairly easy to write code that converts from gtf to gff3 or from gff2 to gtf. The challenge is picking the correct ontology when writing gtf or gff2 output. You could use the feature names of a specific gene prediction program or the features supported by your favorite genome browser. We would like to provide a complete set of Java classes for these conversions, with the list of supported gene prediction applications and genome browsers growing in response to end user requests.
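One part of a gtf to gff3 conversion that does not depend on the ontology is the attributes column: gtf writes `key "value";` pairs, while gff3 writes `key=value` pairs separated by semicolons. The sketch below is plain Java, independent of biojava3-genome, and shows that rewrite for a single column. The class and method names are hypothetical, mapping `transcript_id` to `Parent` is an assumption chosen for illustration, and the escaping of reserved characters that gff3 requires is not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: rewrites a gtf attributes column in gff3 style. Names are hypothetical.
final class GtfToGff3AttributesSketch {

    // Matches one gtf attribute of the form: key "value"
    private static final Pattern PAIR = Pattern.compile("(\\S+)\\s+\"([^\"]*)\"");

    /** Parses a gtf attributes column into an ordered key/value map. */
    static Map<String, String> parseGtfAttributes(String column) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(column);
        while (m.find()) {
            attrs.put(m.group(1), m.group(2));
        }
        return attrs;
    }

    /** Renders attributes as key=value pairs joined by ';' (no escaping is done). */
    static String toGff3Attributes(Map<String, String> attrs) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            // Assumption for this sketch: the transcript is the feature's parent.
            String key = "transcript_id".equals(e.getKey()) ? "Parent" : e.getKey();
            sb.append(key).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String gtf = "gene_id \"g1\"; transcript_id \"t1\";";
        System.out.println(toGff3Attributes(parseGtfAttributes(gtf)));
        // prints "gene_id=g1;Parent=t1"
    }
}
```

A full converter would also have to decide which feature names to emit, which is the ontology choice discussed above.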