Class BlastXMLParser

  • All Implemented Interfaces:
    StAXContentHandler

    public class BlastXMLParser
    extends StAXContentHandlerBase
    This class parses NCBI Blast XML output.

    It has two modes: i) single output document mode: this takes a document containing a single BlastOutput element and parses it. Such a document is generated when a single query is searched against a sequence database.

    ii) multiple query document mode: unfortunately, NCBI BLAST concatenates the results of multiple searches in one file. This leads to an ill-formed document that violates every XML format known to the human race and other nearby civilisations. This parser will take a bowdlerised version of this output that is wrapped in a blast_aggregate element.

    The massaged form is generated by stripping the <?xml> and <!DOCTYPE> declarations and wrapping all the BlastOutput elements in a single blast_aggregate element. On Linux, this can be done with:

     #!/bin/sh
     # Converts a Blast XML output to something vaguely well-formed
     # for parsing.
     # Use: blast_aggregate <input file> <output file>
    
     # strips all <?xml> and <!DOCTYPE> tags
     # encapsulates the multiple <BlastOutput> elements into <blast_aggregate>
    
     sed '/<?xml/d' "$1" | sed '/<!DOCTYPE/d' | sed '1i\
     <blast_aggregate>
     $a\
     </blast_aggregate>' > "$2"
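
    Where sed is not available, the same massaging can be sketched in plain Java. This is only an illustration: the BlastAggregate class and its aggregate method are hypothetical names and are not part of BioJava. Like the script above, it assumes that each <?xml> and <!DOCTYPE> declaration sits on a line of its own, and it reads the whole file into memory, which may not suit very large outputs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring the shell script; not part of BioJava.
final class BlastAggregate {

    /** Drops declaration lines and wraps what remains in one blast_aggregate element. */
    static List<String> aggregate(List<String> lines) {
        List<String> out = new ArrayList<>();
        out.add("<blast_aggregate>");
        for (String line : lines) {
            // As in the sed script, discard any whole line that holds
            // an <?xml ...?> or <!DOCTYPE ...> declaration.
            if (line.contains("<?xml") || line.contains("<!DOCTYPE")) {
                continue;
            }
            out.add(line);
        }
        out.add("</blast_aggregate>");
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Use: java BlastAggregate <input file> <output file>");
            System.exit(1);
        }
        // Assumes the BLAST output is UTF-8 encoded.
        List<String> in = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), aggregate(in), StandardCharsets.UTF_8);
    }
}
```

    The resulting file has a single blast_aggregate root element and so should be acceptable input for the multiple query document mode described above.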
    
    Author:
    David Huen
    • Field Detail

      • staxenv

        public org.biojava.bio.program.sax.blastxml.StAXFeatureHandler staxenv
        Nesting class that provides callback interfaces to nested class