Process files as streams of records, each composed of tags with associated values.


Many files in biology are structured as multiple records, each of which can be broken down into lines composed of a tag and an associated value. For example, EMBL files have two-letter tags, and values extend from column 5 to the end of the line. A vast array of file formats share broadly similar structures, and this package aims to provide a framework within which parsing strategies and data consumers can be reused as much as possible.
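As an illustration of the EMBL line layout described above, the following is a minimal sketch rather than the package's own parser. The class EmblLineSketch and its split method are hypothetical names, and the sketch assumes that "column 5" is a zero-based character offset.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper, not part of the package: splits one EMBL-style line
// into a two-letter tag and a value that starts at column 5 (assumed zero-based).
final class EmblLineSketch {
    static Map.Entry<String, String> split(String line) {
        // The tag occupies the first two characters of the line.
        String tag = line.length() >= 2 ? line.substring(0, 2) : line;
        // The value runs from column 5 to the end of the line; short lines carry no value.
        String value = line.length() > 5 ? line.substring(5) : "";
        return new SimpleEntry<>(tag, value);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> tv = split("ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.");
        System.out.println(tv.getKey() + " -> " + tv.getValue());
    }
}
```

A record separator line such as "//" yields the tag "//" and an empty value in this sketch; a real parsing strategy would treat it as the end of the record instead.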

The data associated with each record is represented by a stream of events encapsulated by callbacks on the TagValueListener interface. It is up to the user to provide implementations of this interface that build static representations of the data if they so wish.
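A listener that builds a static representation might look like the sketch below. SketchListener is a simplified, hypothetical stand-in for the callback interface (the real TagValueListener signatures may differ), and MapBuildingListener is likewise an illustrative name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical stand-in for the listener callbacks.
interface SketchListener {
    void startRecord();
    void startTag(Object tag);
    void value(Object value);
    void endTag();
    void endRecord();
}

// Collects each record into a map from tag to the list of values seen for that tag.
final class MapBuildingListener implements SketchListener {
    private final List<Map<Object, List<Object>>> records = new ArrayList<>();
    private Map<Object, List<Object>> current;
    private Object currentTag;

    public void startRecord() { current = new LinkedHashMap<>(); }

    public void startTag(Object tag) {
        currentTag = tag;
        // A tag may recur within a record, so keep any values already stored for it.
        current.computeIfAbsent(tag, t -> new ArrayList<>());
    }

    public void value(Object value) { current.get(currentTag).add(value); }

    public void endTag() { currentTag = null; }

    public void endRecord() {
        records.add(current);
        current = null;
    }

    // The finished records, in the order in which they ended.
    List<Map<Object, List<Object>>> records() { return records; }
}
```

Because the listener only reacts to events, the same class could be driven by any parsing strategy that emits this sequence of callbacks.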

The Parser and Pushing Sub-Documents

File formats often have embedded sub-documents. For example, in EMBL format files the feature table area is identical to that in GenBank files if the first five columns are ignored. In ACeDB files, every time an ace-tag is found, it causes a new sub-document to be induced with its own structure and set of allowed tags and values. Similarly, Python source code uses indentation depth to delimit code blocks.

Parser allows TagValueListener objects to request that all of the values associated with the current tag be handled by a new TagValueParser and TagValueListener pair. The Parser instance will use the original TagValueParser to process the line as before, then take the value that would have been handed to the listener's value method and present it to the newly registered TagValueParser to tokenize into tag and value portions. That tag and value will then be passed on to the new TagValueListener. The new TagValueListener can itself choose to push a new pair of parser and listener to start a new sub-sub-document. This can be repeated to arbitrary depth. As soon as a parser and listener pair is registered, the pushed listener receives a startRecord() message. Once the entire containing record ends (due to a record separator line such as "//", or because the end of file has been reached), or if the tag that caused the delegation ends, the pushed listener will receive the appropriate endTag() message and also endRecord().
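The delegation mechanism described above can be sketched with a chain of frames, one per nesting level. This is not the Parser implementation: every type below (SubSplitter, SubContext, SubListener, SubFrame, LoggingSubListener, SubDocumentSketch) is a hypothetical stand-in, the tag "SD" and the "key=value" sub-format are invented for the demonstration, and the sketch assumes that the value which triggered a push is re-presented to the newly pushed pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Every type here is a hypothetical stand-in, not one of the package's classes.
interface SubSplitter { String[] split(String text); } // returns {tag, value}
interface SubContext { void push(SubSplitter splitter, SubListener listener); }
interface SubListener {
    void startRecord();
    void startTag(String tag);
    void value(SubContext ctx, String value);
    void endTag();
    void endRecord();
}

// One nesting level: a splitter/listener pair plus an optional pushed child.
final class SubFrame implements SubContext {
    private final SubSplitter splitter;
    private final SubListener listener;
    private String tag;
    private SubFrame child;

    SubFrame(SubSplitter splitter, SubListener listener) {
        this.splitter = splitter;
        this.listener = listener;
        listener.startRecord(); // a pair sees startRecord() as soon as it is registered
    }

    public void push(SubSplitter s, SubListener l) { child = new SubFrame(s, l); }

    void feed(String text) {
        String[] tv = splitter.split(text);
        if (!tv[0].equals(tag)) { // a different tag ends the previous one
            closeTag();
            tag = tv[0];
            listener.startTag(tag);
        }
        if (child == null) listener.value(this, tv[1]);
        // Assumption of this sketch: a value that triggered a push is re-presented to the child.
        if (child != null) child.feed(tv[1]);
    }

    private void closeTag() {
        if (tag == null) return;
        if (child != null) { child.close(); child = null; } // the delegating tag has ended
        listener.endTag();
        tag = null;
    }

    void close() { closeTag(); listener.endRecord(); }
}

// Logs every event; for values of delegateTag it pushes a new pair instead of logging.
final class LoggingSubListener implements SubListener {
    private final String name;
    private final List<String> log;
    private final String delegateTag; // null means never delegate
    private final SubSplitter subSplitter;
    private final SubListener subListener;
    private String tag;

    LoggingSubListener(String name, List<String> log, String delegateTag,
                       SubSplitter subSplitter, SubListener subListener) {
        this.name = name; this.log = log; this.delegateTag = delegateTag;
        this.subSplitter = subSplitter; this.subListener = subListener;
    }

    public void startRecord() { log.add(name + ":startRecord"); }
    public void startTag(String t) { tag = t; log.add(name + ":startTag " + t); }
    public void value(SubContext ctx, String v) {
        if (tag.equals(delegateTag)) ctx.push(subSplitter, subListener);
        else log.add(name + ":value " + v);
    }
    public void endTag() { log.add(name + ":endTag"); }
    public void endRecord() { log.add(name + ":endRecord"); }
}

final class SubDocumentSketch {
    // Two-letter tag, value from column 5 (assumes lines of at least two characters).
    static final SubSplitter EMBL_LIKE = line -> new String[] {
        line.substring(0, 2), line.length() > 5 ? line.substring(5) : "" };
    // Invented "key=value" sub-document format.
    static final SubSplitter KEY_EQUALS = text -> {
        int i = text.indexOf('=');
        return i < 0 ? new String[] { text, "" }
                     : new String[] { text.substring(0, i), text.substring(i + 1) };
    };

    static List<String> run(List<String> lines) {
        List<String> log = new ArrayList<>();
        SubListener inner = new LoggingSubListener("inner", log, null, null, null);
        SubListener outer = new LoggingSubListener("outer", log, "SD", KEY_EQUALS, inner);
        SubFrame root = null;
        for (String line : lines) {
            if (line.startsWith("//")) { // record separator
                if (root != null) root.close();
                root = null;
                continue;
            }
            if (root == null) root = new SubFrame(EMBL_LIKE, outer);
            root.feed(line);
        }
        if (root != null) root.close(); // end of input also ends the record
        return log;
    }

    public static void main(String[] args) {
        run(Arrays.asList("ID   rec1", "SD   name=foo", "SD   size=3", "//"))
            .forEach(System.out::println);
    }
}
```

In this sketch the inner listener sees its own startRecord(), tag and value events for each "SD" line, and then endTag() followed by endRecord() before the outer listener's endTag(), mirroring the ordering described in the paragraph above.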

TagDelegator is a useful helper class that always delegates to a given parser and listener pair on a given tag.

Rewriting the Event Stream

Often while parsing, you will need to change tag names or modify values. In the simple case, all the tags and values will be String instances. You will probably want different types, such as the numeric objects (Double, Integer and their friends), or to instantiate your own objects from these Strings. Additionally, some values are themselves better represented as lists of more fundamental items. Several TagValueListener helper classes extend TagValueWrapper and allow you to configure a chain of event transducers while writing a minimal amount of code.

TagMapper remaps a subset of the tags it sees. For example, it could be configured to replace all "FOO_ID" tags with "accession.number".
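The idea of remapping tags while forwarding everything else untouched can be sketched as a small wrapper. TagRemapSketch and MapperSketchListener are hypothetical names with a reduced set of callbacks; they are not the TagMapper API.

```java
import java.util.HashMap;
import java.util.Map;

// Reduced, hypothetical listener interface: just enough to show tag remapping.
interface MapperSketchListener {
    void startTag(Object tag);
    void value(Object value);
}

// Wraps another listener and renames selected tags before forwarding them.
final class TagRemapSketch implements MapperSketchListener {
    private final MapperSketchListener delegate;
    private final Map<Object, Object> renames = new HashMap<>();

    TagRemapSketch(MapperSketchListener delegate) { this.delegate = delegate; }

    // Registers a replacement for one tag; unregistered tags pass through unchanged.
    void rename(Object from, Object to) { renames.put(from, to); }

    public void startTag(Object tag) { delegate.startTag(renames.getOrDefault(tag, tag)); }

    public void value(Object value) { delegate.value(value); }
}
```

After calling rename("FOO_ID", "accession.number") on such a wrapper, the wrapped listener would see "accession.number" wherever the parser reported "FOO_ID".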

ValueChanger intercepts the value() calls for specific tags and uses either a ValueChanger.Changer or ValueChanger.Splitter instance to replace or subdivide the value before passing it on to another listener.
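A rough sketch of the two kinds of rewrite follows: a changer that replaces a value with a new object, and a splitter that subdivides it. ValueRewriteSketch and its methods are hypothetical and built on java.util.function types rather than the real Changer and Splitter interfaces; the sketch also assumes that each piece produced by a splitter is forwarded as a separate value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical stand-in: rewrites values for selected tags, then forwards them.
final class ValueRewriteSketch {
    private final Map<String, Function<String, Object>> changers = new HashMap<>();
    private final Map<String, Function<String, List<?>>> splitters = new HashMap<>();
    private final BiConsumer<String, Object> next; // receives (tag, rewritten value)

    ValueRewriteSketch(BiConsumer<String, Object> next) { this.next = next; }

    // Replace every value of this tag with the changer's result.
    void setChanger(String tag, Function<String, Object> changer) { changers.put(tag, changer); }

    // Subdivide every value of this tag into several smaller values.
    void setSplitter(String tag, Function<String, List<?>> splitter) { splitters.put(tag, splitter); }

    void value(String tag, String raw) {
        if (splitters.containsKey(tag)) {
            // Assumption: each piece is passed on as its own value event.
            for (Object piece : splitters.get(tag).apply(raw)) {
                next.accept(tag, piece);
            }
        } else if (changers.containsKey(tag)) {
            next.accept(tag, changers.get(tag).apply(raw));
        } else {
            next.accept(tag, raw); // tags without a rule pass through untouched
        }
    }
}
```

For instance, a changer such as s -> Integer.valueOf(s.trim()) would turn a numeric String into an Integer, while a splitter such as s -> Arrays.asList(s.split(";\\s*")) would break a semicolon-separated keyword line into individual keywords.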