Talk:BioJava3 Proposal

Some points:

"There is no support for changing file formats. It supports one version or another, but cannot handle both."

Maybe it could use an XML converter utility: convert the data to XML, then use the utility to convert it to another format. XML is the standard for data exchange in most applications nowadays.

public Document toXML(File file, ConverterUtility.SOME_FORMAT_CONSTANT);

public File toFormat(Document doc, ConverterUtility.SOME_FORMAT_CONSTANT);

Something like this, or it could use the Abstract Factory design pattern... only a suggestion :-P

This is essentially how the existing BioJava works - by converting all files into a common BioJava object model, it can then convert them into other formats. The drawback is though that it is almost impossible to design an object model (or XML format) which can represent all formats without losing or restructuring any information. This can sometimes prevent complete lossless conversion. I'd prefer to see a system set up such that all files are stored in object models which exactly represent the original format, and for converters to directly translate between them. I think this is possible using annotations (as you suggest below) and bean conversion utilities (to be investigated). - RH 19/9/07

"We would aim to be fully J2EE compliant, with the majority of components fully reusable as a bean in any other application, just like Spring's components are."

The name "J2EE" is no longer in use; the term is now "Java EE" :-)

I've edited the page. Thanks for pointing this out! - RH 19/9/07

There's a "new" testing framework that could be used, TestNG [http://testng.org/doc/]

I'll investigate and have added the suggestion to the main page. - RH 19/9/07

I have not looked inside the code yet, but new frameworks are taking advantage of the @nnotation facility of Java 5; maybe it could be included. :-)

This will definitely be done! I've made it clearer that this is so on the main page. - RH 19/9/07

On Records, how about something like this:

  • A RecordSet is a Record and has Records and implements getRecordsIterator()
  • A Record is Formattable and Formatted
  • A RecordSource is a RecordSet and has a URL, implements a constructor RecordSource(URL, [Format])
  • A Formattable object implements toFormat(Format f) throws FormatException which calls f.format(this)
  • A Format has a FormatVersion, implements format(<Type> obj) for every <Type> of object it knows how to format, and also implements a constructor Format(<Type> obj) for every <Type> of object it can create (both throw FormatExceptions)
  • A Formatted object implements fromFormat(Format f) throws FormatException which calls new Format(this)
// Now for instance we can read a FASTA sequence file like this (SequenceList class must implement RecordSource):
SequenceList seq = new SequenceList("universal/resource/locator", FastaFormat.Singleton);
// Sequences can then be easily lazyloaded

// Might convert to Genbank like this:
seq.toFormat(GenbankFormat.Singleton);

Jflatow 12:33, 19 September 2007 (EDT)
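
To make the Formattable/Format relationship above concrete, here is a minimal sketch of the double dispatch it describes. Every name in it (FormatException, SequenceRecord, RecordSet, FastaFormat) is hypothetical and chosen only to mirror the bullet points; none of them is an existing BioJava class, and the URL-backed RecordSource and the fromFormat direction are left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// All names here are hypothetical; none are existing BioJava classes.
class FormatException extends Exception {
    FormatException(String message) { super(message); }
}

// A Format knows how to render each record type it supports.
interface Format {
    String format(SequenceRecord record) throws FormatException;
}

// A Formattable object hands itself to the Format (double dispatch).
interface Formattable {
    String toFormat(Format f) throws FormatException;
}

final class SequenceRecord implements Formattable {
    final String id;
    final String residues;

    SequenceRecord(String id, String residues) {
        this.id = id;
        this.residues = residues;
    }

    @Override
    public String toFormat(Format f) throws FormatException {
        return f.format(this);
    }
}

// A RecordSet has Records and is itself Formattable.
final class RecordSet implements Formattable, Iterable<SequenceRecord> {
    private final List<SequenceRecord> records = new ArrayList<>();

    void add(SequenceRecord r) { records.add(r); }

    @Override
    public Iterator<SequenceRecord> iterator() { return records.iterator(); }

    @Override
    public String toFormat(Format f) throws FormatException {
        StringBuilder out = new StringBuilder();
        for (SequenceRecord r : records) {
            out.append(r.toFormat(f));
        }
        return out.toString();
    }
}

final class FastaFormat implements Format {
    static final FastaFormat Singleton = new FastaFormat();
    private FastaFormat() {}

    @Override
    public String format(SequenceRecord r) throws FormatException {
        if (r.id.isEmpty()) {
            throw new FormatException("a FASTA record needs an identifier");
        }
        return ">" + r.id + "\n" + r.residues + "\n";
    }
}
```

In this sketch, supporting a new output format means writing one new Format class, while supporting a new record type means adding an overload to every Format.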

These are good ideas, however I'm a little unsure about the wisdom of embedding the conversion utilities into the Format objects. This would mean that if I created a new Format, I would potentially have to make modifications to all other Formats to add new converter methods. Ideally I'd like to have a system which is separate and is controlled simply by specifying a series of translation statements like 'attribute x in format A maps to attribute y in format B', probably in XML form. That way we separate conversion tools from Formats, and are able to write new mappings between different pairs of Formats without having to alter any Java code. There is a tool which does something like this already: Dozer - RH 20/9/07
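
As a rough illustration of keeping converters separate from the Formats, the sketch below registers each mapping in a standalone registry keyed by the (source, target) pair. It is written in plain Java rather than XML and does not use Dozer's API; ConverterRegistry, FastaEntry and GenbankEntry are hypothetical names, and the field mapping shown is only an example of an 'attribute x maps to attribute y' statement.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical format-specific models; each mirrors its own file format.
final class FastaEntry {
    final String description;
    final String sequence;
    FastaEntry(String description, String sequence) {
        this.description = description;
        this.sequence = sequence;
    }
}

final class GenbankEntry {
    final String definition;
    final String origin;
    GenbankEntry(String definition, String origin) {
        this.definition = definition;
        this.origin = origin;
    }
}

// Converters live here, not inside the format classes, so adding a new
// format never requires touching the existing ones.
final class ConverterRegistry {
    private final Map<String, Function<?, ?>> converters = new HashMap<>();

    private static String key(Class<?> from, Class<?> to) {
        return from.getName() + "->" + to.getName();
    }

    <A, B> void register(Class<A> from, Class<B> to, Function<A, B> converter) {
        converters.put(key(from, to), converter);
    }

    @SuppressWarnings("unchecked")
    <A, B> B convert(A source, Class<B> to) {
        Function<A, B> converter =
                (Function<A, B>) converters.get(key(source.getClass(), to));
        if (converter == null) {
            throw new IllegalArgumentException("no mapping from "
                    + source.getClass().getSimpleName() + " to " + to.getSimpleName());
        }
        return converter.apply(source);
    }
}
```

A mapping is then a single registration, for example `registry.register(FastaEntry.class, GenbankEntry.class, f -> new GenbankEntry(f.description, f.sequence));`, and a user-facing static conversion method could delegate to `registry.convert(entry, GenbankEntry.class)`.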

Understood, and the implementations of toFormat and fromFormat do not necessarily have to call a BioJava distributed Format to do their dirty work. For instance, if you want to add a new BioJava class and write a custom FASTA-like file, you could extend FastaFormat and implement only the conversion for your BioJava class. Then call:

SequenceListLike seq = new SequenceListLike("...", FastaFormatPlus.Singleton);
// also could use anonymous classes
seq.toFormat(new GenbankFormat() {
     public String format(SequenceListLike s) {
       ...
     }
});

Adding XML in here seems kind of scary to me. If one of the goals is to make it more user-friendly, I personally would like to see the overall design be more unified, so that adding functionality to any part of the library is always analogous to extending the Java class hierarchy (and I agree that no one should ever have to go back and change things that have already been implemented, unless they were implemented incorrectly). I think the Google Web Toolkit (GWT) is a good example of how this can be done really well. In fact, since I'm mentioning it, I actually think integrating with GWT would be a great way to provide unified GUI support. This would give BioJava a definite edge over other Bio* frameworks as far as developing web applications goes.

Finally, I think it's great that you've started this conversation and are hosting it publicly, keep up the good work!

Jflatow 09:03, 20 September 2007 (EDT)

The conversion mapping files/definitions/XML/whatever would not necessarily be visible to the user. The user would use a single set of static conversion methods in some class(es) which would internally delegate the work to a mapping/conversion tool appropriate to the task. Only BioJava developers would have to worry about the actual mapping definitions - users would be none the wiser. Your example code above is nice in that it encapsulates the transformation neatly and briefly, so we'd like the eventual system to be equally as easy to use. This may mean coming up with our own simplified mapping mechanism in pure Java rather than XML. The ability to easily define new custom formats and the transformations between them is something we need to make as easy as possible. Thanks for your suggestions! - RH 20/9/07.

--

Extend String: Some people on the list are suggesting the use of String to store DNA, RNA or protein data wherever possible instead of SymbolList or Sequence, because the String class has some facilities that are not found on SymbolList or Sequence. Would extending the String class to add some functionality be another way to work with bio data, or is it a bad idea?

Guedes 10:09, 20 September 2007 (EDT)

Nice idea but unfortunately String is declared a final class and cannot be extended. However what you suggest is interesting. We could, for instance, create a proxy to String that wraps a String instance internally yet also behaves as a SymbolList and does any necessary conversions only when absolutely necessary. - RH 20/9/07

True, String cannot be extended, but creating a wrapper around String really seems like a very good idea. Guedes 10:23, 20 September 2007 (EDT)
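
A minimal sketch of such a wrapper, assuming it implements java.lang.CharSequence so that standard library facilities such as regular expressions accept it directly. The class name SymbolString and its methods are hypothetical, the 1-based symbolAt is an assumption about biological coordinates, and a real version would also have to implement SymbolList.

```java
// Hypothetical wrapper: holds a String internally but can offer
// sequence-specific operations on top of it.
final class SymbolString implements CharSequence {
    private final String residues;

    SymbolString(String residues) {
        this.residues = residues;
    }

    @Override
    public int length() { return residues.length(); }

    @Override
    public char charAt(int index) { return residues.charAt(index); }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new SymbolString(residues.substring(start, end));
    }

    @Override
    public String toString() { return residues; }

    // Assumption: biological coordinates start at 1, unlike charAt.
    char symbolAt(int position) {
        return residues.charAt(position - 1);
    }

    // DNA-only example of a facility that String itself lacks.
    SymbolString reverseComplement() {
        StringBuilder sb = new StringBuilder(residues.length());
        for (int i = residues.length() - 1; i >= 0; i--) {
            sb.append(complement(residues.charAt(i)));
        }
        return new SymbolString(sb.toString());
    }

    private static char complement(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:  return 'N'; // anything unrecognised becomes N
        }
    }
}
```

Because the wrapper is a CharSequence, a call like `java.util.regex.Pattern.compile("ACG").matcher(new SymbolString("AACGT")).find()` works without first converting back to a String.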

What about SymbolBuffer, an equivalent of StringBuffer which would conveniently provide all the missing facilities when mutability is important? - George 10/10/2007

BioJava name: Some time in the past somebody (I guess it was Mark) proposed changing the project name from BioJava to something else because of the use of the word "Java" in it; the suggestion was a name like JBio or similar (I don't remember).

Is this still on the TODO list, or was it discarded?

Could a change of name be convenient at this moment, since BioJava3 will be a new version of BioJava?

Guedes 15:06, 20 September 2007 (EDT)

It was decided a while ago (informally) not to make any change to the project name, as to do so could be confusing. However, others have suggested starting a completely new project in parallel to BioJava for the development of BioJava3, and other suggestions have included refactoring the existing code base instead of trying to write a new one. These would all impact differently on naming decisions. - RH 21/9/07

How would the memory and computational overhead of the proposed RecordSource object model compare to that of the current Sequence/Feature object model? Currently, reading in all associated annotation from a 5Mbp bacterial genome can use a very, very large amount of heap space (>25Mb?). Lazy loading has the potential to avoid the problem, but how would one ensure fast access to arbitrary data elements stored in the file backing the record? For example, would there be a way to load a specific data element (e.g. GO code, db xref, or function) from an annotated feature without having to completely parse the backing file again? Could we use random access file I/O or memory mapped file I/O to skip directly to the relevant portion of the backing file? -- Aaron Darling 02/10/2007

Sounds like a neat idea. Maybe some kind of hierarchical parser would work, as most files have a chunk-like format (e.g. Genbank with a header chunk, references chunk, feature table chunk, and sequence chunk). An initial parse could mark out the four chunks, with each chunk being broken into sub-chunks as required (e.g. getFeatures() would break the feature table chunk into individual feature chunks, and a call on a method on each feature would cause that individual feature to be decomposed further). If using files this could be done with random access and seek(), storing coordinates inside the sequence object. Or, if using streams, it could store the raw data inside the sequence object. Weak references could be used to store decomposed objects to prevent memory overload - only objects actually in use by the user would remain parsed, others would be parsed on demand, if necessary repeatedly. This is definitely something to think about. -- RH 3/10/07
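
The chunk idea could look roughly like the sketch below: an initial pass records only the byte offset and length of each chunk, and the text is read with seek() on first use and held through a weak reference so it can be reclaimed and re-read later. LazyChunk is a hypothetical name, the chunk boundaries are assumed to be already known, and ASCII content is assumed for simplicity.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.lang.ref.WeakReference;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical chunk handle: stores coordinates, not content.
final class LazyChunk {
    private final Path file;
    private final long offset;
    private final int length;
    // Parsed text is only weakly held, so unused chunks can be collected.
    private WeakReference<String> cache = new WeakReference<>(null);

    LazyChunk(Path file, long offset, int length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
    }

    synchronized String text() {
        String cached = cache.get();
        if (cached != null) {
            return cached;
        }
        // Jump straight to the chunk instead of re-parsing the whole file.
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buffer = new byte[length];
            raf.seek(offset);
            raf.readFully(buffer);
            String loaded = new String(buffer, StandardCharsets.US_ASCII);
            cache = new WeakReference<>(loaded);
            return loaded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A feature-table chunk could hand out further LazyChunk instances for its individual features in the same way, which gives the hierarchical decomposition described above. Whether weak or soft references give the better memory behaviour is an open question this sketch does not settle.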

BioJava SPI: Some discussion was held on the mailing list about plugins and SPI architecture for biojava 3. Sun have released an [article] on how this can easily be achieved now that the ServiceLoader API has been exposed in Java SE 6 --Mark 07:50, 3 October 2007 (EDT)

Developers will expect the ServiceLoader API to work. However, this API does not allow new services to be offered at runtime. Also, there must be a built-in way to choose when several providers are available.

Particular implementations of a file/sequence format could be provided by querying the SPI. --George 10/10/2007
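
A minimal sketch of querying the SPI for a format implementation, using java.util.ServiceLoader. The SequenceFormatProvider interface and its priority() method are hypothetical; the priority is just one possible way to choose when several providers are available, not something the ServiceLoader API itself offers.

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical service interface that plugin JARs would implement and list
// in META-INF/services/SequenceFormatProvider.
interface SequenceFormatProvider {
    String formatName();

    // Assumption: higher number wins when two providers offer the same format.
    int priority();
}

final class FormatProviders {
    private FormatProviders() {}

    // Picks the highest-priority provider for the requested format.
    static Optional<SequenceFormatProvider> best(
            Iterable<SequenceFormatProvider> providers, String formatName) {
        SequenceFormatProvider best = null;
        for (SequenceFormatProvider p : providers) {
            if (p.formatName().equalsIgnoreCase(formatName)
                    && (best == null || p.priority() > best.priority())) {
                best = p;
            }
        }
        return Optional.ofNullable(best);
    }

    // Discovers whatever providers are on the classpath right now.
    static Optional<SequenceFormatProvider> lookup(String formatName) {
        return best(ServiceLoader.load(SequenceFormatProvider.class), formatName);
    }
}
```

With no provider JARs on the classpath, `FormatProviders.lookup("fasta")` simply returns an empty Optional.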

JAF for data manipulation

One of my goals for biojava3 would be the ability to manipulate and integrate a vast diversity of data and data types. An interesting possibility would be to make use of something like the [JavaBeans Activation Framework].

The description of this API says: With the JavaBeans Activation Framework standard extension, developers who use Java technology can take advantage of standard services to determine the type of an arbitrary piece of data, encapsulate access to it, discover the operations available on it, and to instantiate the appropriate bean to perform said operation(s). For example, if a browser obtained a JPEG image, this framework would enable the browser to identify that stream of data as an JPEG image, and from that type, the browser could locate and instantiate an object that could manipulate, or view that image.

While this was obviously intended as a way to deal with MIME types etc., it should be possible to enable some functionality whereby a BioJava program can recognize a Fasta object or a Blast result or a MicroArray image etc. and instantiate the appropriate Beans with the appropriate methods. --Mark 04:08, 7 October 2007 (EDT)

There were some attempts in the past to have chemical MIME types [1]. Biojava3 could be the starting point for biology-oriented MIME types (and more logical ones too, e.g. MIME types for text-based formats should start with "text/", as in "text/biology-seq-fasta", so that they would be recognized by text editors etc.). - George 09/10/07

MIME types are a core part of JAF and are referenced in all its interfaces. As much biological data is distributed as .txt files, the default JAF DataSource implementations would only ever pick up a MIME type of 'application/text' at best. The default implementations get their MIME types from mapping filename extensions using a file in the JAR called 'mailcap', or from the MIME parts of an email, or otherwise it gets them direct from a webserver via the HTTP protocol. We'd have to write an entirely new DataSource for use with biological data files that instead does some kind of guessing that doesn't cause any data in the stream to be lost (e.g. by reading a few bytes then not being able to rewind). Not easy, but not impossible. The hardest part would be the format guessing. -- RH 10/10/07.
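
On the point about not losing data while guessing: one standard way is to peek at the start of the stream with mark() and reset(), as in the sketch below. FormatGuesser is a hypothetical name and this is not a JAF DataSource; the two checks rely only on FASTA records starting with '>' and GenBank flat files starting with "LOCUS", and real guessing would need many more rules.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class FormatGuesser {
    // How many leading bytes to look at before deciding.
    private static final int PEEK_LIMIT = 64;

    private FormatGuesser() {}

    // Returns a hypothetical format label; the stream is left at its
    // original position so a parser can still read every byte.
    static String guess(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException(
                    "stream must support mark/reset, e.g. a BufferedInputStream");
        }
        in.mark(PEEK_LIMIT);
        byte[] head = new byte[PEEK_LIMIT];
        int count = 0;
        while (count < head.length) {
            int read = in.read(head, count, head.length - count);
            if (read < 0) {
                break; // end of stream before the limit
            }
            count += read;
        }
        in.reset(); // rewind: nothing has been consumed

        String start = new String(head, 0, count, StandardCharsets.US_ASCII);
        if (start.startsWith(">")) {
            return "fasta";
        }
        if (start.startsWith("LOCUS")) {
            return "genbank";
        }
        return "unknown";
    }
}
```

Any InputStream can be made to qualify by wrapping it in a java.io.BufferedInputStream before calling guess().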

Singletons, Beans and JNDI

There is a lot of discussion about the various merits of beans, singletons and serialization. Beans tend to serialize and de-serialize quite well without special consideration. But beans with public constructors make for bad singletons. Let's face it: although singletons are a pain, they are one of the most elegant aspects of the Symbol package and heavily influence DP, dist and Sequence.

So would it be possible to get the best of both worlds with [JNDI]? This is the mechanism by which JEE allows for Singleton or Singleton-like behavior with beans (usually EJB but not always). I think BioJava stuff like the AlphabetManager (well, parts of it) could be delegated to a JNDI provider.

Normally JNDI comes as part of an app server like Tomcat or an enterprise server like JBoss. However, there are standalone implementations like [Naming], which is part of Apache Commons DBCP, that could be used in the absence of a server of some kind. Sun also provides JNDI independently of the full JEE5, making use of a service provider interface (SPI). BioJava's registries of objects could be plugged in as an SPI. --Mark 04:22, 7 October 2007 (EDT)

JNDI would be cool and would certainly be a viable solution for the singleton problem for alphabet symbols. It would also make BioJava accessible to users of farms and other distributed computation systems which could use this to share singletons across multiple machines. We'd have to think carefully about the default provider for alphabets+symbols - ie. which package it should be based on (or should we write our own simple implementation), and whether it should be modifiable by the user, and if so how persistent those modifications are. -- RH 10/10/07


One Test Per Class

We would write a JUnit test for every single class, writing the test first then the class afterwards.

I think comprehensive unit tests are a great idea and I am a strong proponent of test-driven design (TDD), so it pleases me to see the test-first phrase here. However, one test per class is not always a good organization (incidentally, I understand this to mean one test class per class). Quoting from Dave Astels' Test-Driven Development: A Practical Guide, pp. 74-75:

"Let's begin by considering TestCase. It is used to group related tests together. But what does related mean? It is often misunderstood to mean all tests for a specific class or specific group of related classes. This misunderstanding is reinforced by some of the IDE plugins that will generate a TestCase for a specified class, creating a test method for each method in the target class. These test creation facilities are overly simplistic at best, and misleading at worst. They reinforce the view that you should have a TestCase for each class being tested, and a test for each method in those classes. But that approach has nothing to do with TDD, so we won't discuss it further.

This structural correspondence of tests misses the point. You should write tests for behaviors, not methods. A test method should test a single behavior.

TestCase is a mechanism to allow fixture reuse. Each TestCase subclass represents a fixture, and contains a group of tests that run in the context of that fixture. A fixture is the set of preconditions and assumptions with which a test is run. It is the runtime context for the test, embodied in the instance variables of the TestCase, the code in the setUp() method, and any variables and setup code local to the test method."

I'm not trying to appeal to authority here; it's just that a) Dave is the one who helped me past the notion of one test class per class; and b) he explains it very clearly. Yes, let's have TDD; yes, let's have comprehensive unit test coverage; but no - don't force the test cases into an awkward structure. Some classes will require multiple test classes, and some test classes may cover multiple classes. --Carl Manaster 12:10, 8 November 2007 (EST)

The following points were raised by users on biojava-dev, so I'm copying them here for completeness:

  • Immutable classes, not beans, are the safest and most efficient.
  • Statements like "We would adhere rigidly to a common coding style and heavily comment the code." only work if there is a way to measure and enforce it. A better way to say this would be something like "All committed code must have zero style errors as measured by Checkstyle with our local checkstyle configuration, have zero coding errors as measured by FindBugs with our local findbugs configuration, have 95% unit test coverage as measured by Cobertura", and so on.
  • Sequence features ought to align with the Sequence Ontology (SO/SOFA). The two top-level sequence feature types are Region (SO:0000001, "A sequence_feature with an extent greater than zero.") and Junction (SO:0000699, "A sequence_feature with an extent of zero.").
  • Is there a role for biojava in any of:
    • Ontologies in OBO format
    • Ontologies in OWL format
    • BioPAX
    • MAGE-ML/SOFT/MINiML
    • FuGO/OBI
    • SBML/CellML
    • Web services/BioMoby
    • etc.
  • I have created a small example of how a biojava 3 might be built in a more modular fashion using maven2.
  • W3C Semantic Web Health Care and Life Sciences Interest Group - The group might welcome help implementing their recommendations and best practices. Plenty of interesting discussion at any rate.

-- RH 9/11/07

There should be plenty of HOW-TOs to accomplish many of the tasks BioJava is designed to do. --Golharam 20:15, 2 January 2008 (EST)

Sequence-Focused

What the heck does this mean:

*  It is sequence-focused. Users have moved on. 

I use BioPerl to compare and manipulate DNA sequences. Should I skip BioJava3? This should be either clarified or removed.

This should have been made clearer - BioJava is so heavily concentrated on the concept of sequence that it is impossible to do any work with it without somehow invoking the concept. For example you can't analyse features without also loading the associated sequences, and you can't do database searches for identifiers without also loading the sequences those identifiers belong to. Users need to be able to do these things without the overhead of loading unnecessary sequence data. Previously it didn't matter as the sequence data was often more important than the metadata around it, but much research now is on the metadata rather than the sequence data. BioJava 3 will naturally still support sequence tasks but it will no longer insist that everything you do is somehow associated with a sequence. -- RH 8/5/08.


BioJava 1.6 already contains other packages that are not necessarily sequence related. E.g. the protein structure package provides a set of useful tools when working with PDB data. --Andreas 13:33, 8 May 2008 (UTC)