Google Summer of Code 2010


The Open Bioinformatics Foundation is applying to participate in the Google Summer of Code.

We are accepting applications for BioJava projects. If you want to propose a project, have a look at the page for areas that are currently under development.

Please read the GSoC page at the Open Bioinformatics Foundation and the main Google Summer of Code page for more details about the program.



BioJava offers the following Google Summer of Code projects:

All-Java Multiple Sequence Alignment (MSA)
Develop an all-Java implementation of a multiple sequence alignment algorithm.

Rationale : Multiple sequence alignment is a frequently performed task in sequence analysis, with the goal of identifying new members of protein families and inferring phylogenetic relationships between proteins and genes. At present there is no Java-only implementation of such an algorithm. Existing Java-based bioinformatics tools and web sites would benefit from one, and end users could perform sequence analysis more easily. BioJava already contains implementations of pairwise alignment and tools to create phylogenetic trees. This project will combine these tools to create a new multiple sequence alignment implementation.

Approach : The multiple sequence alignment algorithm will consist of 3 steps:

:# Pairwise sequence alignments of all sequences will be calculated. BioJava already contains code for this. This code needs to be updated to be compliant with the new BioJava 3.

:# The results of the pairwise alignments are used to build up a distance matrix. This matrix is used to construct a tree using the Neighbor Joining Algorithm.

:# Apply a strategy similar to CLUSTALW to progressively build up the multiple alignment. Align the most closely related sequences first, then extend the alignment to incorporate more distantly related sequences. Apply sequence weighting to correct for closely related sequences, and apply residue-specific gap penalties.
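
Steps 1 and 2 can be illustrated with a small, self-contained sketch. It is not BioJava code: the class `GuideTreeSketch` and its methods are hypothetical names chosen for illustration. It assumes each pairwise alignment is available as two gapped strings of equal length, uses the fraction of mismatching columns as the distance, and applies the standard Neighbor Joining criterion Q(i,j) = (n-2)·d(i,j) - r(i) - r(j), where r(i) is the sum of row i, to choose the first pair of sequences to join.

```java
/** Hypothetical helper for illustration only; not part of BioJava. */
class GuideTreeSketch {

    /**
     * Distance between the two rows of one pairwise alignment, taken here
     * (as an assumption) to be the fraction of columns whose characters differ.
     * Columns that pair a residue with a gap count as mismatches.
     */
    static double distance(String alignedA, String alignedB) {
        if (alignedA.length() != alignedB.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        if (alignedA.isEmpty()) {
            return 0.0;
        }
        int mismatches = 0;
        for (int i = 0; i < alignedA.length(); i++) {
            if (alignedA.charAt(i) != alignedB.charAt(i)) {
                mismatches++;
            }
        }
        return (double) mismatches / alignedA.length();
    }

    /**
     * One Neighbor Joining selection step: returns the indices {i, j}, i < j,
     * that minimise Q(i,j) = (n-2)*d[i][j] - rowSum[i] - rowSum[j].
     * The matrix d is assumed to be symmetric with a zero diagonal.
     */
    static int[] nextPairToJoin(double[][] d) {
        int n = d.length;
        if (n < 3) {
            throw new IllegalArgumentException("need at least three sequences");
        }
        // Total distance from each sequence to all others.
        double[] rowSum = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                rowSum[i] += d[i][j];
            }
        }
        int bestI = 0;
        int bestJ = 1;
        double bestQ = Double.POSITIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double q = (n - 2) * d[i][j] - rowSum[i] - rowSum[j];
                if (q < bestQ) {
                    bestQ = q;
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        return new int[] {bestI, bestJ};
    }
}
```

The pair with the smallest raw distance is not always the pair selected, because the criterion also subtracts each sequence's total distance to all others. A full implementation would repeat this step, replacing the joined pair with a new internal node, until the tree is complete.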

Challenges : The project requires combining a number of existing tools into a single solution. A successful student will have prior experience in Java software development and will need to learn and modify various tools already provided by BioJava. Step 3 probably carries the most risk, so a first implementation will use a straightforward approach to building up the MSA progressively. If time remains, more advanced rules can be implemented.

Involved toolkits or projects : Core, Alignment and Phylogeny modules of BioJava3

Degree of difficulty and needed skills : Difficult. Interested students should have a general knowledge of alignment algorithms and experience in Java-based software development.

Mentor: Andreas Prlic, Co-Mentors: Scooter Willis, Kyle Ellrott

Student: Mark Chapman

Project overview, timeline, and updates: Improvements including Multiple Sequence Alignment Algorithms

Identification and Classification of Posttranslational Modification of Proteins
Develop a Posttranslational Modification package for the BioJava project.

Rationale : Posttranslational modifications (PTMs) [1] are chemical modifications of, or additions to, amino acids in protein chains that occur after protein biosynthesis and modulate protein function. These PTMs are present in the 3D structures of the Protein Data Bank. Users frequently ask to query or classify proteins by their PTMs. The goal of this project is to develop a BioJava package that first identifies these modifications and then classifies them by type of PTM. A controlled vocabulary will be used to annotate PTMs uniquely. For glycosylated proteins, the linkage patterns will be established and presented as linear text or 2D graphical representations following the guidelines of the Consortium for Functional Glycomics [2].

Approach : The PTM identification and classification will include the following steps:

:# Establish a list of known PTMs and write code to locate these PTMs in a 3D protein structure.

:# Determine the protein residues that carry PTMs based on distance thresholds.

:# Traverse the sugar molecules and establish their link pattern based on connectivity.

:# Present the PTMs as text in a linear notation and, if time permits, as 2D graphical representations.
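
Steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation and not BioJava's structure API: the `PtmSketch` class, its nested `Atom` type, and the residue identifiers are hypothetical, and the distance cutoff is a caller-supplied parameter rather than a recommended value. Two groups are treated as linked when any pair of their atoms lies within the cutoff, and the sugar chain is walked breadth-first from the sugar attached to the protein.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of BioJava. */
class PtmSketch {

    /** Minimal stand-in for an atom: owning residue ID plus coordinates. */
    static final class Atom {
        final String residueId;
        final double x, y, z;

        Atom(String residueId, double x, double y, double z) {
            this.residueId = residueId;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Euclidean distance between two atoms. */
    static double distance(Atom a, Atom b) {
        double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** Step 2: protein residues with an atom within cutoff of any modification atom. */
    static List<String> modifiedResidues(List<Atom> protein, List<Atom> modification,
                                         double cutoff) {
        Set<String> hits = new LinkedHashSet<>(); // keeps input order, drops duplicates
        for (Atom p : protein) {
            for (Atom m : modification) {
                if (distance(p, m) <= cutoff) {
                    hits.add(p.residueId);
                }
            }
        }
        return new ArrayList<>(hits);
    }

    /** True if any atom of residue r1 lies within cutoff of any atom of residue r2. */
    static boolean linked(List<Atom> atoms, String r1, String r2, double cutoff) {
        for (Atom a : atoms) {
            if (!a.residueId.equals(r1)) {
                continue;
            }
            for (Atom b : atoms) {
                if (b.residueId.equals(r2) && distance(a, b) <= cutoff) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Step 3: breadth-first walk over sugar residues, starting at the given root. */
    static List<String> traverse(List<Atom> sugars, String root, double cutoff) {
        Set<String> ids = new LinkedHashSet<>();
        for (Atom a : sugars) {
            ids.add(a.residueId);
        }
        List<String> order = new ArrayList<>();
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        seen.add(root);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            order.add(current);
            for (String next : ids) {
                if (!seen.contains(next) && linked(sugars, current, next, cutoff)) {
                    seen.add(next);
                    queue.add(next);
                }
            }
        }
        return order;
    }
}
```

The visiting order returned by `traverse` is only a starting point for a linear notation; representing branched glycans would additionally require recording which parent each sugar was reached from.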

Challenges : Learn how to apply algorithms to problems in structural bioinformatics. Develop an object-oriented data representation of PTMs. Apply good software engineering practices.

Involved toolkits or projects : BioJava3, Eclipse IDE

Degree of difficulty and needed skills : Difficult. Interested students should have a general knowledge of chemistry and biology, and in particular protein structures, and experience in Java-based software development. Experience with Java Swing would be a plus.

Mentor: Peter Rose

Student: Jianjiong Gao

More information