Groovy is a powerful scripting language that allows you to leverage a large number of Java libraries. This series of articles describes how you can combine Groovy with bioinformatics-related libraries to solve a variety of problems.

Traditionally, Perl has been seen as the language of choice for bioinformaticians. More recently, Python has been gaining ground. Both of these languages owe their popularity in large part to the BioPerl and BioPython libraries, which let users perform a wide variety of tasks.

In a recent blog posting on 123Bioinformatics a variety of reasons were cited for using Perl to address bioinformatics problems. As I reviewed the list, I found myself replacing the word “perl” with “groovy”.

Perl scripts make string processing easy when working with biological data such as genome or protein sequences. You can use the java.util.regex package (part of the standard Java library) to parse files using regular expressions. Or you can use the Apache ORO library to support Perl, awk and sed regular expressions for processing strings. In addition, BioJava does a great job of processing different types of sequence files.
File handling is easy in Perl. Ditto.
Perl regular expressions are very flexible and make it easy to match similar patterns rather than identical ones. They can be used in cases like matching a motif or a repeat in a sequence. Ditto.
There are no strict rules for writing Perl scripts, unlike in other languages. That makes it easy for a biologist to learn Perl in a short period. You can use either Java syntax or Groovy syntax to create Groovy scripts. This makes it easy for someone familiar with Java to quickly pick up Groovy, or for a complete neophyte to learn it as well.
Perl scripts can be combined with SHELL scripts for text processing. Groovy scripts can also be combined with shell scripts.
Using Perl CGI and HTML, one can develop web pages. Perl CGI is very similar to Perl scripts. The Grails project gives you a Ruby on Rails-style model for developing web sites.
CPAN contains hundreds of Perl modules which are specific to sequence analysis, e.g. FASTAParse, Peptide::Pubmed. Any Java or native library can be used from within Groovy. This makes it possible to use BioJava, Emboss, R, or virtually any other toolkit.
Perl can also be used for system administration. You can create Groovy shell scripts, just as you would Perl scripts.
The Perl Template Toolkit is another Perl product which can be used for developing advanced web pages. Groovy has GSP (Groovy Server Pages).
Using Perl DBIx, it is easier to pass MySQL data (back end) to the web page (front end). You can make use of any JDBC driver to access data from a database. These drivers are cross-platform, so your code stays completely portable. You can even use the Derby database that ships with the Java 6 JDK (as Java DB) without having to install anything extra.
Processing or parsing an HTML file is very easy using CPAN modules. You can use the standard XML parsing libraries (the javax.xml packages) to parse XML, and HTML too as long as it is well-formed (XHTML).
File type conversion is possible in Perl using CPAN modules, e.g. Doc to PDF, HTML to PDF, etc. The Apache POI library provides a variety of tools for reading and writing Microsoft Office documents. You can use Apache FOP to produce PDF from HTML, once the HTML has been transformed into XSL-FO.
By using the PerlMagick module we can do image processing. Groovy has a variety of similar options. The javax.imageio library provides a good starting point for image manipulation, and you can also add the ImageMagick Java wrapper.
The Perl::Critic module will help you write better Perl code by critiquing your code structure. Groovy plugins are available for most IDEs, making it easy to learn the language and take advantage of some of the specialized Groovy libraries.
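
To make the regular-expression point above a little more concrete, here is a minimal sketch of motif matching with java.util.regex. It is written in plain Java syntax, and the same java.util.regex classes can be called from a Groovy script. The class name MotifFinder, the findMotif helper, the example sequence and the TATA-like pattern are all made up for illustration; they are not taken from BioJava or any other library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MotifFinder {

    // Hypothetical helper: returns the 0-based start offset of every
    // non-overlapping match of motifRegex in the sequence.
    static List<Integer> findMotif(String sequence, String motifRegex) {
        List<Integer> positions = new ArrayList<>();
        // CASE_INSENSITIVE so that lower-case (soft-masked) sequence also matches.
        Matcher matcher = Pattern.compile(motifRegex, Pattern.CASE_INSENSITIVE)
                                 .matcher(sequence);
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Illustrative pattern: TATA, then A or T, then A, then A or T.
        // It matches similar sites rather than one identical string.
        String sequence = "GGCTATAAAAGGCCTATATATGC";
        System.out.println(findMotif(sequence, "TATA[AT]A[AT]")); // prints [3, 14]
    }
}
```

The character classes are what let one pattern find both TATAAAA and TATATAT, which is the "similar rather than identical" matching described in the list.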

But with the release of Java 6 and its direct support for scripting languages, the Java platform is becoming an interesting choice for bioinformaticians. You get the benefit of great performance and support for a wide variety of operating systems. You also have large libraries that you can use to tackle a wide variety of tasks.

This gives the bioinformatician or computational biologist the ability to create scripts that can be scaled into larger applications. This inspired me to write a series of blog entries which demonstrate some of the bioinformatics capabilities of Groovy. For most of the examples in this series, though, I’m going to focus on the everyday grunt-work that goes into genomic research, and leave application building to another day.


About aspenbio

I write software for scientists. I'm interested in Java/Groovy/Grails, the Semantic Web and Cancer Biology.
This entry was posted in Bioinformatics, Informatics.
