Parsing Data Files With BioGroovy

In a previous post we showed how you can use BioGroovy to download a data file from a web service. In this post, we’ll show you how to parse the file that you’ve downloaded.

After handling all of the imports (shown in the first block of code), the real work starts with the EntrezGeneSlurper.  We create a file reference to the datafile we downloaded earlier, and then call the fetcher.read() method to read the datafile and print the results out to the console.

@Grab(group='org.biogroovy', module='biogroovy', version='1.1')
import org.biogroovy.io.eutils.EntrezGeneSlurper
import org.biogroovy.models.Gene


EntrezGeneSlurper fetcher = new EntrezGeneSlurper();
File file = new File("/path/to/file.gene.xml");
Gene gene = fetcher.read(new FileInputStream(file));
println gene;


Fetching Data Files With BioGroovy

It’s often useful to fetch data from a web service and cache it in your local directory. This cuts down on the amount of traffic between your machine and the web service you’re invoking, and it makes unit testing your code easier. In this example, we’ll fetch data for 3 genes and write it out as XML files.

@Grab(group='org.biogroovy', module='biogroovy', version='1.1')
import org.biogroovy.io.eutils.EntrezGeneSlurper
import org.biogroovy.models.Gene

EntrezGeneSlurper slurper = new EntrezGeneSlurper()
File dir = new File(System.getProperty('user.home'), 'biogroovy_test_files')
dir.mkdirs()
slurper.fetchAsFile(dir, "675,1034,133")

// double-check that the files were written out
dir.list().each { println it }
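The cache-to-disk pattern that fetchAsFile relies on is easy to see in plain code. Here's a minimal sketch in Java; fetchCached and the simulated fetcher are hypothetical stand-ins, not BioGroovy's actual API. The idea: call the service only when no local copy exists, otherwise read from disk.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

public class CachedFetch {

    // Return the cached copy if one exists; otherwise invoke the fetcher,
    // write the result to disk, and return it. 'fetcher' stands in for the
    // real web service call.
    static String fetchCached(Path dir, String id, Function<String, String> fetcher) throws IOException {
        Path cached = dir.resolve(id + ".gene.xml");
        if (Files.exists(cached)) {
            return Files.readString(cached);      // cache hit: no network traffic
        }
        String content = fetcher.apply(id);       // cache miss: call the service
        Files.createDirectories(dir);
        Files.writeString(cached, content);       // keep a local copy for next time
        return content;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("biogroovy_test_files");
        // Simulated fetch; a real implementation would hit the web service.
        Function<String, String> fake = id -> "<gene id='" + id + "'/>";
        String first = fetchCached(dir, "675", fake);
        String second = fetchCached(dir, "675", fake);  // served from disk this time
        System.out.println(first.equals(second));       // prints true
    }
}
```

On the second call nothing would hit the network, which is exactly what makes unit tests against cached files fast and repeatable.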

Groovy vs Python

At a recent Informatics Lunch meeting, surrounded by a host of dedicated Python users, I confessed to being somewhat ignorant of the business drivers behind using Python in a bioinformatics setting. Although I have used Python on occasion, it was usually to solve a customer’s problem, and it was not my first choice when reaching into the toolbox. My feeling is that Python became the de facto standard for bioinformatics scripting more as a historical fluke than as a result of duking it out against other languages for the top of the technology pile.

Oddly enough, much of that conversation centered on Python’s shortcomings rather than its strengths. These included the Python 2 vs. 3 backwards-incompatibility problem, and the fact that compiling Python to native code is tricky and requires you to keep platform-specific differences in mind. Luckily, neither of these problems has been inflicted on Java or Groovy users.

Some of the typical use cases cited for Python include:

It’s scriptable

Groovy has always been scriptable. The Groovy Console lets you compose and execute scripts, and Groovy code compiles to Java bytecode, so it runs on any JVM. As of Java 9, Java itself is scriptable too, via the JShell REPL.

I can read data files with it

The Apache POI library provides support for reading and writing Microsoft Office files like PowerPoint, Word and Excel. Here’s a simple example that shows you how to iterate through the cells in an Excel spreadsheet.

import org.apache.poi.ss.usermodel.*

// minimum number of columns to visit, even when a row is shorter
int MY_MINIMUM_COLUMN_COUNT = 5

// read the Excel file
Workbook wb = WorkbookFactory.create(new File("MyExcel.xls"));

// get the first sheet
Sheet sheet = wb.getSheetAt(0);

// Decide which rows to process
int rowStart = Math.min(15, sheet.getFirstRowNum());
int rowEnd = Math.max(1400, sheet.getLastRowNum());

for (int rowNum = rowStart; rowNum < rowEnd; rowNum++) {
    Row r = sheet.getRow(rowNum);
    if (r == null) {
        // This whole row is empty; handle it as needed
        continue;
    }

    int lastColumn = Math.max(r.getLastCellNum(), MY_MINIMUM_COLUMN_COUNT);

    for (int cn = 0; cn < lastColumn; cn++) {
        // note: newer POI releases use Row.MissingCellPolicy.RETURN_BLANK_AS_NULL
        Cell c = r.getCell(cn, Row.RETURN_BLANK_AS_NULL);
        if (c == null) {
            // The spreadsheet is empty in this cell
        } else {
            // Do something useful with the cell's contents
        }
    }
}

I can query databases with it

Groovy can make use of any JDBC data source, NoSQL database, graph database, or cloud database.  Here’s a simple example that shows you how to execute a JDBC query and iterate through the results.

// grab the HSQLDB driver (version is illustrative) and Groovy's Sql support
@Grab('org.hsqldb:hsqldb:2.4.0')
import groovy.sql.Sql

// create a database connection to an in-memory HSQLDB database
def db = [url: 'jdbc:hsqldb:mem:testDB', user: 'sa', password: '', driver: 'org.hsqldb.jdbc.JDBCDriver']
def sql = Sql.newInstance(db.url, db.user, db.password, db.driver)

// query a table called 'PROJECT'; the (2, 2) arguments page the results,
// starting at row 2 and returning at most 2 rows
sql.eachRow('select * from PROJECT', 2, 2) { row ->
    println "${row.name.padRight(10)} ($row.url)"
}

I can do text mining with it

The WEKA library is a general-purpose machine learning toolkit with solid support for text mining. Since the library is Java-based, we can easily add it to a project and call it from a script. There are plenty of examples that show how to use WEKA here.
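As a taste of what text-mining preprocessing involves, here is a JDK-only sketch of building a term-frequency map for a document, the same bag-of-words representation that WEKA's StringToWordVector filter produces for a whole corpus. The class below is illustrative, not part of WEKA:

```java
import java.util.Map;
import java.util.TreeMap;

public class TermFrequency {

    // Build a word -> count map for one document: the feature representation
    // a text-mining pipeline feeds to a classifier.
    static Map<String, Integer> termFrequencies(String document) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : document.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf =
            termFrequencies("BRCA1 regulates DNA repair; DNA repair fails without BRCA1.");
        System.out.println(tf);
        // prints {brca1=2, dna=2, fails=1, regulates=1, repair=2, without=1}
    }
}
```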

I can graph data with it

JavaFX, the Java UI library, includes a variety of charting components: pie charts, line charts, area charts, and more. The subject is a little beyond the scope of this post, but Oracle provides some great tutorials on it here. And here’s an example from StackOverflow that shows how to write a chart out as a PNG image without displaying it.

Conclusion

These are just some of the use cases that were cited during the meeting. So I’d like to throw it out to the audience: if you have additional use cases for Python, I’d love to hear about them in the comments section.

BD2K – The NIH’s Big Data To Knowledge Effort

The topic at a recent Bioinformatics lunch discussion was the NIH’s Big Data To Knowledge (BD2K) program. The BD2K mission statement (below) gives you some idea of where the program is heading…

BD2K is a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement.

The BD2K initiative addresses four major aims that, in combination, are meant to enhance the utility of biomedical Big Data:

  • To facilitate broad use of biomedical digital assets by making them discoverable, accessible, and citable.
  • To conduct research and develop the methods, software, and tools needed to analyze biomedical Big Data.
  • To enhance training in the development and use of methods and tools necessary for biomedical Big Data science.
  • To support a data ecosystem that accelerates discovery as part of a digital enterprise.

Overall, the focus of the BD2K program is to support the research and development of innovative and transforming approaches and tools to maximize and accelerate the integration of Big Data and data science into biomedical research. [BD2K Mission Statement]

But as the interview below with project director and noted bioinformaticist Phil Bourne indicates, the scope of the initiative spans the entire research life cycle (ideas, hypotheses, experiments, data, analysis, comprehension, dissemination). In concrete terms, that means tools for authoring, collecting, analyzing, and visualizing data. In part, the project hopes to address some of the issues around reproducibility, discoverability, and provenance of data and software that have impeded the ability of industry to leverage academic research. [Assessing Credibility Via Reproducibility] [Reproducibility and Provenance]

The Commons is a program within the BD2K initiative that seeks to improve the discoverability of data, and to provide open APIs (for data and tools), unique IDs for research objects, and containers for packaged applications running in cloud and HPC environments.

My goal at this month’s meeting was to learn more about the program from some of the local participants.

Ben Good from the Su Lab at The Scripps Research Institute described some of the work they were doing. Earlier this year, they held a Network of Biothings/BD2K Hackathon to kickstart some projects.

Chunlei Wu, also of the Su Lab, is developing a “Community Platform for Data Wrangling of Gene and Genetic Variant Annotations”.

At the PDB, Peter Rose has been working on a compression technique for 3D protein structures as part of the BD2K’s Targeted Software Development program.  This technique makes it possible to stream complex 3D structures in the same way that you might stream a YouTube video.


Fetching Data With BioGroovy

With BioGroovy 1.1 we’ve added support for fetching data from a variety of RESTful web services, including EntrezGene, PubMed, UniProt, and many others.

Fetching Data
In this example, we’ll fetch gene information from EntrezGene for 3 genes, and output the result to the console.

@Grab(group='org.biogroovy', module='biogroovy', version='1.1')
import org.biogroovy.io.eutils.EntrezGeneSlurper
import org.biogroovy.models.Gene

EntrezGeneSlurper slurper = new EntrezGeneSlurper()

println "Gene"
println "Symbol\tEntrezGeneID\tName"

List<Gene> geneList = slurper.fetchAll('675,1034,133')
geneList.each { Gene gene ->
    println "${gene.symbol}\t${gene.entrezGeneId}\t${gene.name}"
}

In the slurper.fetchAll() method call, we pass a string containing a comma-delimited list of EntrezGene IDs. This returns a list of Gene objects, which we then iterate through, printing the results to the console.
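Under the hood, a slurper like this presumably issues an NCBI E-utilities efetch request. Here's a sketch of building that request URL in Java; the efetch endpoint and its db, id, tool, and email parameters are NCBI's documented interface, but whether BioGroovy assembles the URL exactly this way is an assumption:

```java
import java.util.List;

public class EutilsUrl {

    // Build an NCBI efetch URL for a batch of Entrez Gene IDs.
    // tool/email identify the caller, per NCBI's usage guidelines.
    // (Assumption: BioGroovy's actual request construction may differ.)
    static String efetchUrl(List<String> geneIds, String tool, String email) {
        String ids = String.join(",", geneIds);
        return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
                + "?db=gene&retmode=xml"
                + "&id=" + ids
                + "&tool=" + tool
                + "&email=" + email;
    }

    public static void main(String[] args) {
        // the same three genes as the script above
        System.out.println(efetchUrl(List.of("675", "1034", "133"),
                "biogroovy", "goofy@disney.com"));
    }
}
```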


BioGroovy and Web Service Identity

With the recent release of BioGroovy 1.1, we added support for a number of web services. Web service providers like NCBI’s eUtils, EntrezAjax and JournalTOCs have requirements for tracking usage of their services.  In some cases, a token must be passed along with each request that identifies the tool making the request, or a specific user email address.

To support this type of interaction, a BioGroovyConfig class was added.  In this example, we’ll see how the EntrezGeneSlurper class takes advantage of BioGroovyConfig.  In the constructor for EntrezGeneSlurper you’ll see the following snippet:


ConfigObject conf = BioGroovyConfig.getConfig();
this.tool = conf.eutils.tool
this.email = conf.eutils.email

The first line looks for a biogroovy.conf file in your ~/.biogroovy directory. If it doesn’t find one, it copies the default configuration into that directory and throws an exception, letting the user know that the biogroovy.conf file still needs to be configured. The default configuration file does not contain any real identity information, so it must be updated with real values before it can be used. Here’s an example of what the default file looks like:

eutils = [
tool : 'biogroovy',
email :'goofy@disney.com'
]
journaltocs.userid='goofy@disney.com'
entrezajax.userid='5fdfa72e58fa4f2b4b0b4a3a3ab2a9b4'

The biogroovy.conf file is actually a Groovy file that is parsed into a Groovy ConfigObject. The eutils block declares the tool and email properties that need to be sent with each eUtils request; the next line sets the journaltocs.userid property, and the last line sets the EntrezAjax user ID. In each case, you’ll need to replace the default values with your own. The links in this paragraph will take you to the registration pages for these services.

After you’ve configured the biogroovy.conf file, you can run the EntrezGeneSlurperTest, and see the results.


Tinkering With BioGroovy 1.1

Since the initial release of BioGroovy, a lot has changed, and the library has continued to grow substantially. With the recent BioGroovy 1.1 release, I thought I would review some of the changes, and update the information on how you can get started using BioGroovy. Here’s a brief list of some of the changes:

  • Support for new model objects, including Drug and ClinicalTrial.
  • A new search engine client API with support for EntrezGene, PubMed, and ClinicalTrials.gov.
  • A refactored IO framework that supports:
    • direct fetching of data, mapped into model objects (“fetch KRAS from EntrezGene [id=3845] and return the result as a Gene object”);
    • caching of data in your local file system, to make unit testing your code easier and to reduce the load on external web services;
    • new fetchers for EntrezGene, PubMed, UniProt, MyGene.info, OMIM, ClinicalTrials.gov, JournalTOCs, ChEMBL, and PubChem;
    • both JSON and XML results;
    • web service identity.

In addition to these changes, we’re also publishing the BioGroovy binaries, source and documentation through Bintray.  This means that you’ll want to update your .groovy/grapeConfig.xml using the instructions found here.
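For reference, a grapeConfig.xml that resolves dependencies from a Maven-style repository generally looks like Groovy's default below, with an extra resolver added to the chain. The jcenter entry here is illustrative; use the repository URL from the linked instructions:

```xml
<ivysettings>
  <settings defaultResolver="downloadGrapes"/>
  <resolvers>
    <chain name="downloadGrapes" returnFirst="true">
      <filesystem name="cachedGrapes">
        <ivy pattern="${user.home}/.groovy/grapes/[organisation]/[module]/ivy-[revision].xml"/>
        <artifact pattern="${user.home}/.groovy/grapes/[organisation]/[module]/[type]s/[artifact]-[revision](-[classifier]).[ext]"/>
      </filesystem>
      <!-- extra resolver for the BioGroovy artifacts (URL is illustrative) -->
      <ibiblio name="jcenter" root="https://jcenter.bintray.com/" m2compatible="true"/>
      <ibiblio name="ibiblio" m2compatible="true"/>
    </chain>
  </resolvers>
</ivysettings>
```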

BioGroovy Models
BioGroovy uses POGOs (Plain Old Groovy Objects) to hold commonly accessible data. In a typical use case, you might want to fetch a list of genes from EntrezGene and write the results out to an Excel file, a CSV file, or a database. With the 1.1 release, we’ve added support for ClinicalTrial objects, Drugs, Journals, and RSS feeds. We’ve also added support for clustering, letting users generate graphs of data that can be rendered using Cytoscape. For example, you can cluster a set of genes by GO terms and export the result as a SIF file using the Go2SIFClusterWriter. You can cluster articles by keywords, journal, or MeSH terms. You can also use the FrequencyMap object to create a simple map of the number of occurrences of a particular object.
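The FrequencyMap idea is simple enough to sketch in a few lines. The class below is a hypothetical illustration of counting occurrences, not BioGroovy's actual FrequencyMap API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FrequencyMapSketch {

    // Count occurrences of each item, preserving first-seen order,
    // mirroring what a frequency map provides.
    static <T> Map<T, Integer> frequencies(List<T> items) {
        Map<T, Integer> freq = new LinkedHashMap<>();
        for (T item : items) {
            freq.merge(item, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // e.g. counting MeSH terms across a set of articles
        List<String> meshTerms =
            List.of("Neoplasms", "DNA Repair", "Neoplasms", "Apoptosis", "Neoplasms");
        System.out.println(frequencies(meshTerms));
        // prints {Neoplasms=3, DNA Repair=1, Apoptosis=1}
    }
}
```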
