BD2K – The NIH’s Big Data To Knowledge Effort

The topic at a recent Bioinformatics lunch discussion was the NIH’s Big Data To Knowledge (BD2K) program.  The BD2K’s Mission Statement (below) gives you some ideas for where the program is heading…

BD2K is a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement.

The BD2K initiative addresses four major aims that, in combination, are meant to enhance the utility of biomedical Big Data:

  • To facilitate broad use of biomedical digital assets by making them discoverable, accessible, and citable.
  • To conduct research and develop the methods, software, and tools needed to analyze biomedical Big Data.
  • To enhance training in the development and use of methods and tools necessary for biomedical Big Data science.
  • To support a data ecosystem that accelerates discovery as part of a digital enterprise.

Overall, the focus of the BD2K program is to support the research and development of innovative and transforming approaches and tools to maximize and accelerate the integration of Big Data and data science into biomedical research. [BD2K Mission Statement]

But as the interview below with project director and noted bioinformaticist Phil Bourne indicates, the scope of the initiative spans the entire research life cycle (ideas, hypotheses, experiments, data, analysis, comprehension, dissemination). In concrete terms, that means tools for authoring, collecting, analyzing and visualizing data.  In part, the project hopes to address some of the issues around reproducibility, discoverability and provenance of data and software that have impeded the ability of industry to leverage academic research. [Assessing Credibility Via Reproducibility] [Reproducibility and Provenance]

The Commons is a program within the BD2K initiative that seeks to improve the discoverability of data and to provide open APIs (for data and tools), unique IDs for research objects, and containers for packaged applications that run in cloud and HPC environments.

My goal at this month’s meeting was to learn more about the program from some of the local participants.

Ben Good from the Su Lab at The Scripps Research Institute described some of the work they were doing.  Earlier this year, they held a Network of Biothings/BD2K Hackathon to kickstart some projects.  During that meeting the

Chunlei Wu, also of the Su Lab, is developing a “Community Platform for Data Wrangling of Gene and Genetic Variant Annotations”.

At the PDB, Peter Rose has been working on a compression technique for 3D protein structures as part of the BD2K’s Targeted Software Development program.  This technique makes it possible to stream complex 3D structures in the same way that you might stream a YouTube video.

Posted in Uncategorized

Fetching Data With BioGroovy

With BioGroovy 1.1, we’ve added support for fetching data from a variety of RESTful web services, including EntrezGene, PubMed, UniProt and many others.

Fetching Data
In this example, we’ll fetch gene information from EntrezGene for 3 genes, and output the result to the console.

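@Grab(group='org.biogroovy', module='biogroovy', version='1.1')
import org.biogroovy.models.Gene;
// EntrezGeneSlurper also needs an import; its package is not shown in this post.

EntrezGeneSlurper slurper = new EntrezGeneSlurper();

println "Gene"
println "Symbol\tEntrezGeneID\tName"

// Fetch three genes by their comma-delimited EntrezGene IDs.
List<Gene> geneList = slurper.fetchAll('675,1034,133');
geneList.each { Gene gene ->
    // gene.name is an assumption; the original snippet left this placeholder empty.
    println "${gene.symbol}\t${gene.entrezGeneId}\t${gene.name}"
}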

In the slurper.fetchAll() method call, we pass a string containing a comma-delimited list of EntrezGene IDs.  This returns a list of Gene objects.  We then iterate through the gene list and print each result to the console.

Posted in Bioinformatics, Informatics

BioGroovy and Web Service Identity

With the recent release of BioGroovy 1.1, we added support for a number of web services. Web service providers like NCBI’s eUtils, EntrezAjax and JournalTOCs have requirements for tracking usage of their services.  In some cases, a token must be passed along with each request that identifies the tool making the request, or a specific user email address.
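As a rough illustration of what such an identified request looks like, NCBI's E-utilities accept tool and email query parameters alongside the usual ones. The sketch below only builds the URL and does not send it; the buildEfetchUrl helper and the example email address are hypothetical, and this is not how BioGroovy itself constructs its requests.

```groovy
// Hypothetical helper: builds (but does not send) an E-utilities efetch URL
// that carries the tool and email identity parameters.
String buildEfetchUrl(String ids, String tool, String email) {
    Map<String, String> params = [db: 'gene', id: ids, retmode: 'xml', tool: tool, email: email]
    // URL-encode each value so commas and '@' survive the query string.
    String query = params.collect { k, v -> "${k}=${URLEncoder.encode(v, 'UTF-8')}" }.join('&')
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?${query}"
}

// Placeholder identity values; replace with your own tool name and email.
println buildEfetchUrl('675,1034,133', 'biogroovy', 'you@example.org')
```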

To support this type of interaction, a BioGroovyConfig class was added.  In this example, we’ll see how the EntrezGeneSlurper class takes advantage of BioGroovyConfig.  In the constructor for EntrezGeneSlurper you’ll see the following snippet:

ConfigObject conf = BioGroovyConfig.getConfig();
this.tool = conf.eutils.tool;

The first line looks for a biogroovy.conf file in your ~/.biogroovy directory. If it doesn’t see a file there, it copies the default configuration into this directory, and throws an exception, letting the user know that they’ve failed to configure the biogroovy.conf file properly. The default configuration file does not contain any real identity information, and so it must be updated with real information in order to be used. Here’s an example of what the default file looks like:

eutils = [
    tool : 'biogroovy',
    email : ''
]
// The journaltocs.userid and EntrezAjax userid settings follow here (not shown).

The biogroovy.conf file is in reality a Groovy file that is parsed as a Groovy ConfigObject. In the first line, we’re declaring the eutils properties, tool and email, that need to be sent with each request. In the second line we’re setting the journaltocs.userid property, and in the third line, the EntrezAjax userid. In each of these cases, you’ll need to replace these default values with your own values.  The links in this paragraph will take you to the registration pages for these services.
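The lookup-and-parse behaviour described above can be sketched in plain Groovy. This is a minimal sketch under stated assumptions, not BioGroovyConfig's actual implementation: the loadConfig helper, the exception type and the message text are all hypothetical, and a temporary directory stands in for the user's home directory. Only ConfigSlurper and ConfigObject are standard Groovy classes.

```groovy
import java.nio.file.Files

// Hypothetical helper that mirrors the behaviour described in this post.
ConfigObject loadConfig(File homeDir, String defaultText) {
    File confDir = new File(homeDir, '.biogroovy')
    File confFile = new File(confDir, 'biogroovy.conf')
    if (!confFile.exists()) {
        // First run: install the default file, then fail loudly so the user edits it.
        confDir.mkdirs()
        confFile.text = defaultText
        throw new IllegalStateException(
            "Created ${confFile.absolutePath}; edit it with your own identity details.")
    }
    // The conf file is plain Groovy, so ConfigSlurper can parse it directly.
    return new ConfigSlurper().parse(confFile.toURI().toURL())
}

String defaults = '''
eutils = [
    tool : 'biogroovy',
    email : ''
]
'''

File home = Files.createTempDirectory('biogroovy-home').toFile() // stand-in for ~
try {
    loadConfig(home, defaults)   // first call installs the default file and throws
} catch (IllegalStateException e) {
    println e.message
}
ConfigObject conf = loadConfig(home, defaults) // second call parses the file
println conf.eutils.tool
```

Failing on the first run, rather than silently using the defaults, keeps requests from going out without real identity information.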

After you’ve configured the biogroovy.conf file, you can run the EntrezGeneSlurperTest, and see the results.

Posted in Bioinformatics, Informatics

Tinkering With BioGroovy 1.1

Since the initial release of BioGroovy, a lot has changed, and the library has continued to grow substantially. With the recent BioGroovy 1.1 release, I thought I would review some of the changes, and update the information on how you can get started using BioGroovy. Here’s a brief list of some of the changes:

  • Support for new model objects, including Drug, ClinicalTrial
  • A new search engine client API with support for EntrezGene, PubMed, and
  • Refactored IO framework that supports:
    • direct fetching of data, and mapping into model objects (“fetch KRAS from EntrezGene [id=3845] and return the result as a Gene object”).
    • caching of data in your local file system to make unit testing your code easier and to reduce the load on external web services.
    • New fetchers to support fetching data from EntrezGene, PubMed, UniProt, OMIM, JournalTOCs, ChEMBL, and PubChem
    • The frameworks also support either the use of JSON results, or XML.
    • Support for web service identity.
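
The local caching idea in the list above can be illustrated with a cache-first fetch. This is a rough sketch only: fetchCached, the one-file-per-ID layout and the fake download closure are assumptions made for illustration, not BioGroovy's IO framework.

```groovy
import java.nio.file.Files

// Hypothetical cache-first fetch: `download` stands in for a real web service call.
String fetchCached(File cacheDir, String id, Closure<String> download) {
    File cached = new File(cacheDir, "${id}.xml")
    if (!cached.exists()) {
        // Cache miss: call the service once and store the response on disk.
        cacheDir.mkdirs()
        cached.text = download(id)
    }
    return cached.text
}

File cacheDir = Files.createTempDirectory('biogroovy-cache').toFile()
int downloads = 0
// Fake service that counts how often it is called; no network access involved.
Closure<String> fakeDownload = { String id ->
    downloads++
    "<gene id='${id}'/>".toString()
}

String first = fetchCached(cacheDir, '3845', fakeDownload)
String second = fetchCached(cacheDir, '3845', fakeDownload) // served from disk
println "downloads: ${downloads}"
```

Because the second call reads from disk, unit tests can run repeatedly without hitting the external service.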

In addition to these changes, we’re also publishing the BioGroovy binaries, source and documentation through Bintray.  This means that you’ll want to update your .groovy/grapeConfig.xml using the instructions found here.

BioGroovy Models
BioGroovy uses POGOs (Plain Old Groovy Objects) to hold commonly accessible data. In a typical use case, you might want to fetch a list of Genes from EntrezGene, and write the results out to an Excel file, a CSV file, or a database. With the 1.1 release, we’ve added support for ClinicalTrial objects, Drugs, Journals, and RSS feeds.  We’ve also added support for clustering, to let users generate graphs of data that can be rendered using Cytoscape.  For example, you can cluster a set of genes by GO terms, and export the result as a SIF file using the Go2SIFClusterWriter. You can cluster articles by keywords, journal or MeSH terms.  You can also use the FrequencyMap object to create a simple map of the number of occurrences of a particular object.
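
To give a feel for what a frequency map and a SIF export contain, here is a minimal sketch in plain Groovy. It does not use FrequencyMap or Go2SIFClusterWriter; the gene and term names are placeholders rather than real annotations, and the has_go_term interaction label is an assumption.

```groovy
// Placeholder data: gene symbol -> GO terms (not real annotations).
Map<String, List<String>> goTermsByGene = [
    geneA: ['DNA repair', 'cell cycle'],
    geneB: ['cell cycle', 'transcription'],
    geneC: ['transcription']
]

// Frequency map: how many genes carry each GO term.
Map<String, Integer> termCounts = goTermsByGene.values().flatten().countBy { it }

// SIF lines: source <tab> interaction <tab> target, one per gene/term pair.
List<String> sifLines = goTermsByGene.collectMany { gene, terms ->
    terms.collect { term -> "${gene}\thas_go_term\t${term}".toString() }
}

termCounts.each { term, count -> println "${term}: ${count}" }
sifLines.each { println it }
```

Genes that share a GO term end up connected to the same term node, which is what lets Cytoscape draw them as a cluster.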

Posted in Bioinformatics

Network of BioThings Hackathon #2

This weekend’s Network of BioThings hackathon (#hackNoB) was a great success with some really innovative projects making great strides in a short amount of time.

The event was hosted by Jeff Grethe of the Neuroscience Information Framework group at UCSD, organized by Ben Good (Su Lab @ TSRI) and Dexter Pratt (NDEx project) with the help of many others, and sponsored by NDEx, the San Diego Center for Systems Biology, and the International Society for Biocuration.

This year’s winner was the Citizen Science team, who hacked the BRAT web-based document annotation tool.  This application lets citizen scientists annotate abstracts with gene, drug, and disease information along with the connections between these semantic types. A tool like this promises to make it easier to extract relevant facts from the avalanche of publications.

This may, at some point, be used to feed annotated journal articles into a project like CIViC (Clinical Interpretation of Variants in Cancer).

In second place was the SBiDer project (a tool for developing Synthetic Biocircuits).


Posted in Bioinformatics, Cancer Research

Google Docs for Scientists

Science is inherently a collaborative effort, and at least once a month I encounter someone who mentions in passing some trial or tribulation they had when sharing documents.  The story usually goes like this…

We were working on a presentation/paper for a meeting.  Everyone had last-minute changes, new data to share, and somehow, someone accidentally picked up the wrong version of the document and started editing.  Everyone was frustrated because they had to get their updates in, and they were all waiting for Joe to finish his changes.  Joe went to lunch and left the file open and no one could get any work done, etc.


Posted in Informatics, Science Blogging, Uncategorized

Google+ For Scientists

Recently, I’ve been thinking about the role that social media plays in science. And while friends are fond of pointing out that I’ve drunk the Kool-Aid, colleagues at various labs usually shoot me a quizzical look when I bring up the subject.

Perhaps the biggest benefit of social media is that it acts like a peer-reviewed lens that brings the latest developments in your field into view: a conference that runs 24/7 and acts as a form of social democratization for scientific thought. Someone you wouldn’t dream of approaching in the real world is instantly more accessible on Twitter, Facebook or Google Plus. They are also more frank than they might be if you approached them in person.


Posted in Science Blogging, Social Media