BioGroovy: Fetching PubMed Data

As part of the series of blog entries on BioGroovy, I thought I’d start with some rather simple tasks using some of the functionality found in NCBI’s eUtils API.

NCBI offers both RESTful and SOAP APIs, but in the interest of keeping this simple, I’m going to stick with the REST API. The API consists of a series of CGI programs that you can call via a URL. We’ll be using the eFetch and eSearch services in these examples. To try out any of these samples, simply copy and paste the code into the Groovy Console and click the Execute button. If you haven’t already installed Groovy, you’ll need to download it from here and follow the installation instructions.

Let’s suppose that you have a list of PubMed IDs that you’d like to turn into a series of citations. The eFetch program allows you to fetch data in a variety of different formats (as specified by the retmode parameter). You can find out more about eFetch and the other eUtils programs here. In this particular case, since we’re interested in retrieving our results as XML, we specify retmode=xml. We also need to set the database (db) parameter to “pubmed”. We can insert a comma-delimited list of PubMed IDs in the id parameter (id=11877539,11822933,11871444).
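
As a rough sketch, the query string described above can be assembled from its parts. The base URL is an assumption (it is the public eUtils endpoint as I understand it), and the variable names baseUrl, pmids, params and fetchUrl are only illustrative.

```groovy
// Sketch only: assemble an eFetch URL from the parameters described above.
// Assumption: baseUrl is the public eUtils eFetch endpoint.
def baseUrl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

// The PubMed IDs mentioned in the text.
def pmids = [11877539, 11822933, 11871444]

// db selects the database, id takes a comma-delimited list, retmode selects the format.
def params = [
    db     : "pubmed",
    id     : pmids.join(","),
    retmode: "xml"
]

// Turn the map into key=value pairs joined by '&'.
def query = params.collect { key, value -> "${key}=${value}" }.join("&")
def fetchUrl = "${baseUrl}?${query}".toString()

println fetchUrl
```

Pasting the printed URL into a browser is a quick way to look at the raw XML before writing any parsing code.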

def eUrl = ",11822933,11871444&retmode=xml"
// Parse the eFetch response, then print the PMID and title of each article
def doc = new XmlSlurper().parse(new URL(eUrl).openStream())
doc.PubmedArticle.each { article ->
    println "pmid: ${article.MedlineCitation.PMID.text()} title: ${article.MedlineCitation.Article.ArticleTitle.text()} \n"
}

We use the XmlSlurper class to parse the XML and give us back a navigable view of the document. We can then use GPath (a Groovy-fied form of XPath) to extract specific parts of it. Here we want to extract the PubMed ID (PMID) and the ArticleTitle. To understand the GPath statements, it helps to look at the XML itself. In the example above, we’re using GPath’s dot notation to traverse the XML: article.MedlineCitation.PMID.text() walks from each PubmedArticle element down through MedlineCitation to PMID and returns that element’s text.
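
To make the element nesting easier to see without calling the service, here is a minimal sketch that runs the same GPath expressions against a hand-written, heavily trimmed XML fragment. The PMIDs and titles in it are placeholders rather than real citations, and the import line assumes Groovy 3 or later (on Groovy 2.x, XmlSlurper is available without an import).

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; drop this line on Groovy 2.x

// Trimmed stand-in for an eFetch response. All values are placeholders.
def sampleXml = '''
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <Article>
        <ArticleTitle>Placeholder title one</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>222</PMID>
      <Article>
        <ArticleTitle>Placeholder title two</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
'''.trim()

def sampleDoc = new XmlSlurper().parseText(sampleXml)

// Each step in the dot notation descends one element level.
def citations = sampleDoc.PubmedArticle.collect { article ->
    [pmid : article.MedlineCitation.PMID.text(),
     title: article.MedlineCitation.Article.ArticleTitle.text()]
}

citations.each { println "pmid: ${it.pmid} title: ${it.title}" }
```

Collecting the values into a list of maps, rather than printing them directly, keeps the citations around for later formatting.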

Searching for PubMed Data
Suppose instead of just fetching a set of PubMed articles, you want to search for PubMed articles. How can you do this programmatically?

def eUrl = ""
// Parse the eSearch response, then print each PubMed ID in the result list
def doc = new XmlSlurper().parse(new URL(eUrl).openStream())
doc.IdList.Id.each { pmid ->
    println "pmid: ${pmid.text()} \n"
}

Using the eSearch tool, we can construct a query term. In this instance, I’m interested in looking for articles involving histone deacetylases and fungi. To make the URL valid, I use %20 instead of spaces. In a later article, I’ll show you how to construct URLs programmatically, thus making it easier to replace and format search terms.
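
In the meantime, here is a small sketch of the two pieces involved: encoding a search term so that spaces become %20, and reading the IDs out of an eSearch-style result. The base URL and the exact search term are assumptions, and the XML fragment is a hand-written stand-in with placeholder IDs, not real output from the service. The import line assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; drop this line on Groovy 2.x

// Assumption: public eUtils eSearch endpoint and an illustrative search term.
def baseUrl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
def term = "histone deacetylase fungi"

// URLEncoder turns spaces into '+', so swap those for %20 afterwards.
// A literal '+' in the term is already encoded as %2B and is not affected.
def encodedTerm = URLEncoder.encode(term, "UTF-8").replace("+", "%20")
def searchUrl = "${baseUrl}?db=pubmed&term=${encodedTerm}".toString()
println searchUrl

// Hand-written stand-in for an eSearch response. The IDs are placeholders.
def sampleResult = '''
<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
  </IdList>
</eSearchResult>
'''.trim()

def result = new XmlSlurper().parseText(sampleResult)

// Same traversal as in the example above: IdList, then every Id child.
def pmids = result.IdList.Id.collect { it.text() }
println pmids.join(",")
```

The comma-joined list printed at the end has the same shape as the id parameter that eFetch expects, so the output of a search can be fed straight into a fetch.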


About Mark Fortner

I write software for scientists doing drug discovery and cancer research. I'm interested in Design Thinking, Agile Software Development, Web Components, Java, JavaScript, Groovy, Grails, MongoDB, Firebase, microservices, the Semantic Web, Drug Discovery, and Cancer Biology.
