Downloading PubMed Articles

The PubMed Central Open Access archive contains the full text of a subset of the articles available through PubMed. It’s very useful for text mining. In this blog entry, I’ll show you how to search for and download the XML-encoded articles using a simple BioGroovy shell script.

To perform the search, we’re going to use NCBI’s eUtils eSearch service. We set the database to pmc (PubMed Central) and provide a search string. In this particular example we want to look for articles related to metastasis in pancreatic cancer cases. We want to constrain the results to only those PMC articles that provide the full text of the article, so we add the “free+fulltext[filter]” search filter to the query. In the URL the square brackets are percent-encoded as %5B and %5D.
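
As a minimal sketch of how such a query URL can be put together, the snippet below URL-encodes a plain-text search term with Java's standard URLEncoder. The helper name buildSearchUrl is hypothetical and not part of BioGroovy or the script in this post; the base URL is the public eSearch endpoint.

```groovy
// Hypothetical helper (not part of the original script): builds an eSearch URL
// for the given database and plain-text search term.
String buildSearchUrl(String db, String term) {
    String base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
    // URLEncoder turns spaces into "+" and "[" / "]" into %5B / %5D.
    String encoded = URLEncoder.encode(term, 'UTF-8')
    return "${base}?db=${db}&term=${encoded}"
}

def searchUrl = buildSearchUrl('pmc', 'metastasis AND pancreatic cancer AND free fulltext[filter]')
println searchUrl
```

Encoding the term programmatically avoids hand-escaping mistakes when the search string changes.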

We use Groovy’s XmlSlurper class to download and parse the eSearch results. We’re interested in getting the PMC ID (pmcid) of each article. The XML returned from eSearch looks something like this:

<eSearchResult>
  <Count>1451</Count>
  <RetMax>20</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>1388210</Id>
    <Id>2689419</Id>
    <Id>2504036</Id>
    <Id>1971219</Id>
    <Id>2020767</Id>
    <Id>2435493</Id>
    <Id>2685435</Id>
    <Id>2692690</Id>
    <Id>2481517</Id>
    <Id>2094388</Id>
    <Id>2694379</Id>
    <Id>2639538</Id>
    <Id>2225522</Id>
    <Id>2077912</Id>
    <Id>1361347</Id>
    <Id>1803032</Id>
    <Id>2276419</Id>
    <Id>2292750</Id>
    <Id>2652408</Id>
    <Id>2700408</Id>
  </IdList>
  ...
</eSearchResult>

The XPath expression used to get these nodes is “/eSearchResult/IdList/Id”. As a GPath expression this looks like “IdList.Id”. Note that we don’t use the root node name (“eSearchResult”) in the GPath expression, because the object returned by XmlSlurper already represents the root element. Indeed, if you do try to use it, the expression will return no results. Element names are case-sensitive, so they must match the XML exactly.
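
The root-node behaviour is easy to check offline. The snippet below is only an illustration: it parses a shortened, hand-written sample (two IDs taken from the response above) rather than a live eSearch reply.

```groovy
// Groovy 3 and later; on Groovy 2.x remove this import (XmlSlurper is in groovy.util there).
import groovy.xml.XmlSlurper

// Shortened, hand-written sample of an eSearch response.
def sample = '''<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>1388210</Id>
    <Id>2689419</Id>
  </IdList>
</eSearchResult>'''

def doc = new XmlSlurper().parseText(sample)

// doc already is the root element, so the path starts one level below it.
def ids = doc.IdList.Id.collect { it.text() }
println ids

// Naming the root again looks for a child called eSearchResult, which does not exist.
def none = doc.eSearchResult.IdList.Id.collect { it.text() }
println none
```

The first expression yields the two IDs; the second yields an empty list rather than an error, which is why a wrong path tends to fail silently.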

To fetch each article we’re going to use eFetch, another NCBI REST service, which downloads specific data records from NCBI’s databases. We iterate over the IDs and append each one to the eFetch URL. For each article I construct a file name from its ID and write the article to its own file.

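#!/usr/bin/env groovy
/**
 * This script searches PubMed Central for pancreatic cancer metastasis papers that are
 * available as free full text. It then downloads each of these papers in XML format.
 */
// Groovy 3 and later; on Groovy 2.x remove this import (XmlSlurper is in groovy.util there).
import groovy.xml.XmlSlurper

// eSearch query against the pmc database; %5B and %5D are the encoded "[" and "]".
def eUrl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=metastasis+AND+pancreatic+cancer+AND+free+fulltext%5Bfilter%5D"

// Base eFetch URL; the article ID is appended for each download.
def fetchUrl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc"

// Download and parse the search result; doc represents the root eSearchResult element.
def doc = new URL(eUrl).withInputStream { new XmlSlurper().parse(it) }
doc.IdList.Id.each { id ->
    println "pmcid: ${id}"
    String url = "${fetchUrl}&id=${id}"
    // Stream the article XML into <id>.xml; both streams are closed automatically.
    new File("${id}.xml").withOutputStream { out ->
        new URL(url).withInputStream { out << it }
    }
}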
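
Note that the sample response above reports 1451 hits but lists only 20 IDs, so a single eSearch call covers just the first page of results. eSearch accepts retstart and retmax parameters for paging. The following is a rough sketch of how all IDs could be collected; the names collectAllIds and ncbiPage are hypothetical, and the half-second pause is my own assumption for staying well under NCBI's request-rate guidelines.

```groovy
// Groovy 3 and later; on Groovy 2.x remove this import (XmlSlurper is in groovy.util there).
import groovy.xml.XmlSlurper

// Hypothetical helper: calls fetchPage(retstart, retmax) until every hit has been seen.
// fetchPage must return a parsed eSearchResult document.
List<String> collectAllIds(int pageSize, Closure fetchPage) {
    List<String> ids = []
    int total = Integer.MAX_VALUE   // replaced by the real Count after the first page
    for (int start = 0; start < total; start += pageSize) {
        def doc = fetchPage(start, pageSize)
        total = doc.Count.text().toInteger()
        ids.addAll(doc.IdList.Id.collect { it.text() })
    }
    return ids
}

def searchUrl = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=metastasis+AND+pancreatic+cancer+AND+free+fulltext%5Bfilter%5D'

// Fetches one page of results from NCBI; the pause (an assumed value) keeps the request rate low.
def ncbiPage = { int start, int max ->
    sleep(500)
    new URL("${searchUrl}&retstart=${start}&retmax=${max}").withInputStream { new XmlSlurper().parse(it) }
}

// Uncomment to run against NCBI:
// def allIds = collectAllIds(100, ncbiPage)
// println "Found ${allIds.size()} articles"
```

Passing the page fetcher in as a closure keeps the paging loop separate from the network call, so the loop can be tried out with canned XML before pointing it at NCBI.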

One caveat with this approach is that it works only for the subset of PMC articles that are available in XML format. You can miss a number of significant papers this way, since many authors do not provide their articles in this format. The majority of the articles are available in PDF format, and a large number of these are available only from the publisher’s site, not through NCBI.

About aspenbio

I write software for scientists. I'm interested in Java/Groovy/Grails, the Semantic Web and Cancer Biology.
This entry was posted in Bioinformatics, Informatics, Uncategorized.

2 Responses to Downloading PubMed Articles

  1. Pingback: My NCBI Redesign (Personal Search Saving & More Tool for PubMed searches) « Health and Medical News and Resources

  2. Tim Pizey says:

    Thank you, copying and pasting from the blog to Ubuntu gedit using Firefox gave me illegal double quotes, but once I had that fixed I was away.

    thanks again
    Tim
