PubMed vs EuropePMC: Let’s Get Ready To Rumble

PubMed

For most researchers, PubMed is the go-to resource for all biomedical literature. But from a programmatic standpoint it has some real challenges that make it difficult to integrate into many informatics applications.  Let’s take a look at a typical application.

Suppose we have an internal application used for target identification and tracking, and we want to add the ability to run literature searches and attach selected hits to a specific target.

Doing this with PubMed’s eUtils API requires two calls, each of which returns overly verbose results that must be parsed. (Click on the Sample Call links below to see an example of each call, as well as the server response.)

  1. Perform a search, and get a list of PubMed IDs. [Sample Call]
  2. Fetch the PubMed records, allow the user to review them, and then save the selected records. [Sample Call]
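
The two-step flow above can be sketched in Python roughly as follows. This is a minimal offline sketch, not the code behind the Sample Call links: the helper names (esearch_url, efetch_url, parse_id_list) and the trimmed sample response are made up for illustration, and the HTTP requests themselves are left out. The esearch.fcgi and efetch.fcgi endpoints and the db, term, retmax, id and retmode parameters are the standard eUtils ones.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """Step 1: build the search URL. The response holds only IDs and search metadata."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return EUTILS_BASE + "/esearch.fcgi?" + urlencode(params)


def efetch_url(pmids):
    """Step 2: build the URL that fetches the full XML records for the given IDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return EUTILS_BASE + "/efetch.fcgi?" + urlencode(params)


def parse_id_list(esearch_xml):
    """Pull the PubMed IDs out of an ESearch XML response."""
    root = ET.fromstring(esearch_xml.strip())
    return [id_node.text for id_node in root.findall("./IdList/Id")]


# Hand-written, trimmed-down stand-in for a real ESearch response.
# The second ID is a placeholder, not a record referenced in this post.
SAMPLE_ESEARCH_RESPONSE = """
<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>28094263</Id>
    <Id>12345678</Id>
  </IdList>
</eSearchResult>
"""

if __name__ == "__main__":
    print(esearch_url("pancreatic cancer"))
    pmids = parse_id_list(SAMPLE_ESEARCH_RESPONSE)
    print(efetch_url(pmids))
```

Note that nothing in step 1 can be shown to a user yet; the application has to make the second call before it has a single title to display.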

The first problem is that the search only returns IDs and search metadata. It doesn’t return titles, or abstracts, or anything else that a user might find useful in making a decision about which article to download or view.

The second problem is that the results when fetching PubMed articles are too verbose. The full record is returned as XML rather than JSON, and the extra markup has a performance cost. For example, every date in the record is broken out into separate tags.

<DateCreated>
<Year>2017</Year>
<Month>01</Month>
<Day>27</Day>
</DateCreated>
Rather than:
<date-created date="2017-01-27"/>
or even
<date-created year="2017" month="01" day="27"/>

That’s 79 characters vs 33 or 47 (depending on which format you prefer).
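
The difference also shows up in the parsing code. Below is a small Python sketch, using only the standard library, that turns each of the two forms into a date object. The function names are illustrative, and the compact date-created element is the hypothetical format proposed above, not something PubMed actually returns.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The element as PubMed returns it (copied from the example above).
VERBOSE_DATE = """<DateCreated>
<Year>2017</Year>
<Month>01</Month>
<Day>27</Day>
</DateCreated>"""

# The hypothetical compact alternative suggested in this post.
COMPACT_DATE = '<date-created date="2017-01-27"/>'


def parse_verbose_date(xml_text):
    """Read three child elements and assemble them into a date."""
    node = ET.fromstring(xml_text)
    return date(
        int(node.findtext("Year")),
        int(node.findtext("Month")),
        int(node.findtext("Day")),
    )


def parse_compact_date(xml_text):
    """Read a single ISO 8601 attribute."""
    node = ET.fromstring(xml_text)
    return date.fromisoformat(node.get("date"))


if __name__ == "__main__":
    print(parse_verbose_date(VERBOSE_DATE))
    print(parse_compact_date(COMPACT_DATE))
```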

A simple author name appears like this:

<Author ValidYN="Y">
<LastName>Tao</LastName>
<ForeName>Huimin</ForeName>
<Initials>H</Initials>
</Author>
when this would do:
<author lastname="Tao" firstname="Huimin" initials="H"/>
That’s 107 characters vs 56.
On the surface these seem like niggling complaints, but when you take into account the fact that the record size negatively impacts the speed and responsiveness of your application, and the amount of memory and processing power required to parse the data, then it has some serious implications for your application. For each author or date you could reduce the number of characters by half.
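
The character counts quoted above can be reproduced with a few lines of Python. This sketch assumes the counts include one line-break character for each line of the multi-line PubMed form, which is what yields 79 and 107; the compact forms are single lines and are counted as-is. The constant and function names are illustrative.

```python
# PubMed's verbose forms, one list entry per line as shown above.
VERBOSE_DATE = [
    "<DateCreated>",
    "<Year>2017</Year>",
    "<Month>01</Month>",
    "<Day>27</Day>",
    "</DateCreated>",
]
VERBOSE_AUTHOR = [
    '<Author ValidYN="Y">',
    "<LastName>Tao</LastName>",
    "<ForeName>Huimin</ForeName>",
    "<Initials>H</Initials>",
    "</Author>",
]

# The hypothetical compact alternatives proposed in this post.
COMPACT_DATE_ISO = '<date-created date="2017-01-27"/>'
COMPACT_DATE_PARTS = '<date-created year="2017" month="01" day="27"/>'
COMPACT_AUTHOR = '<author lastname="Tao" firstname="Huimin" initials="H"/>'


def multiline_size(lines):
    """Characters in a multi-line element, counting one line break per line."""
    return sum(len(line) + 1 for line in lines)


def saving(verbose_size, compact_size):
    """Fraction of characters saved by the compact form."""
    return 1 - compact_size / verbose_size


if __name__ == "__main__":
    date_size = multiline_size(VERBOSE_DATE)
    author_size = multiline_size(VERBOSE_AUTHOR)
    print("date:  ", date_size, "vs", len(COMPACT_DATE_ISO), "or", len(COMPACT_DATE_PARTS))
    print("author:", author_size, "vs", len(COMPACT_AUTHOR))
    print("date saving:   {:.0%}".format(saving(date_size, len(COMPACT_DATE_ISO))))
    print("author saving: {:.0%}".format(saving(author_size, len(COMPACT_AUTHOR))))
```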
Aside from the verbosity of the results, though, PubMed does not attempt to text mine abstract data. The record does not contain the gene, protein, pathway or compound information that would make it truly useful in a drug discovery or literature mining application. The closest we come to article metadata are the MeSH (Medical Subject Headings) terms.
Although BioGroovy makes it easy to search, download and parse PubMed records, it (like other libraries and applications) is not immune to the limitations of the eUtils API.

EuropePMC

Perhaps the best alternative to PubMed is EuropePMC. The database includes both PubMed abstracts and PubMed Central full-text articles, and the EuropePMC API offers both XML and JSON response formats. Let’s take a look at our previous workflow, and how EuropePMC’s API differs from PubMed’s.

  1. Perform a search. [Sample Call]
  2. Fetch the selected records [Sample Call]
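
As a rough Python sketch of these two steps, the helpers below build request URLs for the EuropePMC REST search endpoint. The helper names are made up for illustration, and the URLs behind the Sample Call links are not reproduced here. Fetching a single record by searching on EXT_ID and SRC with resultType=core is one way to do step 2 and is an assumption on my part; check the current EuropePMC documentation before relying on it.

```python
from urllib.parse import urlencode

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def search_url(query, result_type="lite", page_size=25):
    """Step 1: search. Each hit already carries title, authors, journal and flags."""
    params = {
        "query": query,
        "format": "json",           # XML is also available
        "resultType": result_type,  # "lite" for summaries, "core" for full metadata
        "pageSize": page_size,
    }
    return EPMC_SEARCH + "?" + urlencode(params)


def record_url(pmid):
    """Step 2 (assumed approach): fetch one PubMed-sourced record with full metadata."""
    return search_url("EXT_ID:{} AND SRC:MED".format(pmid), result_type="core", page_size=1)


if __name__ == "__main__":
    print(search_url("pancreatic cancer"))
    print(record_url("28094263"))
```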

One of the first things you’ll notice is that the search results actually contain useful information. In the sample below, we can see a title, the DOI, a well-formatted author string, and the journal. We can even see whether the article has text-mined terms associated with it.

{
  "id": "28094263",
  "source": "MED",
  "pmid": "28094263",
  "doi": "10.1038/nrclinonc.2017.3",
  "title": "Pancreatic cancer: Pancreatic cancer cells digest extracellular protein.",
  "authorString": "Sidaway P.",
  "journalTitle": "Nat Rev Clin Oncol",
  "pubYear": "2017",
  "journalIssn": "1759-4774; 1759-4782; ",
  "pubType": "journal article",
  "isOpenAccess": "N",
  "inEPMC": "N",
  "inPMC": "N",
  "hasPDF": "N",
  "hasBook": "N",
  "citedByCount": 0,
  "hasReferences": "N",
  "hasTextMinedTerms": "N",
  "hasDbCrossReferences": "N",
  "hasLabsLinks": "Y",
  "epmcAuthMan": "N",
  "hasTMAccessionNumbers": "N"
}

What makes this especially useful is that the results can be used directly in a user interface, and contain enough information to allow a user to decide whether or not the article is potentially useful.
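
To illustrate, here is a minimal Python sketch that turns a search hit into a one-line summary for a results list. The record is a subset of the sample above. The resultList/result envelope around it is my assumption about the shape of the full response, and summarize is a hypothetical helper, not part of any library.

```python
import json

# Subset of the sample record above, wrapped in the assumed response envelope.
SAMPLE_RESPONSE = """
{
  "resultList": {
    "result": [
      {
        "id": "28094263",
        "source": "MED",
        "doi": "10.1038/nrclinonc.2017.3",
        "title": "Pancreatic cancer: Pancreatic cancer cells digest extracellular protein.",
        "authorString": "Sidaway P.",
        "journalTitle": "Nat Rev Clin Oncol",
        "pubYear": "2017",
        "isOpenAccess": "N",
        "hasTextMinedTerms": "N"
      }
    ]
  }
}
"""


def summarize(record):
    """Format one search hit as a single line for a results list."""
    flags = []
    if record.get("isOpenAccess") == "Y":
        flags.append("open access")
    if record.get("hasTextMinedTerms") == "Y":
        flags.append("text-mined terms")
    line = "{authors} {title} {journal} ({year}). doi:{doi}".format(
        authors=record.get("authorString", ""),
        title=record.get("title", ""),
        journal=record.get("journalTitle", ""),
        year=record.get("pubYear", ""),
        doi=record.get("doi", "n/a"),
    )
    return line + (" [" + ", ".join(flags) + "]" if flags else "")


if __name__ == "__main__":
    hits = json.loads(SAMPLE_RESPONSE)["resultList"]["result"]
    for hit in hits:
        print(summarize(hit))
```

No second request is needed to build this list, which is the main practical difference from the eUtils workflow.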

You can also fetch text-mined terms, such as genes, diseases, and chemicals, from EuropePMC records. [Sample Call]  For example, the previous call returns all terms from a particular record. One of those terms is the chemical taxol, which is used as a chemotherapeutic agent. The compound metadata includes information from the ChEBI chemical database.

 


About aspenbio

I write software for scientists. I'm interested in Java/Groovy/Grails, the Semantic Web and Cancer Biology.