PubMed
For most researchers, PubMed is the go-to resource for biomedical literature. From a programmatic standpoint, however, it has real drawbacks that make it difficult to integrate into many informatics applications. Let’s take a look at a typical application.
Suppose we have an internal application used for target identification and tracking, and we want to add the ability to run literature searches and attach selected hits to a specific target.
Doing this with PubMed’s E-utilities API requires two calls, each returning overly verbose results that must be parsed. (Click on the Sample Call links below to see what each call looks like, along with the server response.)
- Perform a search, and get a list of PubMed IDs. [Sample Call]
- Fetch the PubMed records, allow the user to review them, and then save the selected records. [Sample Call]
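The two calls above can be sketched in Python roughly as follows. This is a minimal illustration, not the application's actual code: the helper names (`esearch_url`, `efetch_url`, `parse_pmids`) are made up for this example, the sample response is a hand-trimmed stand-in for a real ESearch reply, and its second ID is a placeholder.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """Build the URL for step 1: search PubMed and get back a list of IDs."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """Build the URL for step 2: fetch the full XML records for those IDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    # safe="," keeps the comma-separated ID list readable in the URL.
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params, safe=',')}"


def parse_pmids(esearch_xml):
    """Pull the PubMed IDs out of an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [id_elem.text for id_elem in root.findall("./IdList/Id")]


# Hand-trimmed stand-in for an ESearch response (second ID is a placeholder).
SAMPLE_ESEARCH_XML = """
<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>28094263</Id>
    <Id>12345678</Id>
  </IdList>
</eSearchResult>
"""

pmids = parse_pmids(SAMPLE_ESEARCH_XML)
print(esearch_url("pancreatic cancer", retmax=5))
print(efetch_url(pmids))
```

Note that nothing in the first response can be shown to a user; the second request is needed before a single title appears on screen.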
The first problem is that the search only returns IDs and search metadata. It doesn’t return titles, or abstracts, or anything else that a user might find useful in making a decision about which article to download or view.
The second problem is that the records returned by a fetch are too verbose. The response is only available in XML, not JSON, and this has a performance impact. For example, every date in a record is split into separate year, month, and day elements:
<DateCreated>
<Year>2017</Year>
<Month>01</Month>
<Day>27</Day>
</DateCreated>
The same date could be expressed in either of these more compact forms:
<date-created date="2017-01-27"/>
<date-created year="2017" month="01" day="27"/>
That’s 79 characters vs 33 or 47 (depending on which format you prefer).
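In practice, client code ends up collapsing these date elements itself. A minimal sketch, assuming the date block carries `Year`, `Month`, and `Day` child elements as in PubMed's XML; the function name `compact_date` is made up for this example.

```python
import xml.etree.ElementTree as ET


def compact_date(date_elem):
    """Join the Year/Month/Day children into a single ISO 8601 date string."""
    return "-".join(date_elem.findtext(tag) for tag in ("Year", "Month", "Day"))


verbose = (
    "<DateCreated>"
    "<Year>2017</Year><Month>01</Month><Day>27</Day>"
    "</DateCreated>"
)

date = compact_date(ET.fromstring(verbose))
compact = f'<date-created date="{date}"/>'
print(compact, len(compact))
```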
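A simple author name appears like this:
<Author ValidYN="Y">
<LastName>Tao</LastName>
<ForeName>Huimin</ForeName>
<Initials>H</Initials>
</Author>
when a single element would do:
<author lastname="Tao" firstname="Huimin" initials="H"/>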
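The same flattening applies to authors. The sketch below assumes the `LastName`, `ForeName`, and `Initials` child elements used in PubMed's XML and maps them to the compact attribute names shown above; `flatten_author` is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET


def flatten_author(author_elem):
    """Map PubMed's nested author elements onto a flat dictionary."""
    return {
        "lastname": author_elem.findtext("LastName", default=""),
        "firstname": author_elem.findtext("ForeName", default=""),
        "initials": author_elem.findtext("Initials", default=""),
    }


SAMPLE_AUTHOR_XML = """
<Author ValidYN="Y">
  <LastName>Tao</LastName>
  <ForeName>Huimin</ForeName>
  <Initials>H</Initials>
</Author>
"""

author = flatten_author(ET.fromstring(SAMPLE_AUTHOR_XML))
print(author)
```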
EuropePMC
Perhaps the best alternative to PubMed is EuropePMC. The database includes both PubMed abstracts and PubMed Central full-text articles, and the EuropePMC API offers both XML and JSON response formats. Let’s revisit our previous workflow and see how EuropePMC’s API differs from PubMed’s.
- Perform a search. [Sample Call]
- Fetch the selected records [Sample Call]
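As a rough sketch, both steps can go through EuropePMC's REST search endpoint. The helper names here are invented for illustration, and fetching a record by querying on `EXT_ID` and `SRC` with `resultType=core` is one plausible approach rather than necessarily what the sample calls use.

```python
from urllib.parse import urlencode

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def epmc_search_url(query, page_size=25):
    """Step 1: keyword search, asking for JSON instead of the default XML."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EPMC_SEARCH}?{urlencode(params)}"


def epmc_record_url(pmid):
    """Step 2: look up one PubMed-sourced record with its full metadata."""
    params = {
        "query": f"EXT_ID:{pmid} AND SRC:MED",
        "format": "json",
        "resultType": "core",  # "core" requests the full record, not the summary
    }
    return f"{EPMC_SEARCH}?{urlencode(params)}"


print(epmc_search_url("pancreatic cancer"))
print(epmc_record_url("28094263"))
```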
One of the first things you’ll notice is that the search results actually contain useful information. In the sample below, we can see a title, the DOI, a well-formatted author string, and the journal. We can even see whether the article has text-mined terms associated with it.
{
  id: "28094263",
  source: "MED",
  pmid: "28094263",
  doi: "10.1038/nrclinonc.2017.3",
  title: "Pancreatic cancer: Pancreatic cancer cells digest extracellular protein.",
  authorString: "Sidaway P.",
  journalTitle: "Nat Rev Clin Oncol",
  pubYear: "2017",
  journalIssn: "1759-4774; 1759-4782; ",
  pubType: "journal article",
  isOpenAccess: "N",
  inEPMC: "N",
  inPMC: "N",
  hasPDF: "N",
  hasBook: "N",
  citedByCount: 0,
  hasReferences: "N",
  hasTextMinedTerms: "N",
  hasDbCrossReferences: "N",
  hasLabsLinks: "Y",
  epmcAuthMan: "N",
  hasTMAccessionNumbers: "N"
}
What makes this especially useful is that the results can be used directly in a user interface, and they contain enough information for a user to decide whether an article is worth a closer look.
You can also fetch text-mined terms, such as genes, diseases, and chemicals, from EuropePMC records. [Sample Call] The sample call returns all terms for a particular record. One of those terms is the chemical taxol, which is used as a chemotherapeutic agent. The compound metadata includes information from the ChEBI chemical database.