PubMed vs EuropePMC: Let’s Get Ready To Rumble


For most researchers, PubMed is the go-to resource for all biomedical literature. But from a programmatic standpoint it has some real challenges that make it difficult to integrate into many informatics applications.  Let’s take a look at a typical application.

Suppose we have an internal application used for target identification and tracking, and we want to add the ability to perform literature searches and attach selected hits to a specific target.

To do this using PubMed's eUtils API requires two calls, each of which returns overly verbose results that must be parsed. (Click the Sample Call links below to see what each call looks like, along with the server's response.)

  1. Perform a search, and get a list of PubMed IDs. [Sample Call]
  2. Fetch the PubMed records, allow the user to review them, and then save the selected records. [Sample Call]
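The two-step flow above can be sketched by building the eUtils URLs directly. The endpoint paths below are the standard public eUtils ones; the query term and PMIDs are just illustrative:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds the two eUtils calls from the steps above.
public class EUtilsUrls {

    static final String BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";

    // Step 1: esearch returns only PMIDs plus search metadata
    static String searchUrl(String term) {
        return BASE + "/esearch.fcgi?db=pubmed&term="
                + URLEncoder.encode(term, StandardCharsets.UTF_8);
    }

    // Step 2: efetch returns the full records (XML only) for those PMIDs
    static String fetchUrl(String... pmids) {
        return BASE + "/efetch.fcgi?db=pubmed&retmode=xml&id=" + String.join(",", pmids);
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("pancreatic cancer"));
        System.out.println(fetchUrl("28094263"));
    }
}
```

You would still need an HTTP client to issue the requests and an XML parser for the efetch response, which is exactly the overhead discussed below.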

The first problem is that the search returns only IDs and search metadata. It doesn't return titles, abstracts, or anything else a user might find useful in deciding which articles to view or download.

The second problem is that fetched PubMed records are far too verbose. The response is available only in XML, not JSON, and that verbosity has a performance impact. For example, every date in the record is expanded into nested tags:

<DateCreated>
    <Year>2017</Year>
    <Month>01</Month>
    <Day>27</Day>
</DateCreated>

rather than:

<date-created date="2017-01-27"/>

or even:

<date-created year="2017" month="01" day="27"/>

That's 79 characters vs 33 or 47 (depending on which format you prefer).

A simple author name appears like this:

<Author ValidYN="Y">
    <LastName>Tao</LastName>
    <ForeName>Huimin</ForeName>
    <Initials>H</Initials>
</Author>

where this would do:

<author lastname="Tao" firstname="Huimin" initials="H"/>

That's 107 characters vs 56.
On the surface these seem like niggling complaints, but record size directly affects the speed and responsiveness of your application, as well as the memory and processing power needed to parse the data. For each author or date, the character count could be cut roughly in half.
Aside from the verbosity of the results, PubMed makes no attempt to text mine abstract data. The record contains no gene, protein, pathway or compound information that would make it truly useful in a drug discovery or literature mining application. The closest we come to article metadata are the MeSH (Medical Subject Headings) terms.
Although BioGroovy makes it easy to search, download and parse PubMed records, it (like other libraries and applications) is not immune to the limitations of the eUtils API.


Perhaps the best alternative to PubMed is EuropePMC. The database includes both PubMed abstracts and PubMed Central full-text articles, and the EuropePMC API provides both XML and JSON response formats. Let's revisit our previous algorithm and see how EuropePMC's API differs from PubMed's.

  1. Perform a search. [Sample Call]
  2. Fetch the selected records. [Sample Call]
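As a rough sketch, the search step collapses into a single call. The /search endpoint and format parameter are part of the public EuropePMC RESTful API; the query is illustrative:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a EuropePMC search URL that returns rich result records as JSON.
public class EuropePmcSearch {

    static final String BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest";

    static String searchUrl(String query) {
        return BASE + "/search?format=json&query="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("pancreatic cancer"));
    }
}
```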

One of the first things you'll notice is that the search results actually contain useful information. In the sample below, we can see the title, the DOI, a well-formatted author string, and the journal. We can even see whether the article has text-mined terms associated with it.

{
  "id": "28094263",
  "source": "MED",
  "pmid": "28094263",
  "doi": "10.1038/nrclinonc.2017.3",
  "title": "Pancreatic cancer: Pancreatic cancer cells digest extracellular protein.",
  "authorString": "Sidaway P.",
  "journalTitle": "Nat Rev Clin Oncol",
  "pubYear": "2017",
  "journalIssn": "1759-4774; 1759-4782; ",
  "pubType": "journal article",
  "isOpenAccess": "N",
  "inEPMC": "N",
  "inPMC": "N",
  "hasPDF": "N",
  "hasBook": "N",
  "citedByCount": 0,
  "hasReferences": "N",
  "hasTextMinedTerms": "N",
  "hasDbCrossReferences": "N",
  "hasLabsLinks": "Y",
  "epmcAuthMan": "N",
  "hasTMAccessionNumbers": "N"
}


What makes this especially useful is that the results can easily be used in a user interface, and they contain enough information for a user to determine whether or not an article is potentially useful.

You can also fetch text-mined terms, such as genes, diseases, and chemicals, from EuropePMC records. [Sample Call]  For example, in the previous call we're returning all terms for a particular record. One of those terms is a record for the chemical taxol, which is used as a chemotherapeutic agent. The compound metadata includes information from the ChEBI chemical database.
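The term-fetching call can be sketched the same way. Note that the source/extId/textMinedTerms/json path pattern below is my reading of the EuropePMC REST documentation, so treat it as an assumption rather than a guaranteed contract:

```java
// Hypothetical sketch: builds a EuropePMC text-mined-terms URL for one record.
// The URL pattern is an assumption based on the EuropePMC RESTful API docs.
public class EuropePmcTerms {

    static final String BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest";

    static String textMinedTermsUrl(String source, String extId) {
        return BASE + "/" + source + "/" + extId + "/textMinedTerms/json";
    }

    public static void main(String[] args) {
        // e.g. the MED record from the search sample above
        System.out.println(textMinedTermsUrl("MED", "28094263"));
    }
}
```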


Posted in Bioinformatics, Informatics

A New Year, A New Site, A New Service

It’s the start of a new year, and nothing says New Year like a website refresh. With the rise in the number of visitors to the site using mobile browsers, we’ve updated the site to make it more mobile friendly. Not only is it more easily viewable on smartphones and tablets, but you can add it to your home screen just like any other app.


That last bit is important because we’re also announcing the launch of a new service simply called Aspen Gene. With it you can look up information on any gene. The service is powered by the web service developed by Chunlei Wu at The Scripps Research Institute’s Su Lab.  To visualize the data, we’ve developed a series of web components and are making them available through the new open source BioPolymer project.


The Aspen Gene Search interface

Let’s take a look at the service. You start by entering the symbol for a gene of interest, in this case KRAS, then tap the “Search” button to start the search.  The search results appear as a series of cards at the bottom of the screen. Tap the arrow icon on a result card, and the gene summary appears as shown below.

Gene Summary

The Summary tab provides an overview of the gene, including its symbol, synonyms, and IDs in related databases.  We’re currently linking to NCBI’s EntrezGene database, Online Mendelian Inheritance in Man (OMIM), the HUGO Gene Nomenclature Committee (HGNC), UniGene, and PharmGKB.  You can tap the icon to the right of a field to open the record in a new window.


Gene Summary Tab

Protein Information

The Protein tab shows the UniProt ID, along with a list of InterPro domains found in the protein.  You can tap on any of the domains to see more information.  The Protein Database section shows a list of PDB IDs. You can tap on any ID to display the associated protein structure.


Protein Information Tab

Pathway Information

The Pathways tab shows a list of all of the pathways that the gene participates in. This includes entries from KEGG, Reactome, PharmGKB, Wikipathways and more. You can tap on any pathway name to see a diagram of the pathway.


Pathway Information Tab



The Publications tab shows a list of GeneRIF (Gene Reference Into Function) publications: papers that indicate the function of a gene, drawn from the NCBI EntrezGene database. Tap a card to display the PubMed record for the article.


Publication Information Tab

So come visit us, and give our new site (and our new service) a try!

Posted in Bioinformatics, Informatics, Science Blogging

Drug Target Identification: And then a miracle occurs


At the front end of most drug discovery programs lies a step called Target Identification. A few months ago I sat down with a colleague to discuss their approach to target identification, and in particular: how do you characterize a target? I was surprised at how much that process can vary from company to company.

As I set out to describe my workflow for this blog post, I was reminded of this cartoon, and how much work goes on between the starting point and the end point when researching the function of genes.

I should preface what I’m about to say with the words “this is the way I work.” Your goals and tools might be different, and I’m always curious about how other people work, so please feel free to comment.

At a macroscopic level (regardless of your ultimate research goals) there are three levels of research:

  1. Foundational Research: where you familiarize yourself with the general “landscape” of a particular research topic.
  2. Deep Dive Research: where you examine certain concepts exposed in step 1 in-depth.
  3. Current Research: where you create a “surveillance” program to keep yourself up-to-date with the latest developments in a particular area of research.

In the examples which follow, I’ll be showing you the steps that I take and the tools that I use to learn more about the target space for pancreatic cancer.

Foundational Research
My goal at this stage in the game is to answer the following questions:

  • What is the etiology of the disease? (What syndromes predispose people to the disease and what percentage of the patient population do they account for?)
  • How does the disease progress? (What are the clinical stages?)
  • What genes & proteins are involved in the progression of the disease?
  • What pathways & disease processes do they participate in?
  • What is the current standard of care, and what genes are targeted by that standard of care?
  • Who are the thought leaders in this area?
  • Is research in this area heating up?

Since my workflow is very disease-centric, I usually start by searching the OMIM database.  OMIM provides a good overview of the disease, with information on the genes involved, and relevant literature.  Recently, I’ve also added Wikipedia to the list.  I’ve been pleasantly surprised by the depth of information available on Wikipedia, both for diseases and for genes.  In addition to these more general sources, the National Cancer Institute’s PDQ site provides a good overview of the clinical stages of the disease and the standards of care applied at each stage.  This information is critical for two reasons: it gives discovery scientists insight into the clinical presentation of the disease, and it makes it possible to design a drug or cocktail that targets a particular patient population.

My usual starting point for most research projects is PubMed.  And I start by looking for review papers on a topic of interest.  In this case my query looks like this:
(pancreatic cancer) AND “review”[Publication Type] 

You can further restrict the results by limiting hits to the last few years. Sorting by publication date also helps focus your attention on the latest developments.  You’ll find more tips and tricks for using PubMed here.
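For instance, a date-restricted version of the earlier review query might look like this (the cutoff date is just an illustrative choice):

```
(pancreatic cancer) AND "review"[Publication Type] AND ("2015/01/01"[Date - Publication] : "3000"[Date - Publication])
```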

As I read through the review papers, I compile a list of genes which I keep in two “piles” — targets and biomarkers.  I also compile a list of pathways, and attempt to connect those to specific biological processes involved in the disease.

Gene-Centric vs Pathway-Centric vs Disease-Centric Workflows
When I first started out in this industry, I thought, perhaps rather naively, that drug discovery research always followed the same path, and consequently that every company used the same approach to identify new drug candidates.  However, I quickly learned that this wasn’t the case.

Some companies used a traditional compound-centric approach to drug discovery.  They would screen a compound through a particular target panel, find some interesting binding characteristics for a target, and then back-track to an indication or set of indications.

In a gene-centric approach, the process starts with a gene.  The function of the gene is determined (at least initially) by the Gene Ontology terms, by literature, by sequence homology, by protein domain, etc. Depending on the drug class (small molecule vs peptide or antibody, siRNA, etc) certain types of genes/transcripts/proteins may be more or less amenable to being addressed. For example, antibodies may be more appropriate for targets that have extracellular domains to which the antibody can attach.

A few years ago, Novartis espoused a more pathway-centric approach to drug discovery, the aim of which was to use signaling pathways to identify new targets: either for monotherapies, as collections of targets for drug cocktails, or for repurposing existing drugs.

In a disease-centric approach, the disease biology, and the genes that drive that biology, are used to drive the strategy for therapeutic development.  This approach, originally pioneered by organizations with a vested interest in particular disease areas, appears to be the most promising.  These organizations, which I loosely classify as “Translational Medicine Companies”, have a great deal of knowledge and experience in a particular indication, and thus tend to take a systems biology approach to identifying potential targets and drug candidates.  Organizations like the Michael J. Fox Foundation and globalCure (an initiative of the Translational Genomics Research Institute to find new treatments for pancreatic cancer) spring to mind.


Posted in Bioinformatics, Cancer Research, Drug Development, Informatics, pancreatic cancer, Science Blogging

Exchanging Drug Pipeline Data

The pharmaceutical R&D environment has always been collaborative in nature, never more so than today. Key to the success of those collaborations is the ability to share information about drug R&D programs with a wide variety of potential partners and investors. Traditionally, pharmaceutical companies rely on expensive databases to identify potential partners. These databases usually do a good job of identifying programs in other pharmaceutical companies, but their results vary widely when it comes to identifying programs in academic labs and small biotech companies. And it’s increasingly these types of organizations that pharmaceutical companies are turning to in an effort to reduce R&D costs and gain specialist expertise in certain indications.

In the past, small biotech companies have relied upon events like BIO and EBD (and the previously mentioned commercial databases) to get on the radar of pharmaceutical companies. However, these events occur only once a year and a year can be a long wait for a startup company.  In addition, any discrepancies in the project information in a commercial database can take months to resolve, which can lead to more lost opportunities.

A cursory survey of pharmaceutical company web sites reveals that despite the dazzling variety of ways that pharmaceutical pipelines are represented, the data is by-and-large the same across all of the sites — target, therapeutic area and class, indication, and project status are a part of every pipeline page. However, because the webpage and the data are tightly bound together, it’s difficult to extract the data programmatically or to search across all of the organizations.

But suppose for a moment, that every drug pipeline, at every company involved in pharmaceutical discovery was just a Google search away.  Suppose, that regardless of the size of your company, the work that you’re doing was instantly discoverable by potential partners and investors.

The first step in such an effort would be to make drug project information accessible to search engines. Two years ago, Google (along with Yahoo and Bing) announced support for a new site-metadata standard based on JSON-LD at the Google I/O developers conference. This format makes it possible for companies to describe themselves and their products (albeit in fairly generic terms), and Google, Yahoo and Bing display this information in a summary to the right of your search results.
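As a minimal sketch, the kind of JSON-LD block involved looks like this. The vocabulary is real; the company details are made up, and pipeline-specific fields are exactly what a new standard would have to add:

```html
<script type="application/ld+json">
{
  "@context": "",
  "@type": "Organization",
  "name": "Example Biotech, Inc.",
  "url": "",
  "description": "Early-stage discovery company focused on pancreatic cancer."
}
</script>
```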

Recently, we proposed a pre-competitive collaborative project with the Pistoia Alliance (an industry-wide organization with representatives from numerous pharmaceutical companies) to define a new standard for representing pharmaceutical project information.

Our goal is to create a level playing field that ultimately helps the members of the pharmaceutical R&D ecosystem (academic labs, biotech companies, research foundations, and pharmaceutical companies) identify new collaborative opportunities and answer the following types of questions:

  • Which organizations currently have drug programs for indication X?
  • Which organizations are currently working on complementary drug programs in pathway Y?
  • Which organizations have a drug program that targets gene Z?
  • I have a drug program for indication X, the target also plays a role in indication Y.  Who has expertise in that area that I can leverage?
  • Which potential partners are best suited for my drug program?
  • Who do I contact at company X about my cancer drug program?
  • Who is currently conducting clinical trials for indication X?

To learn more about this project and how you can help, please join the conversation at the Pistoia Alliance.

Posted in Uncategorized

Parsing Data Files With BioGroovy

In a previous post we showed how you can use BioGroovy to download a data file from a webservice.  In this post, we’ll show you how to parse the file that you’ve downloaded.

After handling all of the imports (shown in the first block of code), the real work starts with the EntrezGeneSlurper. We create a file reference to the datafile we downloaded earlier, then call the slurper’s read method to parse the datafile and print the result to the console.

@Grab(group='org.biogroovy', module='biogroovy', version='1.1')
import org.biogroovy.models.Gene
import  // adjust the package to match your BioGroovy version

EntrezGeneSlurper fetcher = new EntrezGeneSlurper();
File file = new File("/path/to/file.gene.xml");
Gene gene = FileInputStream(file));
println gene;

Posted in Uncategorized

Fetching Data Files With BioGroovy

It’s often useful to fetch data from a web service and cache it in your local directory. This cuts down on the amount of traffic between your machine and the web service you’re invoking, and it makes unit testing your code easier. In this example, we’ll fetch data for 3 genes and write it out as XML files.

 @Grab(group='org.biogroovy', module='biogroovy', version='1.1')
 import  // adjust the package to match your BioGroovy version

 EntrezGeneSlurper slurper = new EntrezGeneSlurper();
 File dir = new File(System.getProperty('user.home'), 'biogroovy_test_files');
 dir.mkdirs();  // make sure the cache directory exists
 slurper.fetchAsFile(dir, "675,1034,133");

 // double-check that the files were written out
 dir.list().each { println it }
Posted in Bioinformatics, Informatics

Groovy vs Python

At a recent Informatics Lunch meeting, surrounded by a host of dedicated Python users, I confessed to being somewhat ignorant of the business drivers behind using Python in a bioinformatics setting. Although I have used Python on occasion, it was usually to solve a customer’s problem, and it’s not my first choice when reaching into the toolbox.  My feeling is that Python became the de facto standard for bioinformatics scripting more as a historical fluke than as a result of duking it out with other languages for the top of the technology pile.

Oddly enough, much of that conversation centered on Python’s shortcomings rather than its strengths.  These included the Python 2 vs 3 backwards-incompatibility problem, and the fact that compiling Python to native code is tricky and requires you to keep platform-specific differences in mind. Luckily, neither of these problems has been inflicted on Java or Groovy users.

Some of the typical use cases cited for using Python include:

It’s scriptable

Groovy has always been scriptable. The Groovy console lets you compose and execute scripts, and Groovy code can also be compiled to Java byte-code and run on any JVM. With the release of Java 9, Java itself is also scriptable, via the JShell console.

I can read data files with it

The Apache POI library provides support for reading and writing Microsoft Office files like PowerPoint, Word and Excel. Here’s a simple example that shows you how to iterate through the cells in an Excel spreadsheet.

// read the Excel file
Workbook wb = WorkbookFactory.create(new File("MyExcel.xls"));

// get the first sheet
Sheet sheet = wb.getSheetAt(0);

// decide which rows to process
int rowStart = Math.min(15, sheet.getFirstRowNum());
int rowEnd = Math.max(1400, sheet.getLastRowNum());

for (int rowNum = rowStart; rowNum < rowEnd; rowNum++) {
    Row r = sheet.getRow(rowNum);
    if (r == null) {
        // this whole row is empty; handle it as needed, then move on
        continue;
    }

    int lastColumn = Math.max(r.getLastCellNum(), MY_MINIMUM_COLUMN_COUNT);

    for (int cn = 0; cn < lastColumn; cn++) {
        Cell c = r.getCell(cn, Row.RETURN_BLANK_AS_NULL);
        if (c == null) {
            // the spreadsheet is empty in this cell
        } else {
            // do something useful with the cell's contents
        }
    }
}
I can query databases with it

Groovy can make use of any JDBC data source, NoSQL database, graph database, or cloud database.  Here’s a simple example that shows you how to execute a JDBC query and iterate through the results.

import groovy.sql.Sql

// create a database connection to an in-memory hsql database
def db = [url:'jdbc:hsqldb:mem:testDB', user:'sa', password:'', driver:'org.hsqldb.jdbc.JDBCDriver']
def sql = Sql.newInstance(db.url, db.user, db.password, db.driver)

// query the table called 'project', starting at row 2 and returning at most 2 rows
sql.eachRow('select * from PROJECT', 2, 2) { row ->
    println "${} ($row.url)"
}

I can do text mining with it

The WEKA machine-learning library can readily be applied to text mining.  Since the library is Java-based, we can easily add it to a project and call it from a script.  You’ll find plenty of examples of how to use WEKA here.

I can graph data with it

The Java UI library JavaFX includes a variety of charting components, including pie charts, line charts, area charts, and more. The subject is a little beyond the scope of this post, but Oracle provides some great tutorials on the subject here.  And here’s an example from StackOverflow that shows how to render a chart to a PNG image without displaying it.


These are just some of the use cases that were cited during the meeting.  So I’d like to throw it out to the audience — if you have additional use cases for Python, I’d love to hear about them in the comments section.



Posted in Uncategorized