Network of BioThings: Making PDFs Useful

Recently, Andrew Su of Scripps posted a series of blog entries on the Network of BioThings (NoB), especially the effort to uncover use cases that could aid medical research. In a nutshell, “The Network of BioThings aims to structure the biological knowledge found in biomedical research articles by comprehensively annotating BioThings (genes, proteins, mutations, diseases, drugs, etc.) in a timely fashion (one week after publication). This would allow for focused browsing as well as large-scale data mining.”[1]

Perhaps the most interesting use case involves using these annotations to aid in literature-based research.  If you’ve ever found yourself researching a drug target, a disease or a pathway using current tools, you know how ungainly and time-consuming the process can be.  Let’s take a look:


Start Here
Let’s say that you’re researching a disease.  You want to gain a better understanding of the disease, the processes/pathways driving the progression of the disease, and the targets that play a role in those pathways.

To familiarize yourself with the clinical presentation and treatment of the disease, a tool like the NCI’s PDQ can give you a cognitive framework for your research.

To get a research perspective on the disease, you could start with OMIM, find a list of genes/proteins involved in the disease, get the GeneRIFs (Gene References Into Function) and UniProt references, and use these papers as a starting point for your research.

To get a perspective on the latest findings, you can search PubMed for review articles on the subject.

Whether your starting point is OMIM or PubMed, the result is the same: a pile of papers, and that’s where the fun begins.  Perhaps you’ve bookmarked the papers in Pocket or Mendeley, or added them to Paperpile/Google Drive so that you could read them later on your tablet.  Your first task is to sift through the papers and find the most meaningful results.

Crowdsourcing Annotation
As you’re reading, you find an interesting assertion in a paper.  If you had printed the paper, at this point you’d be reaching for a highlighter.  But suppose you could highlight that assertion in the PDF, and the software would identify similar assertions made throughout the paper.  This could let you quickly navigate to the experimental results that back up that assertion.  The software could also parse the highlighted text and identify specific words as genes, proteins, pathways, and other concepts of interest.
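One simple way to picture the "identify specific words" step is a dictionary lookup over the highlighted text. The sketch below is only an illustration under that assumption: the `LEXICON` contents, the function name `tag_bio_things`, and the example sentence are all made up for this post, and a real tool would draw its terms from curated resources and use something more robust than exact matching.

```python
import re

# Hypothetical mini-lexicon for illustration only; a real tool would load
# its terms from curated gene, drug, and disease vocabularies.
LEXICON = {
    "BRCA1": "gene",
    "TP53": "gene",
    "olaparib": "drug",
    "breast cancer": "disease",
}


def tag_bio_things(text, lexicon=LEXICON):
    """Return sorted (start, end, matched_text, type) tuples for lexicon terms in text."""
    hits = []
    for term, kind in lexicon.items():
        # Whole-word, case-insensitive match of the literal term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), match.group(0), kind))
    return sorted(hits)


# Illustrative highlighted sentence.
highlight = "Loss of BRCA1 sensitizes breast cancer cells to olaparib."
for start, end, word, kind in tag_bio_things(highlight):
    print(f"{start}-{end}: {word} ({kind})")
```

The character offsets are kept so that a reader application could draw the tags on top of the highlight rather than just listing them.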

Suppose that you could see not only your own highlights, but also the publicly accessible highlights and notes made by previous researchers.  This might direct your attention to a particular part of the paper, or perhaps to a flaw in the paper’s methodology that you hadn’t seen before.

You come to an interesting place in the paper, and there’s a citation supporting the authors’ claim.  If you’re reading a PDF, you have to scroll to the bottom of the paper, find the citation, perhaps look up the paper in PubMed, read the abstract, download the paper… OK, where was I?  What if simply clicking on the citation caused the reference to appear in a sidebar?  You could bookmark that citation to follow up later, and carry on with your reading.

As you’re reading through the paper you see genes, pathways, drugs, and medical terms that you’re unfamiliar with.  Wouldn’t it be nice if you could simply click on a gene and get immediate answers to questions like:

  • What does this gene/protein do? (Entrez Gene, UniProt, GO)
  • What does the protein look like? (PDB)
  • Are there any compounds/drugs that already target this protein or pathway? (DrugBank, Drug/Gene Interaction database)
  • What clinical trials are currently underway that target these genes, or use the drugs found in the previous step? (
  • Is this gene expressed in other tissues, which might result in off-target effects? (GEO/ArrayExpress)
  • What pathways does this gene/protein participate in? (WikiPathways, KEGG)
  • What are the typical mutations for this gene and this disease? (ClinVar/SwissVar)
  • What biomarkers are currently available?
  • Are there any kits for this gene?

What if the genes found in the paper could be automatically highlighted in a pathway diagram, so that you could see them in a biological context?

What if you could cluster the genes by GO term and protein family?  What if you could see these results across all of the papers that you’re currently reading?
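To make the clustering idea concrete, here is a minimal sketch that groups genes by annotation term across several papers. Everything in it is a placeholder: the paper IDs, gene names, and term labels are invented, the function name `cluster_genes_by_term` is hypothetical, and in practice the annotations would come from a GO annotation source rather than a hand-written dict.

```python
from collections import defaultdict

# Placeholder data: paper -> gene -> annotation terms.
# None of these names refer to real genes or real GO terms.
annotations = {
    "paper_1": {"GENE_A": ["term_x", "term_y"], "GENE_B": ["term_y"]},
    "paper_2": {"GENE_B": ["term_y"], "GENE_C": ["term_x"]},
}


def cluster_genes_by_term(annotations):
    """Group genes by annotation term, pooled across all papers."""
    clusters = defaultdict(set)
    for genes in annotations.values():
        for gene, terms in genes.items():
            for term in terms:
                clusters[term].add(gene)
    # Sort gene names so the output is stable from run to run.
    return {term: sorted(genes) for term, genes in clusters.items()}


for term, genes in sorted(cluster_genes_by_term(annotations).items()):
    print(term, genes)
```

The same grouping would work for protein families: only the labels attached to each gene change.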

What if you could easily find the papers/posters from the latest ASCO/AACR meeting where these targets or this disease were recently discussed?

What if you could cluster the papers by author to find the expert in a particular area?

What if you could look at the Methods section of the paper and see links to the standard protocol (or perhaps the variation of that protocol used by the author)? There is nothing more frustrating than wasting your time trying to recapitulate results derived from an incomplete or invalid protocol.

Summarizing Results
OK.  So you’ve done an exhaustive literature review.  You’ve gone through the papers and the references that they cite.  Your desk is littered with highlighted papers.  Where’s that paper that made that interesting claim?  What if the PubMed metadata (authors, keywords, abstract, etc.), not to mention the gene/protein metadata, were embedded in the PDF? You could find that paper with a simple desktop search using Spotlight (in Mac OS X) or File Explorer (in Windows).
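As a rough picture of what such a search does once the metadata is available, the sketch below keeps one metadata record per paper and returns the files whose record contains every word of a query. It does not embed anything in a PDF (that would need a PDF library), and the `PaperRecord` fields, the `find_papers` function, and the sample library are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class PaperRecord:
    """Hypothetical per-paper metadata, as it might be read back from a PDF."""
    path: str
    title: str
    authors: list = field(default_factory=list)
    keywords: list = field(default_factory=list)
    genes: list = field(default_factory=list)

    def searchable_text(self):
        # Flatten every field into one lowercase string for matching.
        parts = [self.title, *self.authors, *self.keywords, *self.genes]
        return " ".join(parts).lower()


def find_papers(records, query):
    """Return paths of records whose metadata contains every word in the query."""
    words = query.lower().split()
    return [r.path for r in records
            if all(w in r.searchable_text() for w in words)]


# Placeholder library; titles, authors, and genes are invented.
library = [
    PaperRecord("papers/a.pdf", "Placeholder title one", ["Author One"], ["kinase"], ["GENE_A"]),
    PaperRecord("papers/b.pdf", "Placeholder title two", ["Author Two"], ["apoptosis"], ["GENE_B"]),
]
print(find_papers(library, "gene_a kinase"))
```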

What if all of your research were available in a mind map or network diagram, so that you could easily navigate through the pile of papers using various clusters like keywords, MeSH terms, GO terms, authors, etc.?
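The network view boils down to linking papers that share a term. Here is a minimal sketch of that step, assuming each paper already has a set of terms attached; the paper IDs, the term labels, and the function name `shared_term_edges` are placeholders, and drawing the graph is left to whatever visualization tool you prefer.

```python
from itertools import combinations

# Placeholder data: paper -> set of terms (keywords, MeSH terms, authors...).
papers = {
    "paper_1": {"term_a", "term_b"},
    "paper_2": {"term_b", "term_c"},
    "paper_3": {"term_d"},
}


def shared_term_edges(papers):
    """Return {(paper, paper): [shared terms]} for every pair with a term in common."""
    edges = {}
    for p1, p2 in combinations(sorted(papers), 2):
        common = papers[p1] & papers[p2]
        if common:
            edges[(p1, p2)] = sorted(common)
    return edges


for (p1, p2), terms in shared_term_edges(papers).items():
    print(f"{p1} -- {p2}: {', '.join(terms)}")
```

Papers with no shared terms, like `paper_3` here, simply end up as isolated nodes.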

It’s time to summarize the research in a short presentation for your boss.  Wouldn’t it be nice if you could copy and paste specific images and data into your presentation, and the citation would automatically be inserted?  And what if the image itself had PubMed metadata in it, so that if for some reason the image gets separated from the presentation and reused elsewhere, you can still find the paper where the image originated?

The Search Continues…
You’ve summarized the results, delivered the presentation, gotten approval to proceed with the project. Three months into the project, a colleague hands you a paper. You’ve just gotten blindsided by a recent development!

Suppose that you could use the metadata that you embedded in the PDFs to create a literature surveillance bot?  You could use the keywords and MeSH terms to find similar papers.  You could monitor the RSS feeds for the journals where the papers appeared.  If anything came up, you’d be automatically notified.
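The core of such a bot is small: parse a journal’s RSS feed and flag items whose titles mention one of your watch terms. The sketch below shows only that matching step, using Python’s standard `xml.etree.ElementTree` on an inline sample feed. The feed contents, the example.org links, the watch terms, and the function name `matching_items` are all made up; fetching the feed on a schedule and sending the notification are left out.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 sample; a real bot would download this from a journal feed.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example Journal</title>
  <item>
    <title>A new inhibitor of GENE_A in disease X</title>
    <link>https://example.org/articles/1</link>
  </item>
  <item>
    <title>Unrelated methods paper</title>
    <link>https://example.org/articles/2</link>
  </item>
</channel></rss>"""


def matching_items(feed_xml, watch_terms):
    """Return feed items whose title contains at least one watch term."""
    root = ET.fromstring(feed_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        # Case-insensitive substring match against each watch term.
        matched = [t for t in watch_terms if t.lower() in title.lower()]
        if matched:
            hits.append({"title": title, "link": link, "matched": matched})
    return hits


# Hypothetical watch list, e.g. keywords pulled from your embedded metadata.
for hit in matching_items(SAMPLE_FEED, ["GENE_A", "disease X"]):
    print(hit["link"], hit["matched"])
```

Matching on titles alone is crude; matching against abstracts or MeSH terms, where the feed provides them, would cut down on misses.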

The PDF IS the Network of BioThings
The underlying problem, though, is that the PDF file format is ill-suited for scientific research.  While it can keep all of the digital assets (the images, diagrams, and tables) together in a single downloadable file, it lacks, as we’ve seen in this article, the semantic richness that drives scientific innovation, and that makes literature-based research more time-consuming than it should be.  A means of semantically annotating papers would make literature-based research far easier to perform, and would turn the PDF into the nexus of the Network of BioThings.



About aspenbio

I write software for scientists. I'm interested in Java/Groovy/Grails, the Semantic Web and Cancer Biology.

