One of the key challenges for bioinformaticians today is extracting meaningful findings from abstracts and full-text articles. Text mining solutions can only take us so far, and previous comparative studies have shown that Amazon Mechanical Turk solutions and automated text mining solutions have similar levels of success when it comes to identifying terms in text and attaching some kind of semantic meaning to them.
This led me to wonder if it might be possible for the submitters of papers to use simple HTML5 markup to highlight key findings and terms in an abstract in a semantically meaningful way. We’d also want the terms to be linkable to external databases, so that people unfamiliar with a particular gene or pathway could easily view more information. Let’s take a look at a snippet of text from an abstract and see what we can do with it. You can see the original abstract here.
In the first sentence we see a reference to “Pancreatic ductal adenocarcinoma” and we want to turn that into a reference to the OMIM (Online Mendelian Inheritance In Man) database.
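As a rough sketch, the marked-up term could be generated like this. The `disease` tag name comes from the discussion here; the `tagTerm` helper, the `data-id` attribute name, and the `OMIM:` prefix are my own assumptions for illustration, and the ID is a placeholder rather than the real OMIM entry.

```javascript
// Hypothetical helper: escape characters that are special in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: wrap a term in a semantic tag that carries a database ID.
// The data-id attribute name is an assumption, not an established convention.
function tagTerm(tagName, dbId, term) {
  return `<${tagName} data-id="${escapeHtml(dbId)}">${escapeHtml(term)}</${tagName}>`;
}

// Placeholder ID: substitute the actual OMIM entry for pancreatic cancer.
const OMIM_ID = "OMIM:000000";

const markedUp = tagTerm("disease", OMIM_ID, "Pancreatic ductal adenocarcinoma");
console.log(markedUp);
```

Keeping the database name in the ID prefix means one attribute can point at different databases without needing a separate attribute for each.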
Here we’ve simply added a disease tag and given it the OMIM ID for pancreatic cancer. Ideally, though, it would be nice to have a contextual menu pop up whenever you click on the term, and have it search OMIM, Wikipedia, and other sites for the text surrounded by the disease tags. (I leave that as an exercise for the student.)
Typically, semantic web applications make use of RDF triples, where an assertion is expressed in terms of a subject, predicate, and object (the three bits of the triple). However, it’s often difficult to parse a sentence and turn it into something so simple. Let’s take a look at an example.
Yap functioned as a critical transcriptional switch downstream of the oncogenic KRAS-mitogen-activated protein kinase (MAPK) pathway, promoting the expression of genes encoding secretory factors that cumulatively sustained neoplastic proliferation, a tumorigenic stromal response in the tumor microenvironment, and PDAC progression in Kras and Kras:Trp53 mutant pancreas tissue.
We can tell that Yap is the subject, but what’s the object? It’s a bit harder to tease out, and you don’t want to cut off a critical adjective. This sentence actually contains multiple assertions, which is part of what makes automated text mining difficult when dealing with scientific text.
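To make the "multiple assertions" point concrete, here is one possible hand decomposition of that sentence into triples, written as plain JavaScript objects. The predicate wording is my own and does not come from any standard vocabulary; a different reader could reasonably split the sentence differently.

```javascript
// One illustrative reading of the Yap sentence as subject/predicate/object triples.
// Predicate strings are informal labels, not terms from an ontology.
const triples = [
  { subject: "Yap", predicate: "acts downstream of", object: "oncogenic KRAS-MAPK pathway" },
  { subject: "Yap", predicate: "promotes expression of", object: "genes encoding secretory factors" },
  { subject: "secretory factors", predicate: "sustain", object: "neoplastic proliferation" },
  { subject: "secretory factors", predicate: "sustain", object: "tumorigenic stromal response" },
  { subject: "secretory factors", predicate: "sustain", object: "PDAC progression" },
];

// Collect every object asserted for a given subject.
function objectsFor(subject) {
  return triples.filter((t) => t.subject === subject).map((t) => t.object);
}

console.log(objectsFor("Yap"));
```

Even this simplified version loses detail: the qualifier "in Kras and Kras:Trp53 mutant pancreas tissue" applies to several of the assertions and has no natural home in a single triple.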
Styling the text
Now that we’ve tagged most of the important bits in the abstract, we want the reader to be able to grasp the semantics at a glance. A little CSS does the trick. Here’s an example of how to style the gene tag:
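A minimal sketch of such a rule, built and injected from JavaScript so it can sit alongside the interactivity code. The `cssRule` helper is hypothetical, and the colors and border are arbitrary choices of mine, not a recommended palette.

```javascript
// Hypothetical helper: turn a selector and a declarations object into a CSS rule string.
function cssRule(selector, declarations) {
  const body = Object.entries(declarations)
    .map(([prop, value]) => `  ${prop}: ${value};`)
    .join("\n");
  return `${selector} {\n${body}\n}`;
}

// Arbitrary styling for the gene tag: light highlight, dotted underline, pointer cursor.
const geneRule = cssRule("gene", {
  "background-color": "#e0f0ff",
  "border-bottom": "1px dotted #336699",
  "cursor": "pointer",
});

if (typeof document !== "undefined") {
  // In a browser, attach the rule to the page.
  const style = document.createElement("style");
  style.textContent = geneRule;
  document.head.appendChild(style);
} else {
  // Outside a browser, just print the rule.
  console.log(geneRule);
}
```

The same helper can produce rules for the other tags, for example a different background color for diseases and pathways.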
We want the user to be able to click on any of the marked up areas of the abstract, and navigate to an appropriate external resource. For a gene, we want to go to the Entrez Gene record; for a pathway, we want to go to KEGG, etc. Here’s an example for adding interactivity to a gene tag.
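One way this could work is a lookup from tag name to a URL builder, plus a single click listener on the page. This is a sketch under assumptions: the `data-id` attribute, the `RESOURCE_URLS` table, and the `resourceUrl` function are names I made up, and the URL patterns reflect how NCBI Gene, KEGG, and OMIM entry pages are commonly addressed, so check them against each site before relying on them.

```javascript
// Assumed URL patterns for each tag type; verify against the target sites.
const RESOURCE_URLS = {
  gene: (id) => `https://www.ncbi.nlm.nih.gov/gene/${encodeURIComponent(id)}`,
  pathway: (id) => `https://www.kegg.jp/entry/${encodeURIComponent(id)}`,
  disease: (id) => `https://omim.org/entry/${encodeURIComponent(id)}`,
};

// Return the external URL for a tag, or null when no database is known for it.
function resourceUrl(tagName, id) {
  const key = tagName.toLowerCase();
  if (!id || !Object.prototype.hasOwnProperty.call(RESOURCE_URLS, key)) {
    return null;
  }
  return RESOURCE_URLS[key](id);
}

if (typeof document !== "undefined") {
  // One delegated listener handles clicks on any marked-up term.
  document.addEventListener("click", (event) => {
    const el = event.target.closest("gene, pathway, disease");
    if (!el) return;
    const url = resourceUrl(el.tagName, el.getAttribute("data-id"));
    if (url) window.open(url, "_blank");
  });
}

// Example: 3845 is the Entrez Gene ID for human KRAS.
console.log(resourceUrl("gene", "3845"));
```

Returning `null` for an unknown tag or a missing ID lets the page leave a term styled but not clickable when there is nothing to link to.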
The screenshot below gives you an idea of what the results look like. As you can see, it might be good to have a feature that lets you toggle the rendering of the markup on and off.
One of the challenges with a project like this is that sometimes you encounter markup that won’t be linkable to a database entry. In the example shown above, the KRAS(G12D) mutation is pretty common and is associated with pancreatic cancer. However, SwissVar does not list its association with pancreatic cancer. Similarly, the P53 mutation doesn’t exist in SwissVar at all.