Google I/O, Schemas, and the Network of BioThings

Last week’s Google I/O conference came with a slew of announcements and videos, but perhaps the most interesting idea wasn’t accompanied by a flashy announcement.  It did come with some intriguing videos, though, the first of which is shown below.

What I find particularly intriguing about these videos is that Google is using JSON-LD (the linked-data form of JSON) to make semantically rich information more easily searchable and visualizable.  Google isn’t the only one doing this, though; Microsoft’s Bing and Yahoo are using it as well.
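To make that concrete, here’s a minimal sketch of what JSON-LD embedded in a web page looks like. The organization is a placeholder, but the @context/@type pattern is what the search engines actually look for.

```html
<!-- JSON-LD is embedded in an ordinary script tag; crawlers read it,
     browsers ignore it. The organization below is a placeholder. -->
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Organization",
  "name": "Example Pharma, Inc.",
  "url": "http://www.example-pharma.com"
}
</script>
```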

Pharmaceutical companies aren’t traditionally known for sharing information. One area where they do share, however, is drug pipeline data, which they routinely publish.  Why?  There are a number of reasons:

  • Investors need guidance about a company’s performance, and the pipeline is one of the things analysts use to provide it.  They look at the “mix” of drugs in the pipeline, what stage of development each is in, the indications they’re designed to treat, etc. Those factors all affect a company’s risk exposure, and managing risk is what it’s all about for investors.
  • Molecular diagnostics (MDx) companies are playing an increasingly important role in identifying patients for clinical trials and, after approval, in identifying the patients most likely to benefit from a treatment.  They, too, publish information about their products: the mutations and expression fold-changes they can detect, and so on.  If you’re a pharma, you want an easy way to connect the dots between the programs you’re working on and the diagnostics that MDx companies are developing.
  • The pharmaceutical industry is becoming increasingly fragmented, and it’s important for pharmaceutical companies, biotechs, and virtual pharmas to keep tabs on one another in order to identify collaboration and in-licensing opportunities. If you look at the pipeline pages of the major pharmaceutical companies, you’ll see that by and large the information they present is the same (“we’re working on these targets and indications, we have these programs in Phase III, we’ve attrited these programs, etc.”).  The way that information is presented, however, varies widely.

And that’s the crux of the problem.  Pipeline data comes in too many formats, is visualized in a variety of ways, and can’t be queried.  You can’t simply Google “who’s working on KRAS” and get a cogent answer. But suppose it didn’t have to be that way?

A common data format
Suppose you could publish the information in a standard data format that search engines like Google, Yahoo, and Bing could readily consume.  By virtue of being in a consistent format, the data could easily be queried across all sites.  Regardless of how Pfizer, Merck, GSK, J&J, or Lilly had formatted their websites, the pipeline information would be easily searchable.  And not just Big Pharma: biotechs and virtual pharmas would be able to publish their information in the same format, making it easier to find the collaborators they need to keep their pipelines full.  A rough sketch of what that markup might look like appears below.
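Here’s a hedged sketch of a single pipeline entry in JSON-LD. Schema.org really does define a Drug type, but the pipeline-specific terms (the target and development phase) don’t exist there, so they’re namespaced under a made-up extension vocabulary.

```html
<!-- A hypothetical JSON-LD pipeline entry. Schema.org's Drug type is
     real; the "ex:" terms are invented for illustration. -->
<script type="application/ld+json">
{
  "@context": {
    "@vocab": "http://schema.org/",
    "ex": "http://example.org/pipeline-vocab#"
  },
  "@type": "Drug",
  "name": "EX-1234",
  "description": "Small-molecule KRAS inhibitor for non-small cell lung cancer",
  "ex:target": "KRAS",
  "ex:developmentPhase": "Phase II"
}
</script>
```

With markup like this on every company’s pipeline page, “who’s working on KRAS” becomes a structured query rather than a guessing game.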

Visualizing the pipeline
The other technology that did get a lot of airplay at Google I/O was Polymer, an open-source project for creating styleable, bindable JavaScript- and HTML5-based components that can easily be added to a web page.  The example below shows a demo of a visual editor that lets you build an app and bind it to web services.

This provides the second half of the solution.  Once you have a common data format, you need a way to render the information that’s useful to the people consuming it.  If you’re a web developer at one of the Big Pharmas, you want to render that content while preserving your corporate standards.  This is where Polymer comes in: it lets you style components to match those standards using CSS, as the sketch below illustrates.
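As a sketch of that idea (the <pipeline-chart> element is hypothetical), a corporate site could restyle a shared component with nothing more than ordinary CSS:

```html
<!-- Hypothetical: restyling a shared <pipeline-chart> element to match
     corporate branding. Styles on the host element (and inherited
     properties like fonts) apply without touching the component itself. -->
<style>
  pipeline-chart {
    font-family: "Corporate Sans", sans-serif; /* assumed brand font */
    border: 1px solid #00539b;                 /* assumed corporate blue */
    max-width: 640px;
  }
</style>
<pipeline-chart src="/pipeline.jsonld"></pipeline-chart>
```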

One really powerful idea, buried deep in one of the Polymer videos, was the prospect of serving Polymer components from multiple servers.  If you wanted to show people the conformation of your target protein, you could use a Polymer-based PDB widget (served up from the PDB’s own servers) to render it. You could display the genomic sequence of your target, along with common mutations, using a component from NCBI’s ClinVar server.  You might highlight your targets in the context of a pathway diagram component served up from WikiPathways.
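Under the HTML Imports mechanism Polymer used at the time, composing a page from components hosted by different organizations might look like the sketch below. To be clear, every element name, URL, and identifier here is hypothetical; none of these organizations actually serves such components today.

```html
<!-- Hypothetical: a target page assembled from third-party components.
     The import URLs, element names, and IDs are all placeholders. -->
<link rel="import" href="http://www.rcsb.org/elements/pdb-structure.html">
<link rel="import" href="http://www.ncbi.nlm.nih.gov/elements/clinvar-track.html">
<link rel="import" href="http://www.wikipathways.org/elements/pathway-diagram.html">

<pdb-structure pdb-id="1ABC"></pdb-structure>
<clinvar-track gene="KRAS"></clinvar-track>
<pathway-diagram pathway-id="WP0000"></pathway-diagram>
```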

This isn’t a new idea.  It’s been floating around the web and semantic web communities for a while in various guises, sometimes referred to as lenses, widgets, or shards.  Whatever you call them, the idea is that you should be able to compose, style, and interact with these components in a way that suits your corporate standards and the needs of your users.

What’s missing?
So what’s needed to make this happen? This is where the Network of BioThings comes into the picture. The organizations responsible for serving up the bits and bobs (both the content, in JSON-LD format, and the visual components to render it) need to step up.  If you have a vested interest (as the pharmaceutical industry does) and an organization dedicated to creating tools that serve the pre-competitive needs of the industry (like the Pistoia Alliance), this kind of effort should be easy enough to agree upon.

From the academic perspective, what’s needed is a similar effort to provide data and components for genes, proteins, pathways, and so on.  At first glance this might seem like a daunting exercise in herding cats (both industrial and academic cats), but this type of data integration effort is not without precedent.  In fact, BioGPS already has much of the information needed to make this happen; we just need to be able to request the data in JSON-LD format.

Currently, the way you refer to an object in JSON-LD must be defined in Schema.org for Google and the other search engines to understand it. For example, if you want to refer to a protein, the protein’s metadata (its UniProt ID, its domains, etc.) must be defined there.  A quick search of Schema.org for a “protein” term nets you “Prion”, “Mass”, and “NutritionInformation”.  In other words, nothing useful.

That’s a real problem, because this information is already defined in NCBO’s BioPortal, and duplicating and respecifying it elsewhere seems like a complete waste of time.  Moreover, duplicating data from a “system of record” into a secondary system increases the chances that the copy will grow out of date.  You have only to look at the gene and protein information in Freebase to see what I mean.  Hopefully there will someday be a way to point Schema.org at another ontology server, but for the time being that doesn’t seem to be the case.
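For what it’s worth, JSON-LD itself imposes no such restriction: a document’s @context can point at any vocabulary. The sketch below describes a protein using UniProt’s core ontology (the IRIs shown are real UniProt identifiers); the markup is perfectly valid JSON-LD, it’s just invisible to the search engines.

```html
<!-- Valid JSON-LD whose context points at the UniProt core ontology
     instead of Schema.org. P01116 is the UniProt accession for human
     KRAS. Search engines currently ignore non-Schema.org vocabularies. -->
<script type="application/ld+json">
{
  "@context": { "up": "http://purl.uniprot.org/core/" },
  "@id": "http://purl.uniprot.org/uniprot/P01116",
  "@type": "up:Protein"
}
</script>
```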
