There is a belief among certain peoples of the world that knowing the true name of a thing gives one some degree of power over it.
One of the challenges for anyone writing software in a scientific setting is figuring out the true (and sometimes unintentionally hidden) names of things.
Imagine for a moment that you’re rewriting an old piece of software. One of the first tasks is to familiarize yourself with the code. You come across a string called id in a class called Protein. Is this the UniProt accession for the protein, or the GenPept ID, or perhaps some internally generated ID? If the original developer is still around, you might be able to ask them what they meant. Otherwise, you might have to look at the data to determine what it is, or trace through the code to see how the attribute is used. If you’re really lucky, the previous developer provided good documentation for the code. However, this is not always the case.
It gets a little more complex when you’re dealing with numeric values and you find yourself wondering, “Is that ‘percent inhibition’ field the value for a single experiment, or the average of a set of runs?”
It can be even more confusing if a commonly used word has an institution-specific or user-specific meaning. The word proteome, for example, is typically taken to mean the set of all proteins from an organism of a particular species. But in an institutional setting it can sometimes mean a particular subset of proteins being studied. Or when biologists talk about a gene name, they usually mean something like “KRAS”, which a bioinformaticist would call a “gene symbol”; the gene name proper is “Kirsten rat sarcoma viral oncogene homolog”.
So, how do you ensure that the name of a thing is universally understood? What if there were a way to annotate an application to make the metadata (specifically the names of things) more apparent?
Getting The Names Of Things
To start with, we need a universally accessible and recognizable place that contains the names of things — also known as an ontology. Recently, Peter Rice’s Lab at EBI created the EMBRACE Data and Methods (EDAM) ontology for bioinformatics. The goal is to create an ontology for commonly used terms in bioinformatics. You can read more about the effort here.
Let’s take a look at a sample entry from the ontology and see what we can do with it:
What does this tell us?
Firstly, it tells us that there’s an entity called a UniProt accession. It has a URI that uniquely identifies it within the ontology. It has a plain-English definition of what that entity is. It gives us an example of what an accession looks like, and it provides us with a regular expression that we can use to validate one.
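To make the regular expression part concrete, here is a minimal Python sketch of accession validation. Note the assumptions: the pattern below is the one published in the UniProt help pages, which may not be character-for-character the one recorded in the EDAM entry, and the names UNIPROT_ACCESSION and is_uniprot_accession are made up for this illustration.

```python
import re

# Assumption: this is the accession pattern documented by UniProt itself;
# the regular expression stored in the EDAM entry may differ.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(value: str) -> bool:
    """Return True if the whole string has the shape of a UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("P01116", "A0A022YWF9", "KRAS", "p01116"):
        print(candidate, is_uniprot_accession(candidate))
```

A check like this only tells you that a string is shaped like an accession, not that the accession actually exists in UniProt.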
How can we use it?
Most applications are written with at least three tiers: a database tier, a middleware tier, and a user interface tier. An ontology like EDAM can be used to annotate all three to ensure some level of consistency across your application. So if someone needs to reuse your web services, they can be reasonably sure of what they’re getting when making a RESTful call. Moreover, regardless of your coding standards and the brevity of your variable names, you can use the ontology to provide users with consistent names for things.
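As a rough illustration of the idea (not the approach the later posts necessarily take), here is a Python sketch that attaches an ontology reference to a tersely named attribute and uses it to produce consistent labels. Everything in it is hypothetical: the Protein fields, the describe helper, and especially the URI, which is a placeholder to be replaced with the identifier from the EDAM release you actually use.

```python
from dataclasses import dataclass, field, fields

# Placeholder only: substitute the real term URI from the ontology.
UNIPROT_ACCESSION_URI = "http://example.org/edam/uniprot-accession"


@dataclass
class Protein:
    # Terse internal names, each annotated with what the value really is.
    id: str = field(
        metadata={"label": "UniProt accession", "term_uri": UNIPROT_ACCESSION_URI}
    )
    pct_inh: float = field(
        metadata={"label": "Percent inhibition (mean of runs)", "term_uri": None}
    )


def describe(record) -> dict:
    """Map each field's human-readable label to its value and ontology URI."""
    return {
        f.metadata["label"]: {
            "value": getattr(record, f.name),
            "term_uri": f.metadata.get("term_uri"),
        }
        for f in fields(record)
    }


if __name__ == "__main__":
    print(describe(Protein(id="P01116", pct_inh=42.5)))
```

The variable can stay as short as id in the code, while a web service response or a user interface built from describe shows the unambiguous name and points at its definition.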
In the posts that follow, I’ll take a look at the challenges presented by each of the tiers and show you some examples of how to easily integrate ontology references into each tier of your application.
Special thanks to Jon Ison and Matúš Kalaš, creators of the EDAM ontology, for patiently answering all my questions.