Reproducibility and Provenance

A Taverna Workflow

One of the key problems with reproducibility in scientific research is understanding and preserving the provenance of the data, and any digital methods used to turn raw data into actionable information.

The following paper recently appeared in PLoS Computational Biology [Ten Simple Rules for the Care and Feeding of Scientific Data], along with some new tools to help preserve that provenance. A separate paper describes how provenance was used to assess the tuberculosis drugome.


To realistically preserve the provenance of data and workflows, though, you need tools capable of dealing with provenance. You need some means of identifying what a piece of data is (relative to some well-known and accepted model of reality, aka an ontology) and where that data came from.

You also need some means of making apparent how that data was refined prior to publication. For a piece of novel code, that means you need to use a source-code repository and reference the class or script where all the magic happens, so that others can review the work and verify that it does what you say it does. You should also unit test that code, so that if someone runs it with a different data set and ends up with some wacky results, they (or you) will be able to quickly determine why that’s happening. The unit tests also let you easily retest your code against a variety of datasets to determine how well it works.
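As a rough illustration of that last point, here is a minimal sketch of a unit test that runs one analysis step against several small datasets. The function gc_content and the datasets are hypothetical stand-ins chosen for this example, not code from any of the tools or papers mentioned here.

```python
import unittest


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    Hypothetical analysis step, used only to give the tests something to check.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    return sum(base in "GC" for base in sequence) / len(sequence)


class GcContentTest(unittest.TestCase):
    # Each entry is (dataset label, input sequence, expected result).
    DATASETS = [
        ("all GC", "GGCC", 1.0),
        ("no GC", "ATAT", 0.0),
        ("mixed case", "atGC", 0.5),
    ]

    def test_known_datasets(self):
        # subTest reports each dataset separately, so a failure names
        # the dataset that produced the unexpected result.
        for label, sequence, expected in self.DATASETS:
            with self.subTest(dataset=label):
                self.assertAlmostEqual(gc_content(sequence), expected)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved as a module, this can be run with python -m unittest. Retesting against a new dataset then only means adding a row to DATASETS.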

But writing code isn’t necessarily an integral part of every researcher’s toolbox, nor should it be. Over the past 10 years, open-source workflow tools like Taverna, Galaxy, and KNIME have emerged to help researchers analyze data and share both the analytical methods and the data with their colleagues.

Of these tools, though, only Taverna provides support for semantic provenance. (Full disclosure: I worked on Taverna back in the day.) What does “support for provenance” entail for the user? Here’s a brief summary from the Taverna website. [Taverna workflow suite…]

In Taverna, each workflow input, output, and service (either a web service or a local service) has editable provenance information associated with it.  You can specify that a particular String input is a SwissProt accession, you can specify the version of the dataset that was used, and each time you run your workflow, provenance information is generated and saved.
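To make the idea concrete, here is a minimal sketch of what such a record might hold. It is not Taverna's actual provenance model or export format. The class names, field names, workflow name, accession value, and version string are all assumptions made up for illustration.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
import json


@dataclass
class InputAnnotation:
    """Says what a workflow input is and where it came from (hypothetical)."""
    name: str
    semantic_type: str    # a label standing in for an ontology term
    source: str
    dataset_version: str


def _utc_now():
    return datetime.now(timezone.utc).isoformat()


@dataclass
class RunRecord:
    """One provenance record, created each time the workflow runs (hypothetical)."""
    workflow: str
    inputs: list
    values: dict
    started_at: str = field(default_factory=_utc_now)


def record_run(workflow, annotations, values):
    """Build a run record, refusing to run with unannotated gaps in the inputs."""
    missing = [a.name for a in annotations if a.name not in values]
    if missing:
        raise ValueError(f"no value supplied for inputs: {missing}")
    return RunRecord(workflow=workflow, inputs=list(annotations), values=dict(values))


# Placeholder annotation: the version string is not a real release number.
accession = InputAnnotation(
    name="protein_id",
    semantic_type="SwissProt accession",
    source="SwissProt",
    dataset_version="example-release",
)

# Placeholder workflow name and accession value.
run = record_run("example_workflow", [accession], {"protein_id": "P12345"})

# Serialize the record so it can be saved alongside the results.
print(json.dumps(asdict(run), indent=2))
```

The point of the sketch is that the annotation lives with the input rather than in someone's lab notebook, and that every run leaves a timestamped record behind.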

In addition, Taverna users can publish their workflows to myExperiment (think of it as git for workflows). myExperiment lets others run your workflows, and helps you connect with the researchers who created a workflow. You can also add workflows from myExperiment as subworkflows, thus making the most of reusability.

Taverna also makes use of BioCatalogue’s growing semantic catalogue of web services.

Open Data Access with Google Docs
In the past, I’ve seen repeated recommendations to use Git and figshare to share data and code. But one of the simplest approaches to sharing data involves using Google Docs. The collaborators on a paper can share a common folder on Google Drive. They can edit the text of the paper and the spreadsheets containing the data, and the changes are tracked so that it becomes obvious where contributions to the effort came from. The data can then be made available in a read-only manner so that a researcher attempting to reproduce the results will be able to do so.


About aspenbio

I write software for scientists. I'm interested in Java/Groovy/Grails, the Semantic Web and Cancer Biology.
This entry was posted in Bioinformatics, Informatics, Science Blogging.
