There’s been a lot of traffic over the past couple of weeks about Open Access science. It started with Phil Bourne’s article on what the future of open access science might look like. Dr. Bourne is the Editor of PLoS Computational Biology and a long-time advocate of Open Access journals. His article highlights some of the tasks that one could reasonably expect from a future publisher, such as the ability to review the data behind an article and the ability to execute any workflows used to process that data. These activities are key aspects of the scientific process, and they underscore the lack of support for provenance in the current publishing model. Before you can “Stand on the Shoulders of Giants,” you have to know they are indeed giants and that the body of work upon which your research is based is solid and well-understood.
It is this lack of provenance that has motivated proponents of Semantic Web technologies to create applications like Taverna and myExperiment. Taverna gives scientists a way to determine the provenance of data and to execute workflows on that data. myExperiment provides scientists with an open repository for those workflows, which can be reviewed, rated, and reused by other scientists.
Ultimately, though, a paper is not merely a collection of data sets but a summation of the lessons derived from those data sets and a description of the methods used to arrive at a given set of conclusions. And it’s the publication piece of the puzzle that has yet to be addressed by the scientific community. The twitterverse and blogosphere were recently abuzz with the news that the University of California library system had rejected a 400% increase in the price of access to the Nature journals. Growing dissatisfaction with the traditional model in scientific circles led to the formation of the Public Library of Science journals and to the Open Access model in general.
The current publishing models that I’ve seen are:
- Traditional model – Papers are submitted to publishers, reviewed by peers, and published or rejected. The costs of this model are borne by the subscribers through a combination of individual subscriptions (often costing hundreds of dollars), institutional site access licensing agreements (where university library systems and companies pay for online access to the journals), and pay-per-view (where individual papers can be purchased).
- Open access – The same process is used as before; however, the submission and peer-review process is underwritten by the submitter.
- ArXiv.org – No peer-review process is performed; instead, an article’s impact is gauged by the number of times the paper is accessed.
As a software engineer who works on open source scientific applications and frameworks, when I look at this I scratch my head and wonder, “why don’t they just do the equivalent of a code review?” And that’s really where the germ of the idea behind this blog post started. What if the scientific publishing process were more like an open source project? How would the need for peer review be balanced with the need to publish? Who should bear the costs? Can a publishing model be created that minimizes bias and allows good ideas to emerge in the face of scientific groupthink?
Would the following approach work?
- Scientists submit a “paper” to the “publisher”. The contents of the paper contain the same basic sections that you currently see, with the following differences:
- A “data” section would contain the raw data collected for the experiment.
- A “workflow” section would contain a workflow (or a link to a myExperiment workflow) which the reader would be able to review and execute.
- The “references” section could be viewed as a graph that the user could browse. This would let you see a graphical representation of the references, which would show impact factors (and digg scores) for each reference. This gives the reader some idea of how solid the footing for the research is.
- An “authors” section would allow the reader to see previous publication history, and perhaps link to the equivalent of LinkedIn for Scientists. This would let the reader get a sense of the authors’ standing in the community with respect to a particular scientific niche.
- The paper can be annotated by the casual reader, or reviewed by a professional with a publication history similar to that of the submitter. The latter allows an expert review to be performed, with the cost of that review borne by the submitter. The expert would have two incentives: payment for the review work, and the fact that the ability to publish would itself be predicated on a history of performing peer reviews. In order to publish, you must have previously participated in peer reviews. Reviewers would be able to comment on sections, highlight areas of the paper that needed work, and so on. Think of it as the scientific equivalent of a code review, using tools similar to Atlassian’s Crucible.
- Reviewers would be able to comment on the data and the workflow. Tools like Google Docs spreadsheets already allow you to insert comments on individual cells. You could have both the initial raw data and a data set that had the outliers removed, with each outlier carrying a comment telling the reviewer why it was removed (“clogged pipetter tip”, “edge effect on 96 well plate”, etc.). If the reviewer executed the workflow, they would be able to leave comments on the individual steps used in the workflow. If the writer had selected a processor that performed curve-fitting using a particular algorithm, and the reviewer knew that another algorithm was better suited to the data, the reviewer could add the new algorithm to the workflow and have it automatically versioned to distinguish it from the original workflow.
- Reviewers would also be able to “digg” an article (similar to the model used on the social news site Digg.com). Each digg has both an overt weight (supplied by the reviewer) and a weight based on the reputation of the reviewer. Users would be able to see both the overt score and the weighted score. This means that even if all of the heavyweights in your field reject the thesis of your paper, it can still be balanced out by the overt score given by non-experts in the field. So if you’re the next Einstein submitting to a hostile audience, your paper can still be published.
- The submitter would be able to see the comments in one cohesive view (similar to end-notes in a paper), or see them within the context of the paper. They would also be able to respond to comments in the manner of a threaded discussion, and create “bug reports” (basically a task list of fixes needed for the paper).
- When the paper is updated with any corrections or changes, the reviewers would be automatically notified, just as you are notified when a bug you reported is fixed.
- Readers and reviewers alike would be able to see the paper in its original language or translated into another language via something like Google Translate. I remember reading that a Japanese meteorologist had described the Jet Stream years before in a Japanese journal, but the rest of the community had never seen the article.
- The “paper” could be versioned. Just as with software, the author should be able to version parts of the paper and create a new “release” of the paper after a significant number of changes had been made.
- The complete text of the paper (and all associated data) could be published to PubMed Central after the peer-review process had been completed.
- The paper should be navigable as well as readable. The reader could look at the paper as a navigable set of nodes like a mindmap, and then switch to a view that resembles how the paper might appear in a journal.
- Using the references and the content of the paper, a user could “cluster” this paper with similar papers. A user’s publication history could also be used to find new papers of interest.
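The dual digg score described above, an overt score plus a reputation-weighted score, can be sketched in a few lines. This is a hypothetical illustration: the `Vote` class, the 0.0–1.0 reputation scale, and the function names are my assumptions, not part of any existing system.

```python
# Sketch of the dual-score "digg" model: each vote carries an overt weight
# chosen by the reviewer plus an implicit weight from the reviewer's reputation.
from dataclasses import dataclass

@dataclass
class Vote:
    overt: int         # score the reviewer explicitly assigns, e.g. -1, 0, +1
    reputation: float  # reviewer's standing in the field, 0.0 (unknown) to 1.0 (expert)

def score_paper(votes):
    """Return (overt_score, weighted_score) for a list of Votes."""
    overt = sum(v.overt for v in votes)
    weighted = sum(v.overt * v.reputation for v in votes)
    return overt, weighted

# Three hostile experts outvoted by five supportive non-experts:
votes = [Vote(-1, 0.9), Vote(-1, 0.8), Vote(-1, 0.9)] + [Vote(+1, 0.1)] * 5
overt, weighted = score_paper(votes)
# overt comes out positive (+2) even though weighted is negative (-2.1),
# so a paper rejected by the field's heavyweights can still surface.
```

Displaying both numbers side by side, rather than collapsing them into one rank, is what lets the “next Einstein” scenario work.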
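The reference-based clustering idea can likewise be sketched. Jaccard similarity over reference sets is my choice of metric here; the proposal doesn’t specify one, and a real system would probably combine it with full-text similarity.

```python
# Minimal sketch of reference-based clustering: papers whose reference
# lists overlap heavily are grouped together.

def jaccard(a, b):
    """Similarity between two sets of references, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similar_papers(target_refs, corpus, threshold=0.3):
    """Return ids of papers in corpus whose references overlap the target's."""
    return [pid for pid, refs in corpus.items()
            if jaccard(target_refs, refs) >= threshold]

corpus = {
    "paperA": {"ref1", "ref2", "ref3"},
    "paperB": {"ref2", "ref3", "ref4"},
    "paperC": {"ref9"},
}
print(similar_papers({"ref1", "ref2", "ref3"}, corpus))  # ['paperA', 'paperB']
```

The same similarity function could drive the “new papers of interest” feature by treating a user’s own publication history as the target set.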
As with the current open access approach, most of the costs are borne by the submitter. These costs are usually factored into the grant, so in that respect nothing would change. Some open questions remain:
- Data comes in a variety of formats — can the data be viewed without having to install lots of plugins? Let’s say that your paper described a new putative pathway. Your “data” section might contain raw microarray data straight off the instrument, a CSV version of that file, perhaps an SBML version of the pathway that you constructed, and a visualization of that pathway in SVG. If you used a specific version of Cytoscape and a specific version of a Cytoscape plugin to create the pathway, you would need to provide links to both. What if one of those links could no longer be resolved? Do I really want to install an old version of Cytoscape and an old version of the plugin in order to see this data? Could a combination of Fresnel lenses and Taverna workflows solve this problem?
- Do journals make any sense in this model? Why not just tag the article and anyone subscribing to that tag would see the article in their own personalized journal?
- What happens if you can’t afford to publish the article and have it reviewed? Could professional societies create a general publication fund?
- Does it make for better science? All of the publishing models have some degree of bias associated with them. Does this model address those, or just create new biases?
What do you think? Is this a workable model?