Rationale for the Symposium

Computation has become a vital component of research in the applied areas of mathematics and, through them, in all areas of science and engineering. Academic publications and industrially relevant mathematical results that do not involve some aspect of computational analysis are few and far between. Unfortunately, the software and data that drive this computation are too often developed and managed in a haphazard fashion that is prone to error and difficult to replicate or build upon.

We aim in this workshop to gather speakers to discuss best practices for "reproducible research": the idea that research contributions in the computational sciences involve not only the publication of an article in an academic venue, but also the release of sufficient components of the software and data that the results claimed in the publication can be reproduced and extended by other scientists.


Program

Day 1 (July 13)
all day    Tutorials

Day 2 (July 14)
8:30am     John Wilbanks, "Freedom (to reproduce)"
9:15am     Andrew Davison, "Automated tracking of scientific computations"
10:30am    Roger Peng, "Computational and Policy Tools for Reproducible Research"
11:15am    Juliana Freire, "A Provenance-Based Infrastructure for Creating Reproducible Papers"
1:30pm     Philip Guo, "CDE: A tool for automatically creating reproducible experimental software packages"
2:15pm     Patrick Vandewalle, "Reproducible Research in Signal Processing: How to Increase Impact"
3:30pm     Tiffani Williams, "Paper Mâché: A Novel System for Executing Scientific Papers"

Day 3 (July 15)
8:30am     Tony Hey, "Reproducible Research and Data-Intensive Scientific Discovery"
9:15am     Matan Gavish, "A Universal Identifier for Computational Results"
10:30am    Bill Howe, "Virtual Appliances, Cloud Computing, and Reproducible Research"
11:15am    James Quirk, "In Search of Computational Scholarship: Reproducible Research and Cotton Nero A.X."
1:30pm     Sorin Mitran, "Archiving Computational Research in Virtual Machines"
2:15pm     Jarrod Millman, "The challenge of reproducible research in the computer age"
3:30pm     Victoria Stodden, "What is Reproducible Research? The Practice of Science Today and the Scientific Method"


Speakers

Talks are listed alphabetically by speaker.

Andrew Davison (Unité de Neurosciences Intégratives et Computationnelles, CNRS, Gif-sur-Yvette, France): "Automated tracking of scientific computations" (silverlight video with slide integration, youtube video, slides)

Reproducibility of experiments is one of the foundation stones of science. A related concept is provenance: the ability to track a given scientific result, such as a figure in an article, back through all the analysis steps (verifying the correctness of each) to the original raw data and the experimental protocol used to obtain it. In computational, simulation- or numerical-analysis-based science, reproduction of previous experiments and establishment of the provenance of results ought to be easy, given that computers are deterministic and do not suffer from the inter-subject and trial-to-trial variability that makes reproduction of biological experiments, for example, more challenging. In general, however, it is not easy, owing to the complexity of our code and our computing environments, and the difficulty of capturing every essential piece of information needed to reproduce a computational experiment using existing tools such as spreadsheets, version control systems and paper notebooks.

To ensure reproducibility of a computational experiment we need to record: (i) the code that was run, (ii) any parameter files and command-line options, (iii) the platform on which the code was run, and (iv) the outputs. To keep track of a research project with many hundreds or thousands of simulations and/or analyses, it is also useful to record (i) the reason for which the simulation/analysis was run and (ii) a summary of its outcome. Recording the code might mean storing a copy of the executable, or the source code (including that of any libraries used), the compiler used (including its version) and the compilation procedure (e.g. the Makefile). For interpreted code, it might mean recording the version of the interpreter (and any options used in compiling it) as well as storing a copy of the main script, and of any external modules or packages that are imported by the script. For projects using version control, “storing a copy of the code” may be replaced with “recording the URL of the repository and the revision number”. The platform includes the processor architecture(s), the operating system(s), the number of processors (for distributed simulations), etc.
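
As a rough illustration of the kind of record described above, the following Python sketch (hypothetical, and not tied to any particular tool) captures the code revision, platform, command line and output checksums for a single run, assuming the code is kept in a Git repository:

    # provenance_sketch.py -- minimal, hypothetical record of a single computational run
    import hashlib
    import json
    import platform
    import subprocess
    import sys
    import time

    def git_revision():
        # Revision of the working copy (assumes the code is under Git version control)
        return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

    def checksum(path):
        # SHA-1 digest of an output file, so results can later be matched to this record
        with open(path, "rb") as f:
            return hashlib.sha1(f.read()).hexdigest()

    def record_run(command, parameter_file, outputs, reason=""):
        record = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "command": command,                            # (i) code that was run, (ii) options
            "parameter_file": parameter_file,
            "code_revision": git_revision(),
            "platform": {                                  # (iii) platform
                "machine": platform.machine(),
                "system": platform.platform(),
                "python": platform.python_version(),
            },
            "outputs": {p: checksum(p) for p in outputs},  # (iv) outputs
            "reason": reason,                              # why this run was performed
        }
        with open("run_log.json", "a") as log:
            log.write(json.dumps(record) + "\n")
        return record

    if __name__ == "__main__":
        # Example: log this invocation itself; in practice the launched simulation goes here
        record_run(command=sys.argv, parameter_file=None, outputs=[])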

The traditional way of recording the information necessary to reproduce an experiment is to note down all details in a paper notebook, together with copies or print-outs of any results. More modern approaches may replace or augment the paper notebook with a spreadsheet or other hand-rolled database, but still with the feature that all relevant information is entered by hand. In other areas of science, particularly in applied-science laboratories with high-throughput, highly standardised procedures, electronic lab notebooks and laboratory information management systems (LIMS) are in widespread use, but none of these tools seems well suited to tracking simulation experiments or novel analyses. In developing a tool for tracking simulation experiments/computational analyses, something like an electronic lab notebook for computational science, there are a number of challenges: (i) different researchers have very different ways of working and different workflows: command line, GUI, batch jobs (e.g. in supercomputer environments), or any combination of these for different components (simulation, analysis, graphing, etc.) and phases of a project; (ii) some projects are essentially solo endeavours, others are collaborative projects, possibly distributed geographically; (iii) as much as possible should be recorded automatically, since if it is left to the researcher to record critical details there is a risk that some will be missed or left out, particularly under the pressure of deadlines.

In this talk I will present the solution we are developing to meet the challenges outlined above. Sumatra consists of a core library, implemented in Python, on which are built a command-line interface for launching simulations/analyses with automated recording of provenance information, and a web interface for managing a computational project: browsing, viewing, and annotating simulations/analyses.

Sumatra (i) interacts with version control systems, such as Subversion, Git, Mercurial, or Bazaar, (ii) supports launching serial or distributed (via MPI) computations, (iii) links to data generated by the computation, (iv) aims to support all and any command-line drivable simulation or analysis program, (v) supports both local and networked storage of information, (vi) aims to be extensible, so that components can easily be added for new version control systems, etc., (vii) aims to be very easy to use, otherwise it will only be used by the very conscientious.

Juliana Freire and Claudio Silva (University of Utah): "A Provenance-Based Infrastructure for Creating Reproducible Papers" (silverlight video with slide integration, youtube video, slides)

While computational experiments have become an integral part of the scientific method, it is still a challenge to repeat such experiments, because they often require specific hardware, non-trivial software installation, and complex manipulations to obtain results. Generating and sharing repeatable results takes a lot of work with current tools. Thus, a crucial technical challenge is to make this easier for (i) the author of the paper, (ii) the reviewer of the paper, and, if the author is willing to disseminate code to the community, (iii) the eventual readers of the paper. While a number of tools have been developed that attack sub-problems related to the creation of reproducible papers, no end-to-end solution is available. Besides giving authors the ability to link results to their provenance, such a solution should enable reviewers to assess the correctness and the relevance of the experimental results described in a submitted paper. Furthermore, upon publication, readers should be able to repeat and utilize the computations embedded in the papers. But even when the provenance associated with a result is available and contains a precise and executable specification of the computational process (i.e., a workflow), shipping the specification to be run in an environment different from the one in which it was designed raises many challenges. From hard-coded locations for input data to dependencies on specific versions of software libraries and hardware, adapting a workflow to run in a new environment can be challenging and sometimes impossible.

We posit that integrating data acquisition, derivation, analysis, and visualization as executable components throughout the publication process will make it easier to generate and share repeatable results. To this end, we have built an infrastructure to support the life-cycle of 'reproducible publications'---their creation, review and re-use. In particular, our design considers the following desiderata: Lower Barrier for Adoption---it should help authors in the process of assembling their submissions; Flexibility---it should support multiple mechanisms that give authors different choices as to how to package their work; Support for the Reviewing Process---reviewers should be able to unpack and reproduce the experiments, as well as validate them. We have used VisTrails, a provenance-enabled, workflow-based data exploration tool, as a key component of our infrastructure. We leverage the VisTrails provenance infrastructure to systematically capture useful meta-data, including workflow provenance, source code, and library versions. We have also taken advantage of the extensibility of the system to integrate components and tools that address issues required to support reproducible papers, including: linking results to their provenance; the ability to repeat results, explore parameter spaces, and interact with results through a Web-based interface; and the ability to upgrade the specification of computational experiments to work in different environments and with newer versions of software. In this talk, we outline challenges we have encountered and present some of the components we have developed to address them. We also present a demo showing real-world uses of our infrastructure.
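
As a toy illustration of why such upgrades are needed (the specification format below is hypothetical, not VisTrails' actual format), the following Python sketch stores a workflow's declared library versions and input locations and reports what would have to change before the workflow could run in a new environment:

    # workflow_check_sketch.py -- hypothetical check of a workflow spec against a new environment
    import os
    from importlib.metadata import PackageNotFoundError, version

    # A minimal stand-in for the kind of metadata a provenance-enabled system records
    workflow_spec = {
        "modules": {"numpy": "1.21.0", "matplotlib": "3.4.2"},  # library versions used originally
        "inputs": ["/data/original_machine/measurements.csv"],  # hard-coded input locations
    }

    def report_environment_mismatches(spec):
        # List the library and input-path problems an automatic upgrade would need to resolve
        problems = []
        for name, wanted in spec["modules"].items():
            try:
                found = version(name)
            except PackageNotFoundError:
                problems.append("missing library: %s (workflow used %s)" % (name, wanted))
                continue
            if found != wanted:
                problems.append("version change: %s %s -> %s" % (name, wanted, found))
        for path in spec["inputs"]:
            if not os.path.exists(path):
                problems.append("input not found at recorded location: " + path)
        return problems

    if __name__ == "__main__":
        for problem in report_environment_mismatches(workflow_spec):
            print(problem)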

Philip Guo: "CDE: A tool for automatically creating reproducible experimental software packages" (silverlight video with slide integration, youtube video, slides)

Although there are many social, cultural, and political barriers to reproducible research, the main technical barrier is that it is hard to distribute scientific code in a form that other researchers can easily execute on their own machines. Before your colleagues can run your computational experiments, they must first obtain, install, and configure compatible versions of the appropriate software and its myriad dependent libraries, which is a frustrating and error-prone process.

To eliminate this technical barrier to reproducibility, I have created a tool called CDE that automatically packages up all of the software dependencies required to reproduce your computational experiments on another machine. CDE is easy to use: All you need to do is execute the commands for your experiment under its supervision, and CDE automatically packages up all of the Code, Data, and Environment that your commands accessed. When you send that self-contained package to your colleagues, they can re-run those exact commands on their machines without first installing or configuring anything. Moreover, they can even adjust the parameters in your code and re-run to explore related hypotheses, or run your code on their own datasets to see how well it generalizes.

CDE currently only works on Linux, but the ideas it embodies can be implemented for any operating system. You can download CDE for free at http://www.stanford.edu/~pgbovine/cde.html.
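
CDE does its tracing at the system-call level; the sketch below is not CDE itself, but illustrates the underlying packaging idea on Linux by running a command under strace, collecting the files it successfully opened, and copying them into a self-contained package directory:

    # package_sketch.py -- illustrative only (not CDE): trace a command and copy the files it opens
    import re
    import shutil
    import subprocess
    import sys
    from pathlib import Path

    def package(command, package_dir="package-root"):
        trace_file = "trace.log"
        # Follow child processes and log open/openat calls made by the command
        subprocess.run(["strace", "-f", "-e", "trace=open,openat",
                        "-o", trace_file] + command, check=True)
        opened = set()
        pattern = re.compile(r'open(?:at)?\(.*?"([^"]+)".*\)\s*=\s*\d+')  # successful opens only
        for line in Path(trace_file).read_text().splitlines():
            match = pattern.search(line)
            if match:
                opened.add(match.group(1))
        root = Path(package_dir)
        copied = 0
        for path in opened:
            src = Path(path)
            if src.is_file():
                dest = root / src.resolve().relative_to("/")  # mirror absolute paths under the package
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)
                copied += 1
        print("copied %d files into %s/" % (copied, root))

    if __name__ == "__main__":
        package(sys.argv[1:])  # e.g. python package_sketch.py python my_experiment.py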

Matan Gavish: "A Universal Identifier for Computational Results" (silverlight video with slide integration, youtube video, slides)

When we read online scientific publications, thanks to the notion of the hyperlink and the infrastructure of the web, we can click on a citation and browse the cited work, continue on to the works it itself cites, and so on.

What if we could click on any image or table in a scientific publication and go to a detailed, structured description of its generating computation, and even land precisely on the instruction that created the figure? From there we could stroll up and down the computation tree, browsing other parts of the same computation and other figures created by it. We could read the code, examine intermediate variables, understand the steps that took place, re-execute some parts, and retrieve the original dataset fed into the computation. In fact, what if we could move on to browse the original dataset's own creating computations, and continue to tour the world of computation, all through an entry point provided by one figure of interest in a publication?

Verifiable Computational Research (VCR) is a discipline for computational research. It introduces the notion of the Verifiable Result Identifier (VRI), which, together with today's advanced web infrastructure, turns the above fantasy into a reality. The discipline allows researchers, publishers and publication readers to use the same tools they are already using, and requires only minor changes to these tools. While everyone follows their familiar workflow, a VCR software system works quietly in the background to make it all happen. For example, it automatically brands each publishable result produced by the computation with its own unique VRI. The VRI is at the same time a web URL and a secure digital signature. In any publication or presentation, the interested reader can click on the result, direct a web browser to the URL, or scan a barcode -- and gain entry into the computation that created the result, with the implications described above.
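
A crude way to picture such an identifier (this is not the actual VCR implementation, which issues signed identifiers through a VCR repository) is as a content digest that doubles as a web address:

    # vri_sketch.py -- hypothetical: derive a result identifier that is also a URL
    import hashlib

    def result_identifier(figure_path, repository="https://vcr.example.org"):
        # Digest the bytes of a published figure and turn the digest into a resolvable URL.
        # The real VCR system brands results with signed Verifiable Result Identifiers;
        # this sketch only shows the hash-as-URL idea using a plain SHA-256 digest.
        with open(figure_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        return repository + "/" + digest

    if __name__ == "__main__":
        print(result_identifier("figure1.png"))  # hypothetical figure file from a publication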

For the individual researcher, VCR is an online, self-filling lab journal for computational experiments. For computational science communities, VCR is a disciplined, standard, simple and automatic way to work reproducibly. It imposes simple rules, requires very minimal effort (the software does all the work), and needs absolutely no personal dedication to the reproducible research cause. As such, it might be a big step toward the long-anticipated promise of widely practiced, fully reproducible research.

I'll show an existing implementation of the VCR system, currently in use in the Stanford Statistics Department.

Joint work with D. Donoho.

Tony Hey: "Reproducible Research and Data-Intensive Scientific Discovery" (silverlight video with slide integration, youtube video, slides)

There is a sea change happening in academic research -- a transformation caused by a data deluge that is affecting all disciplines. Modern science increasingly relies on integrated information technologies and computation to collect, process, and analyze complex data. It was Ken Wilson, Nobel Prize winner in physics, who first coined the phrase "Third Paradigm" to refer to computational science and the need for computational researchers to know about algorithms, numerical methods, and parallel architectures. However, the skills needed for manipulating, visualizing, managing, and, finally, conserving and archiving scientific data are very different. "The Fourth Paradigm" is about the computational systems needed to manipulate, visualize, and manage large amounts of scientific data. A wide variety of scientists -- biologists, chemists, physicists, astronomers, engineers -- require tools, technologies, and platforms that seamlessly integrate into standard scientific methodologies and processes. One disturbing emerging trend is the difficulty in enabling scientists other than the authors of scientific papers to be able to replicate the often complex analysis steps required to reach the scientific conclusions of the papers. The talk will illustrate a possible partial solution to the problem of reproducible research based on a joint research project between Microsoft Research and the MIT Broad Institute.
Bill Howe: "Virtual Appliances, Cloud Computing, and Reproducible Research" (silverlight video with slide integration, youtube video, slides)

Science in every discipline is becoming data-intensive, requiring researchers to interact with their data solely through computational and statistical methods as opposed to direct manipulation. Perhaps paradoxically, these in silico experiments are often more difficult to reproduce than traditional "manual" laboratory techniques. Software pipelines used to acquire and process data have complex version-sensitive interdependencies, datasets are too large to efficiently transport from place to place, and interfaces are often complex and underdocumented.

At the UW eScience Institute, we are exploring the use of virtual machines and cloud computing to mitigate these challenges. A virtual machine can capture a researcher's entire working environment as a snapshot, including the data, software, dependencies, intermediate results, logs and other usage history information, operating system and file system context, convenience scripts, and more. These virtual machines can then be saved, made publicly available, and referenced in a publication. This approach not only facilitates reproducibility, but incurs essentially zero overhead for the researcher. Coupled with cloud computing, this approach offers additional benefits: experimenters need not allocate local resources to host the virtual machine, large datasets and long-running computations can be managed efficiently, and resource costs are more easily shared between producer and consumer.
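
For example, if a paper's environment were published as an Amazon Machine Image (the image ID below is hypothetical), a reader could bring up the exact snapshot with a few lines of Python using the boto3 EC2 client, assuming AWS credentials are configured locally:

    # launch_snapshot.py -- sketch: start a published virtual-machine snapshot on EC2
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical image ID cited in a publication
        InstanceType="t3.medium",         # the reader chooses (and pays for) the resources
        MinCount=1,
        MaxCount=1,
    )
    instance_id = response["Instances"][0]["InstanceId"]
    print("launched %s; connect to it to rerun the packaged experiments" % instance_id)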

In this talk, I'll motivate this approach with case studies from our experience and consider some of the implications and future directions.

Jarrod Millman: "The challenge of reproducible research in the computer age" (silverlight video with slide integration, youtube video, slides)

Computing is increasingly central to the practice of mathematical and scientific research. This has provided many new opportunities as well as new challenges. In particular, modern scientific computing has strained the ability of researchers to reproduce their own (as well as their colleagues') work. In this talk, I will outline some of the obstacles to reproducible research as well as some potential solutions and opportunities.
Sorin Mitran: "Archiving Computational Research in Virtual Machines" (silverlight video with slide integration, youtube video, slides)

Several approaches have been taken by computational scientists to ensure open access to their research codes: providing source code, using purpose-built archival systems, or employing literate programming tools. These procedures reflect standard practice in the experimental sciences, where laboratory techniques, supplies and equipment are documented in a research paper. Computational research has one advantage with respect to experimental science: our entire laboratory can be packaged and sent to independent parties for validation of research results. Virtualization has advanced to a stage in which direct access to graphics processing hardware and multi-CPU parallel processing can be included in virtual machines. The entire panoply of open-source tools for scripting and documentation can be included with the virtual machine.

This talk will present experience with this approach in the context of interdisciplinary research that uses two of the author's codes (BEARCLAW and Diapason). Particular attention is paid to documentation and use of the TeXmacs environment to present both theory and implementation of algorithmic ideas.

Roger Peng: "Computational and Policy Tools for Reproducible Research" (silverlight video with slide integration, youtube video, slides,)

The ability to make scientific findings reproducible is increasingly important in areas where substantive results are the product of complex statistical computations. Reproducibility can allow others to verify the published findings and conduct alternate analyses of the same data. A question that arises naturally is: how can one conduct and distribute reproducible research? I describe a simple framework in which reproducible research can be conducted and distributed via cached computations, and describe tools for both authors and readers. As a prototype implementation I describe a software package written in the R language. The 'cacher' package provides tools for caching computational results in a key-value-style database, which can be published to a public repository for readers to download. As a case study I demonstrate the use of the package on a study of ambient air pollution exposure and mortality in the United States. I will also discuss the role that journals can play in encouraging reproducible research and will review the recent reproducibility policy at the journal Biostatistics.
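
The cacher package itself is written in R; purely to illustrate the cache-then-publish idea (this is not the package's interface), the Python sketch below keys each computation by a hash of its source code and inputs and stores the result in a directory that could be copied to a public repository:

    # cache_sketch.py -- illustrative key-value cache of computational results (not the R cacher package)
    import hashlib
    import inspect
    import pickle
    from pathlib import Path

    CACHE_DIR = Path("cache-db")  # this directory is what would be published for readers to download

    def cached(func, *args):
        # Return func(*args), loading the result from the cache if an identical computation was stored
        CACHE_DIR.mkdir(exist_ok=True)
        key_material = inspect.getsource(func).encode() + pickle.dumps(args)
        key = hashlib.sha256(key_material).hexdigest()
        entry = CACHE_DIR / key
        if entry.exists():                       # readers can reuse published results without re-running
            return pickle.loads(entry.read_bytes())
        result = func(*args)
        entry.write_bytes(pickle.dumps(result))
        return result

    def mean_exposure(values):
        return sum(values) / len(values)

    if __name__ == "__main__":
        print(cached(mean_exposure, (10.0, 12.5, 9.8)))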
James Quirk: "In Search of Computational Scholarship: Reproducible Research and Cotton Nero A.X." (silverlight video with slide integration, youtube video, slides)

December 2010, the publishing behemoth Elsevier issued an Executable Paper Grand Challenge: "a contest created to improve the way scientific information is communicated and used." This contest, along with the reproducible research movement, represents a growing realization that computational science is ill-served by traditional journal articles, with their static, typeset text. In this talk I will present some of my executable-paper exploits.

The talk's title is borrowed from a television documentary by my near namesake, James Burke, the noted science historian. He argues that "you see, what your knowledge tells you, you're seeing." And that when your knowledge changes, so your view of the universe changes. Thus my take on executable papers stems from many small dawnings, rather than a one-off Eureka!

December 1984, for instance, while working in the design department of a manufacturer of steam turbines, I received a severe dressing down for a slipshod calculation I had performed. As a result, I view executable papers through a prism of accountability, one which forces me to discuss my exploits through the very framework I use to create them. That way you can examine the associated software details first-hand, and you do not need to take my word on trust.

Here is the executable PDF version of this abstract.

Victoria Stodden: "What is Reproducible Research? The Practice of Science Today and the Scientific Method" (silverlight video with slide integration, youtube video, slides)

Scientific computation is emerging as absolutely central to the scientific method, but the prevalence of very relaxed practices is leading to a credibility crisis in many scientific fields. It is impossible to verify most of the results that computational scientists present at conferences and in papers today. Computational science is error-prone and traditional scientific publication is incapable of finding and rooting out errors in scientific computation.

A necessary response to this crisis is reproducible research -- where all the code and data underlying published results are made openly available. In this talk I discuss the evolution of the practice of science and the necessary corresponding changes in the scientific method, such as reproducibility. I also discuss the Reproducible Research Standard, an open licensing framework designed to facilitate the sharing and reuse of code and data.

Patrick Vandewalle: "Reproducible Research in Signal Processing: How to Increase Impact" (silverlight video with slide integration, youtube video, slides)

Worries about the reproducibility of research results are centuries old, and date back to Descartes' work Discourse on (Scientific) Method. However, in the recently developed computational sciences, new approaches to reproducibility are required. In this presentation, I give an overview of our personal experiences with reproducible research in the field of signal and image processing. I will also present results from the reproducibility study that we did on image processing papers. Next, I discuss some of the typical issues we ran into when making our work reproducible. Finally, I give some indications of the increased impact of research results when they are made reproducible.
John Wilbanks: "Freedom (to reproduce)" (silverlight video with slide integration, youtube video, slides)

In culture and in commerce, networks have put users at the center of design. We can buy, search, post, and connect...maybe too much. But science, education, and other research-driven fields have resisted this transition, continuing to focus on institutions in reacting to the network. The "digital commons" movement puts the user at the center of a legal layer of the network: using commons-based design, it provides the rights needed to reproduce research, and it promotes the use of standards to describe and mark research for reuse.
Tiffani Williams: "Paper Mâché : A Novel System for Executing Scientific Papers" (silverlight video with slide integration, youtube video)

The increased use of computer software in science makes reproducing scientific results increasingly difficult. The research paper in its current state is no longer sufficient to fully reproduce, validate, or review a paper's experimental results and conclusions. We introduce Paper Mâché, a new system for creating dynamic, executable research papers. The key novelty of our system is the use of virtual machines, which allows scientists to view and interact with a paper and reproduce key experimental results. Thus, our system provides a bridge that allows everyone to actively participate in the scientific process.