Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed CC-BY. Find out more here.

ONS-SCD.png

Reproducibility vs replication - Definitional variations

There is confusion between the definitions of reproducibility, repeatability and replicability. I strongly feel we need to tackle that head-on and come to an agreed definition. I prefer Peng 2011:

  • Reproducibility is using the same data and getting the exact same result.
  • Replication is getting a new sample and doing the analysis again and getting a similar result.
Peng, R. D. (2011). Reproducible research in computational
science. Science, 334(6060), 1226–1227. doi:10.1126/science.1213847

Many people use these terms interchangeably, or with the opposite meanings. For example, Drummond got them the wrong way around in ‘Drummond, C., 2009. Replicability is not reproducibility: nor is it good science’ http://cogprints.org/7691/7/icmlws09.pdf and then reversed his usage in ‘Reproducible Research: a Dissenting Opinion’ http://cogprints.org/8675/1/ReproducibleResearch.pdf (check out Peng’s reaction: http://simplystatistics.org/2012/11/15/reproducible-research-with-us-or-against-us-3/)

This blog post also gets the definitions the wrong way around: http://jermdemo.blogspot.com.au/2012/12/the-reproducible-research-guilt-trip.html (even though it is quite entertaining to read and has this great picture… I am not sure what the picture means?)

/images/thinker2.jpg

Sure, OK, it is fine that people define things differently to one another but:

The single biggest problem in communication 
is the illusion that it has taken place.
— George Bernard Shaw (via BrainyQuote.com)

We rely on a common definition to ensure we are talking about the same thing.

/images/communication-bnewell.png

Source: Newell, B. (2012). Simple models, powerful ideas: Towards effective integrative practice. Global Environmental Change, 22(3), 776–783. http://dx.doi.org/10.1016/j.gloenvcha.2012.03.006

It is regrettable that in Ecology (my favourite discipline) there seems to be quite a wide gap between various authors’ definitions. In CASSEY, P., & BLACKBURN, T. M. (2006). Reproducibility and Repeatability in Ecology. BioScience, 56(12), 958. http://bioscience.oxfordjournals.org/content/56/12/958.full the definitions accord with Peng 2011. However another author gives a very confused and overlapping view:


because that context changes through time and space, it is virtually
impossible to reproduce precisely or quantitatively any single
experimental or observational field study in ecology. Yet many
ecological studies can be repeated. In particular, ecological
synthesis – the assembly of derived datasets and their subsequent
analysis, re-analysis, and meta-analysis – should be easy to repeat
and reproduce

Ellison, A. (2010). Repeatability and Transparency in Ecological Research. Ecology. https://dash.harvard.edu/bitstream/handle/1/3123279/Ellison_Repeatability.pdf?sequence=2 Accessed 12 Jan 16

In another interesting approach, Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The Economics of Reproducibility in Preclinical Research. PLOS Biology, 13(6), e1002165. http://dx.doi.org/10.1371/journal.pbio.1002165 chose instead to define irreproducibility, as a term

that encompasses the existence and propagation of one or more errors,
flaws, inadequacies, or omissions (collectively referred to as errors) 
that prevent replication of results

This leaves us to assume that the opposite is therefore reproducibility, although they avoid defining that themselves. Looking back at the two heads in the picture above… it is interesting to ponder how some people would receive the signal of Freedman et al., who define the opposite of the thing that is the object of their discussion, rather than the thing itself!

Let’s all agree with Peng and Cassey/Blackburn and move on already!

Posted in  disentangle


Validity of measurement

I have needed to describe validity recently and found it useful to paraphrase some of this statistics blog post: http://andrewgelman.com/2015/04/28/whats-important-thing-statistics-thats-not-textbooks/

I’ve been working a lot on air pollution modelling recently, where ‘validation’ is used to assess how well the modelled (predicted) pollution values represent the pollution actually observed. I tend to think of validity as formalised in statistical terms, i.e. as correlations between different measurements of the same thing, or between measurement and ‘truth’, and statistics are used for assessing and calibrating measurements.
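To make this concrete, here is a minimal sketch in R of the kind of validation statistics I mean, with made-up predicted and observed values (all numbers and object names are illustrative, not from my actual pollution models):

    # Sketch: validation statistics for modelled vs observed pollution.
    # The data below are simulated purely for illustration.
    set.seed(42)
    observed  <- rnorm(100, mean = 8, sd = 2)               # e.g. monitored concentrations
    predicted <- observed + rnorm(100, mean = 0.5, sd = 1)  # model with some bias and noise

    # Correlation between the measurement and the 'truth'
    r <- cor(predicted, observed)

    # Root mean squared error and mean bias
    rmse <- sqrt(mean((predicted - observed)^2))
    bias <- mean(predicted - observed)

    # Calibration: regress observed on predicted; an intercept near 0 and
    # a slope near 1 indicate a well-calibrated model
    calib <- lm(observed ~ predicted)

    c(r = r, rmse = rmse, bias = bias)
    coef(calib)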

I am guessing that when applied to the validity behind a research proposal, the issue might be whether the measurement is suitable for addressing the question the researcher (and research question) is interested in, and therefore whether it supports valid inferences from the outcomes of statistical methods. I have heard many anecdotes from statisticians who, when asked to help analyse data, found their advice essentially had to be that a valid analysis of the research question would have required collecting different measurements.

Posted in  disentangle statistical modelling


We have a statistically rigorous and scientifically meaningful definition of replication. Let's use it

Researchers writing about the ‘Reproducibility Crisis’ often conflate the terms reproducibility, repeatability and replicability, but it is quite important to distinguish these. There is a great discussion of the distinction in the SimplyStatistics blog post titled: We need a statistically rigorous and scientifically meaningful definition of replication. But I actually now think we DO have that definition! It is that repeatability is the same as replication and involves a new sample with new measurement errors, while reproducibility uses the same data to recalculate the result.

I think it is vital that we labour the point so that the distinction between repeatability and reproducibility is made clear.

I follow the definition that ‘reproducible’ means exact re-computation, whereas repeatable/replicable means a new analysis, of a new sample, yielding a new result (plus or minus some variance from measurement error); if the original study is replicated then the same conclusions are reached from the new analysis.
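A toy simulation in R makes the distinction concrete (the data are invented for illustration):

    # Reproducibility: re-running the same analysis on the same data
    # returns the exact same result.
    set.seed(1)
    sample1 <- rnorm(50, mean = 10, sd = 2)   # the 'original' sample
    estimate1       <- mean(sample1)
    estimate1_rerun <- mean(sample1)          # same data, same code
    identical(estimate1, estimate1_rerun)     # TRUE: reproduced exactly

    # Replication: a new sample carries new measurement errors, so the
    # estimate is similar but not identical.
    sample2 <- rnorm(50, mean = 10, sd = 2)   # a 'new' sample
    estimate2 <- mean(sample2)
    estimate1 - estimate2                     # small but non-zero difference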

I used this paragraph and recent reference in my thesis:

Reproducibility is defined as ‘the ability to recompute data analytic
results given an observed dataset and knowledge of the data analysis
pipeline’ (Leek & Peng 2015). This definition distinguishes
reproducibility from replicability which is ‘the chance that an
independent experiment targeting the same scientific question will
produce a consistent result’ (Leek & Peng 2015).  

Leek, J.T. & Peng, R.D. (2015). Opinion: Reproducible research can
still be wrong: Adopting a prevention approach. Proceedings of the
National Academy of Sciences of the United States of America, 112(6),
1645–1646.  

But I am also a big fan of the definitions in Peng 2011 and Cassey 2006.

  • Peng 2011:
With replication, independent investigators address a scientific
hypothesis and build up evidence for or against it...

Reproducibility calls for the data and computer code used to analyze
the data be made available to others. This standard falls short of
full replication because the same data are analysed again, rather than
analysing independently collected data.

Peng, R. D. (2011). Reproducible research in computational
science. Science, 334(6060), 1226–1227. doi:10.1126/science.1213847

  • Cassey 2006:
[For a repeatable study] a third party must be able to perform a study
using identical methodological protocols and analyze the resulting
data in an identical manner... [Further] a published result
must be presented in a manner that allows for a quantitative
comparison in a later study...

We consider a study reproducible if, from the information presented in
the study, a third party could replicate the reported results
identically. 

CASSEY, P., & BLACKBURN, T. M. (2006). Reproducibility and
Repeatability in Ecology. BioScience,
56(12), 958. doi:10.1641/0006-3568

To fully endorse any scientific claims, the experimental findings should be completely repeatable by many independent investigators who ‘address a scientific hypothesis and build up evidence for or against it’ (Peng, 2011). It is important to note that the exact results need not be computed in a repeatable study. This is because experimentation involves probability, and if an experiment is performed again, with a different sample and a new set of measurement errors, some variance between experiments is to be expected.

Posted in  disentangle reproducible research


Exemplars of distributing data and code - rOpenSci

The practice of distributing data and code admits a wide variety of approaches. There are many resources available for posting data and code to the internet for dissemination, and these are very easy to access. It is more difficult to find exemplars of how data and code are easily and effectively distributed. I am conducting a review of some of the resources that describe procedures for this, and present exemplars in this and following notes.

A paper that describes the rOpenSci project’s approach is http://dx.doi.org/10.5334/jors.bu:

Boettiger, C., Chamberlain, S., Hart, E., & Ram,
K. (2015). Building Software, Building Community: Lessons from the
rOpenSci Project. Journal of Open Research Software,
3(1). 

This paper focuses on how the community development and capacity building parts of the project were conducted. The diagram shown below is introduced as an example of the style in which rOpenSci recommend a data analysis workflow be constructed.

/images/workflow-ropensci.png

The focus on publishing data to a public repository so early in the project (prior to the final analysis and manuscript) seems premature to me. But then, I do feel that I am somewhat more concerned than the rOpenSci team are about vexatious activity by climate skeptics.
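For what the ‘publish the data’ step might look like in code, here is a package-agnostic sketch in R. To be clear, stage_for_deposit is my own hypothetical helper, not an rOpenSci function; the actual upload would use whatever client the chosen repository provides.

    # Hypothetical sketch: stage a derived dataset for deposit in a public
    # repository, with a checksum so others can verify they hold the same file.
    stage_for_deposit <- function(dat, name, dir = "data2_derived") {
      dir.create(dir, showWarnings = FALSE, recursive = TRUE)
      path <- file.path(dir, paste0(Sys.Date(), "-", name, ".csv"))
      write.csv(dat, path, row.names = FALSE)
      md5 <- unname(tools::md5sum(path))
      writeLines(c(paste("file:", basename(path)),
                   paste("md5:", md5),
                   paste("created:", format(Sys.time()))),
                 file.path(dir, paste0(name, "-README.txt")))
      invisible(path)
    }

    # Example with a built-in dataset standing in for real derived data
    stage_for_deposit(head(airquality), "airquality-demo")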

Posted in  disentangle reproducible research pipelines


My framework of scientific workflow and integration software for holistic data analysis

*   /home/
**    /overview.org 
           - summary data_inventory
           - DMP
**    /worklog.org    
           - YYYY-MM-DD
*   /projects/
**    /project1_data_analysis_project_health_research
***       /dataset1_merged_health_outcomes_and_exposures
             - index.org
             - git (local private, gitignore all subfolders)
             - workplan
             - worklog
             - workflow
             - main.Rmd
****         /data1_provided
****         /data2_derived
*****            - workflow script
****         /code
****         /results/  (this has all the pathways explored)
*****           - README.md
                 - git (public Github)
                 /YYYY-MM-DD-shortname (e.g. EDA, prelim, model-selection, sensitivity)
                     /main.Rmd
                     /code/
                     /data/
****         /report/
                   /manuscript.Rmd
                     - main results recomputed in production/publication quality
                     - supporting_information (but also can refer to github/results)
                 /figures_and_tables/
                     - png
                     - csv
*****           /journal_submission/
                     - cover letter
                     - approval signatures
                     - submitted manuscript
*****           /journal_revision/
                     - response.org
**    /project2_data_analysis_project_exposure_assessment
           - index.org
           - git
***       /dataset2.1_monitored_data
              - workplan
              - worklog
              - workflow
****         /data1_provided
****         /data2_derived 
                 - stored here or
                 - web2py crud or
                 - geoserver
              /data1_and_data2_backups
              /reports/
                 - manuscript.Rmd -> publish with the data somehow
              /tools (R package)
                 - git/master -> Github
****      /dataset2.2_GIS_layers 
**    /methods_or_literature_review_project
*  /tools/
         /web2py
             /applications
                 /data_inventory
                     - holdings
                     - prospective
                 /database_crud
          /disentangle (R package)
          /pipeline_templates
**   /data/
         /postgis_hanigan
         /postgis_anu_gislibrary
         /geoserver_anu_gislibrary
**   /references/
         - mendeley
         - bib
         - PDFs annotated
**   /KeplerData/workflows/MyWorkflows/
***      /data_analysis_workflow_using_kepler (implemented as an R package)
****         /inst/doc/A01_load.R
***      /data_analysis_workflow_using_kepler (implemented as an R LCFD workflow)
             - main.Rmd (raw R version)
             - main.xml (this is kepler)
****         /data/
                 - file1.csv
                 - file2.csv
****         /code/
                 - load.R


Scientific workflow and integration software for holistic data analysis (SWISH) is a title I have given to describe the area of my research that focuses on the tools and techniques of reproducible data analysis.

Reproducibility is the ability to recompute the results of a data analysis with the original data. It is possible to have analyses that are reproducible with varying degrees of difficulty. A data analysis might be reproducible but require thousands of hours of work to piece together the datasets, transformations, manipulations, calculations and interpretations of computational results. A primary challenge to reproducible data analysis is to make analyses that are easy to reproduce.

To achieve this, a guiding principle is that analysts should implement effective ‘pipelines’ of method steps and tools. Data analysts should employ standardised, evidence-based methods built on conventions developed from many analysts approaching problems in a similar way, rather than each analyst configuring pipelines to suit particular individual or domain-specific preferences. One such convention, the LCFD (load, clean, func, do) layout referenced in the schematic above, is sketched below.
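The following collapses the LCFD layout into a single runnable R script for illustration. In practice each step lives in its own file under /code, and the built-in dataset here is only a stand-in:

    # load.R: read the raw data, touch nothing else
    dat_raw <- airquality

    # clean.R: derive the analysis dataset from the raw data
    dat <- subset(dat_raw, !is.na(Ozone) & !is.na(Temp))

    # func.R: analysis functions, kept free of side effects
    fit_model <- function(d) lm(Ozone ~ Temp, data = d)

    # do.R: run the pipeline end to end
    fit <- fit_model(dat)
    summary(fit)$coefficients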

Planning and implementing a pipeline

It can be much easier to conceptualise a complicated data analysis method than to implement it as a reproducible research pipeline. The most effective way to implement a pipeline is to methodically track each of the steps taken, the data inputs it needs and all the outputs of the step. If this is done in a disciplined way then the analyst, or some other person, can ‘audit’ the procedure easily and access the details of the pipeline they need to scrutinise. A sketch of one way to do this follows.
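Here is a minimal sketch of what that tracking could look like in R. The run_step helper and its worklog.csv logging scheme are my own hypothetical convention, not an established tool:

    # Run a pipeline step and log its name, declared inputs and outputs,
    # and a timestamp, so the pipeline can be audited later.
    run_step <- function(step_name, inputs, outputs, expr,
                         log_file = "worklog.csv") {
      result <- eval(expr)   # run the step itself
      entry <- data.frame(step    = step_name,
                          inputs  = paste(inputs, collapse = "; "),
                          outputs = paste(outputs, collapse = "; "),
                          when    = format(Sys.time()))
      have_log <- file.exists(log_file)
      write.table(entry, log_file, sep = ",", row.names = FALSE,
                  col.names = !have_log, append = have_log)
      result
    }

    # Example: a cleaning step, with a built-in dataset as the 'input'
    dat <- run_step("clean",
                    inputs  = "airquality (built-in)",
                    outputs = "dat",
                    quote(subset(airquality, !is.na(Ozone))))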

Toward a standardised data analysis pipeline framework

In my own work I have tried a diverse variety of configurations based on things I have read and discussions I have had. Coming to the end of my PhD project I have reflected on the framework that I have arrived at, and present it as the schematic overview at the top of this note.

Posted in  data management swish