Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed under CC-BY. Find out more here.

/images/ONS-SCD.png

Validity of measurement

I have needed to describe validity recently and found it useful to paraphrase some of this statistics blog post: http://andrewgelman.com/2015/04/28/whats-important-thing-statistics-thats-not-textbooks/

I’ve been working a lot on air pollution modelling recently, where ‘validation’ is used to assess how well the modelled (predicted) pollution values represent the pollution actually observed. I tend to think of validity as formalised in statistical terms, i.e. as correlations between different measurements of the same thing, or between a measurement and the ‘truth’, and statistics are used for assessing and calibrating measurements.
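
As a minimal sketch of what I mean (in R, with made-up numbers rather than real monitoring data), validity can be summarised as the correlation between modelled and observed values, and a simple regression can be used for calibration:

set.seed(123)
## hypothetical 'observed' pollution values from monitors
observed <- rnorm(100, mean = 8, sd = 2)
## hypothetical model predictions = observations plus prediction error
modelled <- observed + rnorm(100, mean = 0, sd = 1)

## validity as a correlation between two measurements of the same thing
cor(observed, modelled)

## a calibration model: regress the observations on the predictions
summary(lm(observed ~ modelled))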

I am guessing that when applied to the validity behind a research proposal, the issue might be whether the measurement is suitable for addressing the question the researcher is interested in, and therefore whether it supports the researcher in making valid inferences from the output of statistical methods. I have heard lots of anecdotes from statisticians who, when asked to help analyse data, gave advice that essentially amounted to this: for a valid analysis that addresses the research question, one would really need to have collected different measurements.

Posted in  disentangle statistical modelling


We have a statistically rigorous and scientifically meaningful definition of replication. Let's use it

Researchers writing about the ‘Reproducibility Crisis’ often conflate the terms reproducibility, repeatability and replicability, but it is quite important to distinguish these. There is a great discussion of the distinction in the SimplyStatistics blog post titled ‘We need a statistically rigorous and scientifically meaningful definition of replication’. But I actually now think we DO have that definition! It is that repeatability is the same as replication and involves a new sample with new measurement errors, while reproducibility uses the same data to recalculate the result.

I think it is vital that we labour the point so that the distinction between repeatability and reproducibility is made clear.

I follow the definition that ‘reproducible’ means exact re-computation, whereas repeatable/replicable means a new analysis, of a new sample, yielding a new result (plus or minus some variance from measurement error). If the original study is replicated, then the same conclusions are reached from the new analysis.
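
A toy simulation (in R, with made-up numbers) makes this distinction concrete: reproducing recomputes an identical result from the same data, while replicating draws a new sample with new measurement errors and should give a consistent, but not identical, result:

set.seed(1)
true_effect <- 0.5

## the original 'study': one sample with measurement error
original_sample <- rnorm(50, mean = true_effect, sd = 1)
original_result <- mean(original_sample)

## reproduce: same data, same pipeline, identical result
reproduced_result <- mean(original_sample)
identical(original_result, reproduced_result)  # TRUE

## replicate: a new sample with new measurement errors, consistent but not identical
replication_sample <- rnorm(50, mean = true_effect, sd = 1)
replication_result <- mean(replication_sample)
c(original = original_result, replication = replication_result)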

I used this paragraph and recent reference in my thesis:

Reproducibility is defined as ‘the ability to recompute data analytic
results given an observed dataset and knowledge of the data analysis
pipeline’ (Leek & Peng 2015). This definition distinguishes
reproducibility from replicability which is ‘the chance that an
independent experiment targeting the same scientific question will
produce a consistent result’ (Leek & Peng 2015).  

Leek, J.T. & Peng, R.D. (2015). Opinion: Reproducible research can
still be wrong: Adopting a prevention approach. Proceedings of the
National Academy of Sciences of the United States of America, 112(6),
1645–1646.  

But I am also a big fan of the definitions in Peng 2011 and Cassey 2006.

  • Peng 2011:
With replication, independent investigators address a scientific
hypothesis and build up evidence for or against it...

Reproducibility calls for the data and computer code used to analyze
the data be made available to others. This standard falls short of
full replication because the same data are analyzed again, rather than
analyzing independently collected data.

Peng, R. D. (2011). Reproducible research in computational
science. Science, 334(6060), 1226–1227. doi:10.1126/science.1213847

  • Cassey 2006:
[For a repeatable study] a third party must be able to perform a study
using identical methodological protocols and analyze the resulting
data in an identical manner... [Further] a published result
must be presented in a manner that allows for a quantitative
comparison in a later study...

We consider a study reproducible if, from the information presented in
the study, a third party could replicate the reported results
identically.

Cassey, P., & Blackburn, T. M. (2006). Reproducibility and
Repeatability in Ecology. BioScience,
56(12), 958. doi:10.1641/0006-3568

To fully endorse any scientific claims, the experimental findings should be completely repeatable by many independent investigators who ‘address a scientific hypothesis and build up evidence for or against it’ (Peng, 2011). It is important to note that the results need not be computed exactly in a repeated study. This is because experimentation involves probability: if the experiment is performed again, with a different sample and a new set of measurement errors, some variance between experiments is to be expected.

Posted in  disentangle reproducible research


Exemplars of distributing data and code - rOpenSci

The practice of distributing data and code can be approached in a wide variety of ways. There are many resources available for posting data and code to the internet for dissemination, and these resources are very easy to access. It is more difficult to find exemplars of how data and code can be easily and effectively distributed. I am conducting a review of some of the resources that describe procedures for this, and will present exemplars in this and following notes.

A paper that describes the rOpenSci project’s approach is http://dx.doi.org/10.5334/jors.bu:

Boettiger, C., Chamberlain, S., Hart, E., & Ram,
K. (2015). Building Software, Building Community: Lessons from the
rOpenSci Project. Journal of Open Research Software,
3(1). 

This paper focuses on how the community development and capacity building parts of the project were conducted. The diagram shown below is introduced as an example of the style in which rOpenSci recommend constructing a data analysis workflow.

/images/workflow-ropensci.png

The focus on publishing data to a public repository so early in the project (prior to the final analysis and manuscript) seems premature to me. But then, I suspect I am rather more concerned about vexatious activity by climate sceptics than the rOpenSci team are.

Posted in  disentangle reproducible research pipelines


My Framework Of Scientific Workflow And Integration Software For Holistic Data Analysis

*   /home/
**    /overview.org 
           - summary data_inventory
           - DMP
**    /worklog.org    
           - YYYY-MM-DD
*   /projects/
**    /project1_data_analysis_project_health_research
***       /dataset1_merged_health_outcomes_and_exposures
             - index.org
             - git (local private, gitignore all subfolders)
             - workplan
             - worklog
             - workflow
             - main.Rmd
****         /data1_provided
****         /data2_derived
*****            - workflow script
****         /code
****         /results/  (this has all the pathways explored)
*****           - README.md
                 - git (public Github)
                 /YYYY-MM-DD-shortname (e.g. EDA, prelim, model-selection, sensitivity)
                     /main.Rmd
                     /code/
                     /data/
****         /report/
                   /manuscript.Rmd
                     - main results recomputed in production/publication quality
                     - supporting_information (but also can refer to github/results)
                 /figures_and_tables/
                     - png
                     - csv
*****           /journal_submission/
                     - cover letter
                     - approval signatures
                     - submitted manuscript
*****           /journal_revision/
                     - response.org
**    /project2_data_analysis_project_exposure_assessment
           - index.org
           - git
***       /dataset2.1_monitored_data
              - workplan
              - worklog
              - workflow
****         /data1_provided
****         /data2_derived 
                 - stored here or
                 - web2py crud or
                 - geoserver
              /data1_and_data2_backups
              /reports/
                 - manuscript.Rmd -> publish with the data somehow
              /tools (R package)
                 - git/master -> Github
****      /dataset2.2_GIS_layers 
**    /methods_or_literature_review_project
*  /tools/
         /web2py
             /applications
                 /data_inventory
                     - holdings
                     - prospective
                 /database_crud
          /disentangle (R package)
          /pipeline_templates
**   /data/
         /postgis_hanigan
         /postgis_anu_gislibrary
         /geoserver_anu_gislibrary
**   /references/
         - mendeley
         - bib
         - PDFs annotated
**   /KeplerData/workflows/MyWorkflows/
***      /data_analysis_workflow_using_kepler (implemented as an R package)
****         /inst/doc/A01_load.R
***      /data_analysis_workflow_using_kepler (implemented as an R LCFD workflow)
             - main.Rmd (raw R version)
             - main.xml (this is kepler)
****         /data/
                 - file1.csv
                 - file2.csv
****         /code/
                 - load.R


Scientific workflow and integration software for holistic data analysis (SWISH) is a title I have given to describe the area of my research that focuses on the tools and techniques of reproducible data analysis.

Reproducibility is the ability to recompute the results of a data analysis with the original data. It is possible to have analyses that are reproducible with varying degrees of difficulty. A data analysis might be reproducible but require thousands of hours of work to piece together the datasets, transformations, manipulations, calculations and interpretations of computational results. A primary challenge to reproducible data analysis is to make analyses that are easy to reproduce.

To achieve this, a guiding principle is that analysts should implement effective ‘pipelines’ of method steps and tools. Data analysts should employ standardised, evidence-based methods built on conventions that develop when many analysts approach problems in a similar way, rather than each analyst configuring pipelines to suit individual or domain-specific preferences.

Planning and implementing a pipeline

It can be much easier to conceptualise a complicated data analysis method than to implement it as a reproducible research pipeline. The most effective way to implement a pipeline is to methodically track each step taken, the data inputs it needs and all the outputs it produces. If this is done in a disciplined way, the analyst or another person can easily ‘audit’ the procedure and access whatever details of the pipeline they need to scrutinise.
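
As a sketch of the kind of tracking I mean (the helper function and file names here are hypothetical, not part of any package), each step can be logged with its inputs and outputs so the pipeline can be audited later:

## hypothetical step-tracking helper: appends one row per step to a log file
log_step <- function(step, inputs, outputs, logfile = "workflow.csv") {
  entry <- data.frame(
    date    = as.character(Sys.Date()),
    step    = step,
    inputs  = paste(inputs, collapse = "; "),
    outputs = paste(outputs, collapse = "; "),
    stringsAsFactors = FALSE
  )
  write.table(entry, logfile, sep = ",", row.names = FALSE,
              col.names = !file.exists(logfile), append = file.exists(logfile))
}

## e.g. record a cleaning step along with the files it read and wrote
log_step("clean exposure data",
         inputs  = "data1_provided/monitors_raw.csv",
         outputs = "data2_derived/monitors_clean.csv")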

Toward a standardised data analysis pipeline framework

In my own work I have tried a diverse variety of configurations, based on things I have read and discussions I have had. Coming to the end of my PhD project, I have reflected on the framework I have arrived at and present it above as a schematic overview.
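
As a minimal sketch of how the dataset-level part of this skeleton could be created programmatically (the folder names are taken from the schematic above, but the helper itself is just hypothetical):

## hypothetical helper to create the dataset-level skeleton shown in the schematic
make_skeleton <- function(root = "dataset1_merged_health_outcomes_and_exposures") {
  subdirs <- c("data1_provided", "data2_derived", "code", "results", "report")
  for (d in subdirs) {
    dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  }
  ## stub planning documents kept at the top of the dataset folder
  file.create(file.path(root, c("index.org", "workplan.org", "worklog.org", "main.Rmd")))
}
make_skeleton()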

Posted in  data management swish


This is an open notebook but selected content delayed

Open Notebook Science, Selected Content, Delayed (ONS-SCD)

I am trying to juggle my work in a dual Open-And-Closed way.

To explain: I try to keep an electronic ‘Open Notebook’ that aligns with the principles of the Open Notebook Science (ONS) movement’s ‘Selected Content – Delayed’ category (ONS-SCD). Back in 2012, when I started my notebook, I looked around for models of the style of publication I wanted. I knew that some of my work was owned by the university I work at, and I am not allowed to publish that openly. Then there is other material that I own as part of my PhD, but for which I might not want to release all the details. So I settled on the ‘Selected Content - Delayed’ category and got the logo shown here from the (now-defunct) website http://onsclaims.wikispaces.com/. The ONS movement is still described on Wikipedia, though: https://en.wikipedia.org/wiki/Open_notebook_science.

/images/ONS-SCD.png

In this publication model I make publicly available the content of my research notebook (like a blog), in which I write reports of the details of the data, code and documents related to my research. I selectively make material open on github, and I sometimes delay publication of the material that I keep in my private research notebook. That work is kept private either because it includes unpublished work that I wish to keep embargoed until after publication, or because it is all the gory details of the process of writing code to create or analyse data, which is not appropriate for open publication.

In previous work I have either paid for additional private repos on github and made the repo open once the paper was published, or used bitbucket (which offers unlimited free private repos for university students) and then put together a public repo for sharing ‘polished’ outputs.

The upshot is that I use this part-open/part-closed approach during the data exploration, cleaning, analysis and writing. In my opinion, as long as the final workflow is clearly and openly documented and reproducible, that is what matters most.

The motivation goes back to the Climategate scandal and the infamous ‘Harry Readme’ file

My supervisors over the years have all been really supportive of working in an open way, and I have flirted with the idea of being completely open. However, I got a little worried about the implications of working too openly when malicious people might dig through my work for vexatious reasons, such as looking for errors or embarrassing comments I might inadvertently make that, taken out of context, might make me sound foolish.

This sounds far-fetched, but as an example, a few years ago there was a fair amount of heat generated by a large number of emails and other documents leaked from the Climatic Research Unit at the University of East Anglia. I was particularly interested because at the time I was struggling to make sense of a lot of weird and wonderful databases, and I felt a lot of sympathy for ‘Harry’, someone who as far as I could tell was doing a pretty good job of exploring, cleaning and documenting their work.

Here is one journalist’s summary of the issue http://blogs.telegraph.co.uk/technology/iandouglas/100004334/harry_read_me-txt-the-climategate-gun-that-does-not-smoke/:


the contents of the harry_read_me.txt file, apparently leaked from the
University of East Anglia and now becoming a totem for climate change
sceptics to gather around as though it were a piece of the true cross.

This file – thousands of lines of annotations kept on the process of
re-developing a computer model of the climate from figures submitted
by weather stations around the world and other historical data sets –
holds a personal commentary written by an un-named developer (let's
call him Harry), frustrated and often tied up in knots, working late
into the night and the weekend trying to squeeze differently-formatted
numbers into a consistent narrative.  

Using git and Github in an ONS-SCD model

/projectname (eg msms)/
    /doc/
        /ms-analysis.html 
        /paper/
            /msms.tex
            /msms.pdf
    /data/
        /YYYY-MM-DD/
            /yeast/
                /README
                /yeast.sqt
            /worm/
                /README
                /worm.sqt
    /src/
        /ms-analysis.c
    /bin/
        /parse-sqt.py
    /results/
        /notebook.html 
        /YYYY-MM-DD-1/
            /runall
            /split1/
            /split2/
        /YYYY-MM-DD-2/
            /runall

I want to publish my results, rather than my process


  • The 'Experiment Results' level is about work you might do on a single day, or over a week.
  • Workflow scripts: at this level each 'experiment' is written up in chronological order, as entries to the Worklog at the meso level.
  • Noble recommends 'create either a README file, in which I store every command line that I used while performing the experiment, or a driver script (I usually call this runall) that carries out the entire experiment automatically'... (see the sketch after this list)
  • ...and 'you should end up with a file that is parallel to the lab notebook entry. The lab notebook contains a prose description of the experiment, whereas the driver script contains all the gory details.'
  • This is the level at which I usually think about managing the distribution side of things. I will want to pack up the results and email them to my collaborators, or decide on the one set of tables and figures to write into the manuscript for submission to a journal. If this is accepted for publication, this is the one combined package of 'analytical data and code' that I would consider putting up online (to github) as supporting information for the paper.
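
A minimal sketch of a runall-style driver script, written here as an R script in the spirit of the LCFD layout used elsewhere in this notebook (load.R appears in my framework above; the other file names are assumptions):

## runall.R: hypothetical driver script that re-runs the whole 'experiment' in order,
## parallel to the prose entry in the lab notebook / worklog
source("code/load.R")   # read the raw data from data/
source("code/clean.R")  # tidy and merge into the analysis dataset
source("code/func.R")   # define the functions used by the analysis
source("code/do.R")     # fit models and write tables and figures to results/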

Public Github repo within a private local overview git repo: My setup

  • I mostly use a single Emacs orgmode file to run the whole project, using tangle to send chunks of code to scripts after testing them out using the library of babel
  • To keep this version controlled I created a git repo for it
  • To test this out I have created a fake-data-analysis-project, which includes a local git repository
  • In the .gitignore file I added the pattern * to ignore all subfolders and files
  • If I want to add files to this repo I need to use git add -f thefile
  • Then I create a public github repo in the results folder (I named the repo THE-PROJECT-NAME-results)

$ cd ~/projects/fake-data-analysis-project
$ mkdir results
$ cd results/
# initialise a fresh local repo inside the results folder
/results$ git init
Initialized empty Git repository in /home/ivan_hanigan/tools/ReproducibleResearchPipelineTemplate/results/.git/
# one dated subfolder per 'experiment', e.g. exploratory data analysis
/results$ mkdir 2015-12-20-eda
# point the repo at the public github remote
/results$ git remote add origin git@github.com:ivanhanigan/ReproducibleResearchPipelineTemplate-results.git
# (a first commit is needed before this push will succeed)
$ git push -u origin master

The Result

Posted in  disentangle