Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licenced with CC-BY. Find out more Here.

This is an open notebook but selected content delayed

Open Notebook Science, Selected Content, Delayed (ONS-SCD)

I am trying to juggle my work in a dual Open-And-Closed way.

To explain: I try to keep an electronic ‘Open Notebook’ that aligns with the principles of the Open Notebook Science (ONS) movement’s ‘Selected Content – Delayed’ category (ONS-SCD). Back in 2012, when I started my notebook, I looked around for models of the style of publication I wanted. I knew that some of my work was owned by the university I work at, and that I am not allowed to publish it openly. Then there is other work that I owned as part of my PhD, but for which I might not want to release all the details. So I settled on the ‘Selected Content - Delayed’ category and got the logo shown here from the (now-defunct) website http://onsclaims.wikispaces.com/. The ONS movement is still described on Wikipedia, though: https://en.wikipedia.org/wiki/Open_notebook_science.

/images/ONS-SCD.png

In this publication model I make publicly available the content of my research notebook (like a blog), in which I write reports of the details of the data, code and documents related to my research. I selectively make material open on GitHub, and I sometimes delay publication of the material that I keep in my private research notebook. That work is kept private either because it includes unpublished work that I wish to keep embargoed until after publication, or because it is all the gory details of the process of writing code to create or analyse data, which are not appropriate for open publication.

In previous work I have either paid for additional private repos on GitHub and made the repo open once the paper was published, or alternatively used Bitbucket, with its unlimited free private repos for university students, and then put together a public repo for sharing ‘polished’ outputs.

The upshot is that I use this part open / part closed approach during the data exploration, cleaning, analysis and writing. In my opinion as long as the final workflow is clearly and openly documented and reproducible, that’s the most important thing.

The motivation stems back to the Climategate scandal and infamous ‘Harry Readme’ file

My supervisors over the years have all been really supportive of working in an open way, and I have flirted with the idea of being completely open. However, I got a little worried about the implications of working too openly when malicious people might dig through my work for vexatious reasons, such as looking for errors or embarrassing comments I might inadvertently make that, taken out of context, might make me sound foolish.

This sounds far-fetched, but as an example, a few years ago there was a fair amount of heat generated by a lot of emails and other documents from the University of East Anglia Climatic Research Unit. I was particularly interested because I was struggling to make sense of a lot of weird and wonderful databases, and I felt a lot of sympathy for ‘Harry’, someone who as far as I could tell was doing a pretty good job of exploring, cleaning and documenting their work.

Here is one journalist's summary of this issue http://blogs.telegraph.co.uk/technology/iandouglas/100004334/harry_read_me-txt-the-climategate-gun-that-does-not-smoke/:


the contents of the harry_read_me.txt file, apparently leaked from the
University of East Anglia and now becoming a totem for climate change
sceptics to gather around as though it were a piece of the true cross.

This file – thousands of lines of annotations kept on the process of
re-developing a computer model of the climate from figures submitted
by weather stations around the world and other historical data sets –
holds a personal commentary written by an un-named developer (let's
call him Harry), frustrated and often tied up in knots, working late
into the night and the weekend trying to squeeze differently-formatted
numbers into a consistent narrative.  

Using git and Github in an ONS-SCD model

/projectname (eg msms)/
    /doc/
        /ms-analysis.html 
        /paper/
            /msms.tex
            /msms.pdf
    /data/
        /YYYY-MM-DD/
            /yeast/
                /README
                /yeast.sqt
            /worm/
                /README
                /worm.sqt
    /src/
        /ms-analysis.c
    /bin/
        /parse-sqt.py
    /results/
        /notebook.html 
        /YYYY-MM-DD-1/
            /runall
            /split1/
            /split2/
        /YYYY-MM-DD-2/
            /runall

I want to publish my results, rather than my process


    The 'Experiment Results' level is about work you might do on a
    single day, or over a week

    Workflow scripts: At this level each 'experiment' is written up in
    chronological order, as entries to the Worklog at the meso level

    Noble recommends 'create either a README file, in which I store
    every command line that I used while performing the experiment,
    or a driver script (I usually call this runall) that carries out
    the entire experiment automatically'...

    and 'you should end up with a file that is parallel to the lab
    notebook entry. The lab notebook contains a prose description of
    the experiment, whereas the driver script contains all the gory
    details.'

    This is the level at which I usually think about managing the
    distribution side of things. I will want to pack up the results
    and email them to my collaborators, or decide on the one set of
    tables and figures to write into the manuscript for submission to
    a journal. If this is accepted for publication, this is the one
    combined package of 'analytical data and code' that I would
    consider putting up online (to github) as supporting information
    for the paper.
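To make the driver script idea concrete, here is a rough sketch of what a runall file might look like. The run_step helper, the runall.log file and the placeholder commands are hypothetical names of mine and not from Noble's paper; a real experiment would call the actual analysis programs instead.

```shell
#!/bin/sh
# runall: hypothetical driver script for one dated experiment directory.
# EXPERIMENT_DIR falls back to a temp directory so the sketch is safe to run.
set -eu

EXPERIMENT_DIR="${EXPERIMENT_DIR:-$(mktemp -d)}"
LOG="$EXPERIMENT_DIR/runall.log"

# run_step appends the command line to the log, then executes it,
# so the log doubles as the README of every command that was run.
run_step() {
    printf '%s $ %s\n' "$(date +%Y-%m-%dT%H:%M:%S)" "$*" >> "$LOG"
    "$@"
}

cd "$EXPERIMENT_DIR"
run_step mkdir -p split1 split2
# Placeholders standing in for the real analysis commands
run_step sh -c 'echo "placeholder result" > split1/result.txt'
run_step sh -c 'echo "placeholder result" > split2/result.txt'

echo "Experiment finished; commands recorded in $LOG"
```

The prose description of why the experiment was run still belongs in the notebook entry; the script and its log only hold the gory details.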

Public Github repo within a private local overview git repo: My setup

  • I mostly use one single Emacs Org mode file to run the whole project, using tangle to send chunks of code to scripts after testing them out using the Library of Babel
  • To keep this version controlled I created a git repo for it
  • To test this out I have created a fake-data-analysis-project, which includes a local git repository
  • In the .gitignore file I added the pattern * to ignore all subfolders and files
  • If I want to add files to this repo I need to use git add -f thefile
  • Then I create a public GitHub repo in the results folder (I named the repo THE-PROJECT-NAME-results)

$ cd ~/projects/fake-data-analysis-project
$ mkdir results
$ cd results/
/results$ git init
Initialized empty Git repository in /home/ivan_hanigan/tools/ReproducibleResearchPipelineTemplate/results/.git/
/results$ mkdir 2015-12-20-eda
/results$ git remote add origin git@github.com:ivanhanigan/ReproducibleResearchPipelineTemplate-results.git
/results$ git push -u origin master
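The whole arrangement can be tried end to end without touching GitHub. The sketch below is an illustration under my own assumptions: all paths are temporary, main.org and the README are made-up files, and a local bare repository stands in for the public remote. Note that a push only succeeds once the results repo has at least one commit, and git cannot commit an empty directory, so the sketch adds a README first.

```shell
#!/bin/sh
# Sketch: a private overview repo that ignores everything by default,
# with a separate results repo nested inside it.
# A local bare repo stands in for the public GitHub remote.
set -eu

WORK="$(mktemp -d)"
PROJECT="$WORK/fake-data-analysis-project"
REMOTE="$WORK/results-remote.git"

# Identity for the throwaway commits below
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.org"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.org"

# 1. Outer repo: ignore everything, then force-add the chosen files
mkdir -p "$PROJECT"
cd "$PROJECT"
git init -q
echo '*' > .gitignore
echo '* Project notes' > main.org
git add -f .gitignore main.org
git commit -q -m "Track the org file only"

# 2. Inner repo holding the shareable results
mkdir -p results/2015-12-20-eda
cd results
git init -q
echo 'Exploratory data analysis' > 2015-12-20-eda/README
git add 2015-12-20-eda/README
git commit -q -m "Add first results"

# 3. Push to the stand-in remote, using whatever the default branch is called
git init -q --bare "$REMOTE"
git remote add origin "$REMOTE"
BRANCH="$(git rev-parse --abbrev-ref HEAD)"
git push -q -u origin "$BRANCH"

# The outer repo reports no changes, because results/ matches the * pattern
cd "$PROJECT"
git status --porcelain
```

Swapping the REMOTE path for a real GitHub URL would publish only the contents of results, while the outer repo and everything it ignores stay local.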

The Result

Posted in  disentangle

