Open Notebook Science, Selected Content, Delayed (ONS-SCD)
I am trying to juggle my work in a dual Open-And-Closed way.
To explain: I try to keep an electronic ‘Open Notebook’ that aligns with the principles of the Open Notebook Science (ONS) movement’s ‘Selected Content – Delayed’ category (ONS-SCD). Back in 2012, when I started my notebook, I looked around for models of the style of publication I wanted. I knew that some of my work was owned by the university I work at, and that I am not allowed to publish it openly. Then there is other work that I own as part of my PhD, but for which I might not want to release all the details. So I settled on the ‘Selected Content - Delayed’ category and got the logo shown here from the (now-defunct) website http://onsclaims.wikispaces.com/. The ONS movement is still described on Wikipedia, though: https://en.wikipedia.org/wiki/Open_notebook_science.
In this publication model I make publicly available the content of my research notebook (like a blog), in which I write reports of the details of the data, code and documents related to my research. I selectively make material open on github, and I sometimes delay publication of the material that I keep in my private research notebook. That work is kept private either because it includes unpublished work that I wish to keep embargoed until after publication, or because it consists of the gory details of the process of writing code to create or analyse data, which are not appropriate for open publication.
In previous work I have either paid for additional private repos on github and made the repo open once the paper was published, or alternatively used bitbucket, with its unlimited free private repos for university students, and then just put together a public repo for sharing ‘polished’ outputs.
The upshot is that I use this part open / part closed approach during the data exploration, cleaning, analysis and writing. In my opinion as long as the final workflow is clearly and openly documented and reproducible, that’s the most important thing.
The motivation stems back to the Climategate scandal and infamous ‘Harry Readme’ file
My supervisors over the years have all been really supportive of working in an open way and I have flirted with the idea of being completely open. However, I got a little worried about the implications of working too openly when malicious people might dig through my work for vexatious reasons, such as looking for errors or embarrassing comments I might inadvertently make that, when taken out of context, might make me sound foolish.
This sounds far-fetched, but as an example, a few years ago there was a fair amount of heat generated by a lot of emails and other documents from the University of East Anglia Climatic Research Unit. I was particularly interested because I was struggling to make sense of a lot of weird and wonderful databases, and I felt a lot of sympathy for ‘Harry’, someone who as far as I could tell was doing a pretty good job of exploring, cleaning and documenting their work.
Here is one journalist's summary of this issue http://blogs.telegraph.co.uk/technology/iandouglas/100004334/harry_read_me-txt-the-climategate-gun-that-does-not-smoke/:
the contents of the harry_read_me.txt file, apparently leaked from the
University of East Anglia and now becoming a totem for climate change
sceptics to gather around as though it were a piece of the true cross.
This file – thousands of lines of annotations kept on the process of
re-developing a computer model of the climate form figures submitted
by weather stations around the world and other historical data sets –
holds a personal commentary written by an un-named developer (let's
call him Harry), frustrated and often tied up in knots, working late
into the night and the weekend trying to squeeze differently-formatted
numbers into a consistent narrative.
Using git and Github in an ONS-SCD model
- Recall Noble’s framework? The results folder is what I want to publish
- Noble recommended the following folder and file structures http://dx.doi.org/10.1371/journal.pcbi.1000424.g001
- I revised his conceptual diagram, and I blogged about this at /2015/10/a-quick-review-of-a-quick-guide-to-organizing-computational-biology-projects/
/projectname (e.g. msms)/
    /doc/
        /ms-analysis.html
        /paper/
            /msms.tex
            /msms.pdf
    /data/
        /YYYY-MM-DD/
            /yeast/
                /README
                /yeast.sqt
            /worm/
                /README
                /worm.sqt
    /src/
        /ms-analysis.c
    /bin/
        /parse-sqt.py
    /results/
        /notebook.html
        /YYYY-MM-DD-1/
            /runall
            /split1/
            /split2/
        /YYYY-MM-DD-2/
            /runall
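As a rough illustration, a skeleton like the one above could be created with a few shell commands. This is only a sketch: the project name msms is taken from the diagram, while the use of today's date for the dated folders and the empty placeholder files are my assumptions, not part of Noble's guide.

```shell
#!/bin/sh
# Sketch: create an empty project skeleton following the layout above.
# The project name and the date-stamped folder names are placeholders.
set -eu

project="${1:-msms}"
today="$(date +%Y-%m-%d)"

# One folder per top-level area of the project
mkdir -p "$project/doc/paper" \
         "$project/data/$today" \
         "$project/src" \
         "$project/bin" \
         "$project/results/$today-1"

# A README per data snapshot, a notebook, and a driver script per experiment
touch "$project/data/$today/README"
touch "$project/results/notebook.html"
touch "$project/results/$today-1/runall"
chmod +x "$project/results/$today-1/runall"
```

Run with no arguments, this creates the msms folder in the current directory; passing a different name as the first argument uses that name instead.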
I want to publish my results, rather than my process
- I had the realisation, described at /2015/10/how-to-effectively-implement-electronic-lab-notebooks-in-epidemiology/, that:
the 'Experiment Results' level is about work you might do on a
single day, or over a week
Workflow scripts: At this level each 'experiment' is written up in
chronological order, as entries to the Worklog at the meso level
Noble recommends 'create either a README file, in which I store
every command line that I used while performing the experiment,
or a driver script (I usually call this runall) that carries out
the entire experiment automatically'...
and 'you should end up with a file that is parallel to the lab
notebook entry. The lab notebook contains a prose description of
the experiment, whereas the driver script contains all the gory
details.'
This is the level at which I usually think about managing the
distribution side of things. I will want to pack up the results and
email them to my collaborators, or decide on the one set of tables
and figures to write into the manuscript for submission to a
journal. If this is accepted for publication, this is the one
combined package of 'analytical data and code' that I would consider
putting up online (to github) as supporting information for the
paper.
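To make the idea of a driver script concrete, here is a minimal sketch of what a runall file might look like. It is not Noble's script or mine: the run helper, the runall.log file and the placeholder steps are all hypothetical, and only the split1 and split2 folder names come from the layout above.

```shell
#!/bin/sh
# Hypothetical runall driver: reruns one experiment from scratch and
# records every command it executes in a log file.
set -eu

outdir="${1:-.}"
log="$outdir/runall.log"
: > "$log"    # start with an empty log

# Write the command line to the log, then run it
run() {
    printf '%s\n' "$*" >> "$log"
    "$@"
}

# Placeholder steps; a real experiment would call the analysis scripts here
run mkdir -p "$outdir/split1" "$outdir/split2"
run touch "$outdir/split1/result.txt"
run cp "$outdir/split1/result.txt" "$outdir/split2/result.txt"
```

The log then serves as the list of every command line used, while the prose description of the experiment stays in the notebook entry.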
Public Github repo within a private local overview git repo: My setup
- I mostly use one single Emacs orgmode file to run the whole project, using tangle to send chunks of code to scripts, after testing them out using the library of babel
- To keep this version controlled I created a git repo for this
- To test this out I have created a fake-data-analysis-project, which includes a local git repository
- In the .gitignore file I added the pattern * (a single asterisk) to ignore all subfolders and files
- If I want to add files to this I need to use git add -f thefile
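The ignore-everything approach can be tried out safely in a throwaway repository. The sketch below assumes git is installed; the temporary directory and the file names worklog.org and overview.org are placeholders I made up for illustration.

```shell
#!/bin/sh
# Sketch: ignore everything by default, then force-add only chosen files.
# The temporary directory and file names are placeholders.
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q

# A single asterisk ignores every file and subfolder,
# including the .gitignore file itself
printf '*\n' > .gitignore

echo "private notes" > worklog.org
echo "overview" > overview.org

# Prints nothing: every file is ignored, so none is reported as untracked
git status --short

# Force-add only the file that should be version controlled
git add -f overview.org

# Lists the files git now knows about: only overview.org
git ls-files
```

The advantage of this design is that nothing gets committed by accident: every tracked file is the result of a deliberate git add -f.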
- Then I create a public github repo in the results folder (I named the repo: THE-PROJECT-NAME-results)
$ cd ~/projects/fake-data-analysis-project
$ mkdir results
$ cd results/
/results$ git init
Initialized empty Git repository in /home/ivan_hanigan/tools/ReproducibleResearchPipelineTemplate/results/.git/
/results$ mkdir 2015-12-20-eda
/results$ git remote add origin git@github.com:ivanhanigan/ReproducibleResearchPipelineTemplate-results.git
# at least one file must be added and committed before the push will work
# (git does not track empty folders)
/results$ git push -u origin master
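The relationship between the two repositories can be checked locally without touching github. This is a sketch under stated assumptions: it uses a temporary directory, a made-up README file as the only result, and it does not add a remote or push anything.

```shell
#!/bin/sh
# Sketch: a public results repo nested inside a private project repo.
# All paths and file names are placeholders; no remote is contacted.
set -eu

project="$(mktemp -d)/fake-data-analysis-project"
mkdir -p "$project/results/2015-12-20-eda"

# Outer (private) repo: ignores everything unless force-added
git -C "$project" init -q
printf '*\n' > "$project/.gitignore"

# Inner (public) repo: lives in results/ and keeps its own history
git -C "$project/results" init -q
echo "summary table goes here" > "$project/results/2015-12-20-eda/README"

# The outer .gitignore sits outside the inner work tree, so no -f is needed
git -C "$project/results" add 2015-12-20-eda/README

# The inner repo lists the result file; the outer repo lists nothing
git -C "$project/results" ls-files
git -C "$project" ls-files
```

Because the outer repo ignores the results folder along with everything else, the public history and the private history never mix.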
The Result
- An example of these results is now published at https://github.com/ivanhanigan/ReproducibleResearchPipelineTemplate-results
- But the rest of my work is privately held, and version controlled.