Disentangle Things by Ivan Hanigan

Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed with CC-BY.

A quick review of a quick guide to organizing computational biology projects

The organisation of material is a particularly vexatious topic. For a data analysis project it is very important that the set of folders and files is logical and intuitive, as well as well documented. The oft-heard exhortation by computer scientists to their users to ‘Read The F-ing Manual’ (RTFM) is perennial, and it persists because readers rarely have the time required to read and digest the detailed information therein.

In this post I review a paper that I was referred to by recent activity on the Mozillascience studyGroupLesson ‘Open Science Utility Belt’: https://github.com/mozillascience/studyGroupLessons/issues/7
That group also holds a journal club https://github.com/minisciencegirl/studyGroup/issues/20#issuecomment-134750483 and they reviewed this paper:
Noble, W. S. (2009). A quick guide to organizing computational biology projects. PLoS Computational Biology, 5(7), 1–5. http://dx.doi.org/10.1371/journal.pcbi.1000424

I missed the session, so I thought I’d put my notes up here for reference:

The core guiding principle is simple: ‘Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why’
Your future self may find it difficult to understand your current work.
Noble’s law: ‘Everything you do, you will probably have to do over again’
Store all of the files relevant to one project under a common root directory
The exception to this rule is data or code used in multiple projects; these become standalone projects of their own
Within a given project, use a top-level organization that is logical, a chronological organization at the next level, and a logical organization again below that
Core folders are data, results, doc (versus Berndt Weiss’ dat, ana, doc)
Why chronological order? It is ‘tempting to apply a similar, logical organization… this approach is risky, because the logical structure of your final set of experiments may look drastically different from the form you initially designed. This is particularly true under the results directory, where you may not even know in advance what kinds of experiments you will need to perform’

Recommended folder and file structures http://dx.doi.org/10.1371/journal.pcbi.1000424.g001

/projectname (eg msms)/
    /doc/
        /ms-analysis.html 
        /paper/
            /msms.tex
            /msms.pdf
    /data/
        /YYYY-MM-DD/
            /yeast/
                /README
                /yeast.sqt
            /worm/
                /README
                /worm.sqt
    /src/
        /ms-analysis.c
    /bin/
        /parse-sqt.py
    /results/
        /notebook.html 
        /YYYY-MM-DD-1/
            /runall
            /split1/
            /split2/
        /YYYY-MM-DD-2/
            /runall

Use a driver script to automate creation of the directory structure
Maintain a chronologically organized lab notebook (I have been calling this a work ‘log’ sensu Scott Long’s 2008 ‘Workflow book’)
Create either a README file or a command-line driver script (he calls this runall, but see also main.R sensu the Reichian LCFD model)
You should end up with a file that is parallel to the lab notebook entry. The lab notebook contains a prose description of the experiment, whereas the driver script contains all the gory details
Version control. ‘Nuff said! But how do I build capacity with GitHub when all my colleagues seem so confused by it?
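
To make the note about automating the directory structure concrete, here is a minimal sketch in Python. It is my own illustration, not code from Noble's paper: the function name create_project, the TOP_LEVEL list and the contents of the runall stub are all assumptions, and the numeric suffix used in the figure (YYYY-MM-DD-1) is left out for brevity.

```python
from datetime import date
from pathlib import Path

# Top-level folders taken from the recommended structure above.
TOP_LEVEL = ["doc", "data", "src", "bin", "results"]


def create_project(root, today=None):
    """Create the project skeleton and one dated results directory.

    Hypothetical helper: returns the path of the dated directory.
    """
    root = Path(root)
    today = today or date.today()

    # Logical organization at the top level.
    for name in TOP_LEVEL:
        (root / name).mkdir(parents=True, exist_ok=True)

    # Chronological organization one level down (YYYY-MM-DD).
    experiment = root / "results" / today.isoformat()
    experiment.mkdir(exist_ok=True)

    # Stub driver script; never overwrite an existing one.
    runall = experiment / "runall"
    if not runall.exists():
        runall.write_text(
            "#!/bin/sh\n"
            "# Record every command used for this experiment here.\n"
        )
        runall.chmod(0o755)
    return experiment
```

Calling create_project("msms") creates the five top-level folders under msms/ and a results folder named for today's date, containing an executable runall stub. Running it again on the same day is safe, because existing folders and an existing runall are left untouched.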