Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed under CC-BY. Find out more here.


The Shane-Weiss-Reich-White.worg approach to Code Management

Introduction

I’ve been thinking a lot about workflows recently. I’m talking about the data, code, decisions etc. bound up in the flow of material through any project in the collective program of work at the Centre where I work. The group is facing tough questions about how we do things, and why. So in my reflections I’ve reviewed some links I’d saved, and I present below a unified summary called the…

Shane-Weiss-Reich-White.worg approach

This is a synthesis I’ve put together of approaches to managing code in complex data analysis projects. It’s named after key exponents on various blogs, wikis and web Q-and-A sites.

Stack Overflow user Shane posted this excellent advice there, which is to:

“start off with one R file as you start a project (or a set of files like in the Bernd Weiss and Josh Reich examples), and progressively add to it (so that it grows in size) as you make discoveries.”

Bernd Weiss’ projects have:

  • analysis,
  • data and
  • document directories and
  • README.org (an Emacs org-mode file).

Bernd and Jeromy Anglim had an interesting discussion about this workflow in this post at Stack Exchange. Note especially that Bernd recommends that every publication, presentation, semester/class etc. has its own git repository, BUT he warns that “there is one real downside: using the same dataset in different publications means to maintain different versions of ‘initialization code’ (define missing values, generate new variables etc.). To overcome this problem, Bernd decided to maintain ONE study/dataset-related repository which contains the original init-file. For each publication, presentation etc. use a copy of the original data-file as well as of the init-file (in R via file.copy()). Of course, whenever you create a new variable you’ll need to modify the original init-file and do a file.copy() (which is the most annoying part of the approach).”
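
Here is a minimal sketch of that copy step, just to make the idea concrete. The directory names (study, paper-01) and file names (init.R, data.csv) are my own placeholders, not Bernd’s, and everything is written to a temporary directory so it is safe to run.

```r
# Hypothetical locations: a study-level repository holding the master copies,
# and one publication repository that receives copies of them.
study_dir <- file.path(tempdir(), "study")
paper_dir <- file.path(tempdir(), "paper-01")
dir.create(study_dir, showWarnings = FALSE)
dir.create(paper_dir, showWarnings = FALSE)

# Stand-ins for the original init-file and data-file (names are assumptions).
writeLines("dat$age[dat$age == -99] <- NA", file.path(study_dir, "init.R"))
write.csv(data.frame(id = 1:3, age = c(34, -99, 51)),
          file.path(study_dir, "data.csv"), row.names = FALSE)

# Copy both files into the publication repository. overwrite = TRUE refreshes
# the copies whenever the master init-file has been modified.
copied <- file.copy(
  from = file.path(study_dir, c("init.R", "data.csv")),
  to = paper_dir,
  overwrite = TRUE
)
```

file.copy() returns one logical value per file, so checking all(copied) is a cheap way to confirm that the publication repository really did get fresh copies.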

Josh Reich breaks projects into 4 pieces:

  • load.R,
  • clean.R,
  • func.R and
  • do.R
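
To show how the four pieces might fit together, here is a small sketch. The contents of each file are invented for illustration (I’m reading the roles off the file names: load data, clean it, define functions, do the analysis), and the driver loop is my own shortcut rather than anything Josh prescribes.

```r
# Write four tiny stand-in scripts to a temporary project directory.
proj <- file.path(tempdir(), "reich-demo")
dir.create(proj, showWarnings = FALSE)

writeLines("raw <- data.frame(x = c(1, 2, NA, 4))", file.path(proj, "load.R"))
writeLines("clean <- raw[!is.na(raw$x), , drop = FALSE]", file.path(proj, "clean.R"))
writeLines("mean_x <- function(d) mean(d$x)", file.path(proj, "func.R"))
writeLines("result <- mean_x(clean)", file.path(proj, "do.R"))

# Run the scripts in order inside their own environment, so the objects they
# create do not clutter the global workspace.
env <- new.env()
for (f in c("load.R", "clean.R", "func.R", "do.R")) {
  source(file.path(proj, f), local = env)
}
env$result
```

The order matters: do.R only works because the three scripts before it have already created raw, clean and mean_x.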

John Myles White leads the ProjectTemplate package, whose ‘create.project(minimal = TRUE)’ creates the layout:

  • cache,
  • config,
  • data,
  • munge,
  • src, and
  • README

To this I’ve just added a reports directory. If a project is a little bit bigger than minimal I’ll add admin, metadata, versions etc. I contributed that idea to the ProjectTemplate discussion list… but those guys seem to mostly use the default minimal = FALSE, which creates all the possible directories including reports. I’ll try to keep it simple and just bolt on whatever bits suit my needs as I go.
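
As a rough illustration of the bolt-on idea, the sketch below builds the minimal layout plus reports using only base R. It does not call ProjectTemplate at all; it just mimics the directory names listed above, and the project location is a placeholder in a temporary directory.

```r
# Directory names from the minimal layout, plus a "reports" bolt-on.
minimal_dirs <- c("cache", "config", "data", "munge", "src")
extra_dirs <- c("reports")

# Hypothetical project location; swap in a real path for actual use.
proj <- file.path(tempdir(), "my-project")
for (d in c(minimal_dirs, extra_dirs)) {
  dir.create(file.path(proj, d), recursive = TRUE, showWarnings = FALSE)
}

# An empty README to complete the layout.
file.create(file.path(proj, "README"))
list.files(proj)
```

Adding admin, metadata or versions later is then just a matter of appending names to extra_dirs and re-running the loop, since dir.create() leaves existing directories alone.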

Which Code Editor is the Best?

Finally, the meta-level tool holding the project together is the code editor. Despite the old joke which describes Emacs as “a great operating system, lacking only a decent editor”, this editor has killer functions for managing code. Check out worg, the Emacs Org-Mode community. Recently proponents of worg wrote this article. Previously I’ve REALLY enjoyed NPPtoR (only available under Windows).

In the words of JD Long, in response to Shane: “The choice of the specific tool is more idiosyncratic and not near as important as using SOMETHING.”


