
Reproducible Research and Managing Digital Assets, Part 3 of 3: ProjectTemplate is appropriate for large-scale projects

Recap on this series of three posts

  • The first post showed the recommended files and folders for a data analysis project from Scott Long
  • That recommendation was pretty complex, with a few folders that I felt did not jump out as super-useful
  • The second post showed a very simple template from the R community called makeProject
  • I really like that one as it seems to be the minimum amount of stuff needed to make things work.

The ProjectTemplate framework

  • I have been using John Myles White's ProjectTemplate R package http://projecttemplate.net/ for ages
  • I really like the ease with which I can get a new project up and running
  • and the ease with which I can pick up an old project and start adding new work

Quote from John’s first post

My inspiration for this approach comes from the rails command from
Ruby on Rails, which initializes a new Rails project with the proper
skeletal structure automatically. Also taken from Rails is
ProjectTemplate’s approach of preferring convention over
configuration: the automatic data and library loading as well as the
automatic testing work out of the box because assumptions are made
about the directory structure and naming conventions that will be used.

http://www.johnmyleswhite.com/notebook/2010/08/26/projecttemplate/

  • I don't know anything about RoR, but this philosophy works really well for my R programming too

R Code

if (!require(ProjectTemplate)) install.packages('ProjectTemplate'); require(ProjectTemplate)
setwd('~/projects')
create.project('my-project')
setwd('my-project')
dir()
##  [1] "cache"       "config"      "data"        "diagnostics" "doc"        
##  [6] "graphs"      "lib"         "logs"        "munge"       "profiling"  
## [11] "README"      "reports"     "src"         "tests"       "TODO"   
##### these are very sensible default directories to create a modular
##### analysis workflow.  See the project homepage for descriptions
 
# now all you need to do whenever you start a new day 
load.project()
# and your workspace will be recreated and any new data automagically analysed in
# the manner you want

Advanced usage of ProjectTemplate

ProjectTemplate Demo

1 The Compendium concept

My goal is to develop data analysis projects along the lines of the Compendium concept of Gentleman and Temple Lang (2007) \cite{Gentleman2007}. Compendia are dynamic documents containing text, code and data. Transformations are applied to the compendium to view its various aspects:

  • Code Extraction (Tangle): source code
  • Export (Weave): LaTeX, HTML, etc
  • Code Evaluation

I'm also following the Orgmode technique of Schulte et al (2012) \cite{Schulte}.
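To make the Tangle/Weave idea concrete, here is a hedged sketch of what an R code block looks like inside an Orgmode file (the :tangle and :exports header arguments are standard org-babel; the file name is illustrative only):

#+begin_src R :tangle src/example.R :exports both
  # Tangle extracts this code to src/example.R;
  # Weave exports the code and its results to LaTeX or HTML
  summary(cars)
#+end_src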

2 The R code that produced this report

I support the philosophy of Reproducible Research http://www.sciencemag.org/content/334/6060/1226.full, and where possible I provide data and code in the statistical software R that will allow analyses to be reproduced. This document is prepared automatically from the associated Emacs Orgmode file. If you do not have access to the Orgmode file please contact me.

cat('
 #######################################################################
 ## The R code is free software; please cite this paper as the source.  
 ## Copyright 2012, Ivan C Hanigan <ivan.hanigan@gmail.com> 
 ## This program is free software; you can redistribute it and/or modify
 ## it under the terms of the GNU General Public License as published by
 ## the Free Software Foundation; either version 2 of the License, or
 ## (at your option) any later version.
 ## 
 ## This program is distributed in the hope that it will be useful,
 ## but WITHOUT ANY WARRANTY; without even the implied warranty of
 ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 ## GNU General Public License for more details.
 ## You should have received a copy of the GNU General Public License
 ## along with this program; if not, write to the Free Software
 ## Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
 ## 02110-1301, USA
 #######################################################################
')

3 Initialise R environment

####
# MAKE SURE YOU HAVE THE CORE LIBS
if (!require(ProjectTemplate)) install.packages('ProjectTemplate', repos='http://cran.csiro.au'); require(ProjectTemplate)
if (!require(lubridate)) install.packages('lubridate', repos='http://cran.csiro.au'); require(lubridate)
if (!require(reshape)) install.packages('reshape', repos='http://cran.csiro.au'); require(reshape)
if (!require(plyr)) install.packages('plyr', repos='http://cran.csiro.au'); require(plyr)
if (!require(ggplot2)) install.packages('ggplot2', repos='http://cran.csiro.au'); require(ggplot2)
if (!require(mgcv)) install.packages('mgcv', repos='http://cran.csiro.au'); require(mgcv)
require(splines)
if (!require(NMMAPSlite)) install.packages('NMMAPSlite', repos='http://cran.csiro.au'); require(NMMAPSlite)
rootdir <- getwd()  

4 ProjectTemplate

This is a simple demo of the R package ProjectTemplate http://projecttemplate.net/ which is aimed at standardising the structure and general development of data analysis projects in R. A primary aim is to allow analysts to quickly get a project loaded up and ready to:

  • reproduce or
  • create new data analyses.

5 Why?

It has been recognised on the R blogosphere that a standard, convention-based project structure makes analyses easier to reproduce and to pick up again later; see for example the Revolution Analytics post 'A workflow for R' http://blog.revolutionanalytics.com/2010/10/a-workflow-for-r.html.

6 The Reichian load, clean, func, do approach

The blog post mentioned above also links to another 'best practice' approach: the load, clean, func, do workflow by Josh Reich. I've also followed the tutorial and data from the package website http://projecttemplate.net/getting_started.html to prepare this demo.
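For orientation, a minimal sketch of that layout (the four file names follow Reich's well-known Stack Overflow answer; the master script is my own illustration):

# load.R  reads the raw data
# clean.R tidies it
# func.R  defines the analysis functions
# do.R    runs the analysis and saves the results
source('load.R')
source('clean.R')
source('func.R')
source('do.R')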

7 The Peng NMMAPSlite approach

The other approach I followed was that of Roger Peng from Johns Hopkins and his NMMAPSlite R package \cite{Peng2004}, especially the function

readCity(name, collapseAge = FALSE, asDataFrame = TRUE)

Arguments

  • name: character, abbreviated name of a city
  • collapseAge: logical, should age categories be collapsed?
  • asDataFrame: logical, should a data frame be returned?

Description: Provides remote access to daily mortality, weather, and air pollution data from the National Morbidity, Mortality, and Air Pollution Study for 108 U.S. cities (1987–2000); data are obtained from the Internet-based Health and Air Pollution Surveillance System (iHAPSS)
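As a hedged usage sketch (initDB() sets up the local cache directory that NMMAPSlite downloads into; the city abbreviation and column names below are illustrative and can be checked with listCities() and names()):

require(NMMAPSlite)
initDB()                    # register the local cache for downloaded data
listCities()                # see the available city abbreviations
ny <- readCity('ny', collapseAge = TRUE)  # e.g. New York, ages collapsed
head(ny[, c('date', 'death', 'tmpd')])    # daily deaths and temperature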

8 Init the project

First we want to initialise the project directory.

####
# init
require('ProjectTemplate')
create.project('analysis', minimal = TRUE)

9 dir()

####
# init dir
dir('analysis')
## [1] "cache"  "config" "data"   "munge"  "README" "src"

10 The reports directory

I've added the reports directory manually, and I have asked the package author if it is generic enough to be in the defaults for

minimal = TRUE 

I believe it may be, as the Getting Started guide states:

'It's meant to contain the sort of written descriptions of the results of your analyses that you'd publish in a scientific paper.'

'With that report written …, we've gone through the simplest sort of analysis you might run with ProjectTemplate.'

####
# init reports
dir.create('analysis/reports')

11 Do the analysis: use load, clean, func, do

####
# this is the start of the analysis, 
# assumes the init.r file has been run
if(file.exists('analysis')) setwd('analysis')  
Sys.Date()
# keep a track of the dates the analysis is rerun
getwd()
# may want to keep a reference of the directory 
# the project is in so we can track the history 

12 Get the ProjectTemplate tutorial data

Get the data from http://projecttemplate.net/letters.csv.bz2 (I downloaded it on 13 April 2012) and put it in the data directory so it is loaded automatically.

####
# analysis get tutorial data
download.file('http://projecttemplate.net/letters.csv.bz2', 
  destfile = 'data/letters.csv.bz2', mode = 'wb')

13 Tools

Edit the config/global.dcf file to make sure that the load_libraries setting is turned on.
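For reference, global.dcf is a plain-text file of key: value pairs. A hedged sketch of the relevant settings (your file may contain more fields; the library list is just what this demo needs):

data_loading: on
munging: on
load_libraries: on
libraries: reshape, plyr, ggplot2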

14 Load the analysis data


####
# analysis load
require(ProjectTemplate)
load.project()

15 Check the analysis data


tail(letters)
##           Word FirstLetter SecondLetter
##         zyryan           z            y
##         zythem           z            y
##         zythia           z            y
##         zythum           z            y
##        zyzomys           z            y
##     zyzzogeton           z            y

16 Develop munge code


Edit the munge/01-A.R script so that it contains the following two lines of code:

# For our current analysis, we're interested in the total 
# number of occurrences of each letter in the first and 
# second letter positions and not in the words themselves.
# compute aggregates
first.letter.counts <- ddply(letters, c('FirstLetter'), 
  nrow)
second.letter.counts <- ddply(letters, c('SecondLetter'), 
  nrow)

Now if we run with

load.project()

all munging will happen automatically. However…

17 To munge or not to munge?

As you'll see on the website, once the data munging is completed and its outputs cached, load.project() will keep re-running the munge scripts and recomputing the same results over and over. The author suggests we manually edit our configuration file to turn munging off.

 # edit the config file and turn munge on
 # load.project()
 # edit the config file and turn munge off
 # or my preference
 source('munge/01-A.R')
# which can be included in our first analysis script
# but subsequent analysis scripts can just call load.project() 
# without touching the config file

18 Cache

Once munging is complete we cache the results:

cache('first.letter.counts')
cache('second.letter.counts')

# and keep an eye on the config file settings to avoid re-calculating these the next time we call

load.project()


19 Plot first and second letter counts

Produce some simple density plots to see the shape of the first and second letter counts.

  • Create src/generate_plots.R. Use the src directory to store any analyses that you run.
  • The convention is that every analysis script starts with load.project() and then goes on to do something original with the data.

20 Do generate plots

Write the first analysis script into a file in src:

require('ProjectTemplate')
load.project()
plot1 <- ggplot(first.letter.counts, aes(x = V1)) + 
  geom_density()
# pass the plot explicitly so ggsave() does not rely on last_plot()
ggsave(file.path('reports', 'plot1.pdf'), plot = plot1)

plot2 <- ggplot(second.letter.counts, aes(x = V1)) + 
  geom_density()
ggsave(file.path('reports', 'plot2.pdf'), plot = plot2)

And now run it (I do this from a main 'overview' script).

source('src/generate_plots.R')

21 First letter

[Figure: density plot of the first letter counts, reports/plot1.pdf]

22 Second letter

[Figure: density plot of the second letter counts, reports/plot2.pdf]

23 Report results

We see that both the first and second letter distributions are very skewed. To make a note of this for posterity, we can write up our discovery in a file stored in the reports directory (here, reports/letters.tex):

\documentclass[a4paper]{article}
\title{Letters analysis}
\author{Ivan Hanigan}
\begin{document}
\maketitle
blah blah blah
\end{document}


24 Produce final report

# now run LaTeX on the file in reports/letters.tex
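A minimal sketch of doing that from within R, assuming pdflatex is on your PATH:

####
# produce final report
system('pdflatex -output-directory reports reports/letters.tex')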

25 Personalised project management directories


####
# init additional directories for project management
analysisTemplate()
dir()
##  [1] "admin"                   "analysis"                "data"
##  [4] "document"                "init.r"                  "metadata"
##  [7] "ProjectTemplateDemo.org" "references"              "tools"
## [10] "versions"
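Note that analysisTemplate() is my own helper, not part of ProjectTemplate. A hypothetical sketch of what such a function might do, based only on the directory listing above:

analysisTemplate <- function() {
  # create project-management directories alongside the
  # ProjectTemplate 'analysis' directory created earlier
  dirs <- c('admin', 'data', 'document', 'metadata',
            'references', 'tools', 'versions')
  for (d in dirs) dir.create(d, showWarnings = FALSE)
}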
