Recap of this series of three posts
- The first post showed the files and folders recommended for a data analysis project by Scott Long
- That recommendation was pretty complex, with a few folders that did not jump out at me as super-useful
- The second post showed a very simple template from the R community called makeProject
- I really like that one as it seems to be the minimum amount of structure needed to make things work.
The ProjectTemplate framework
- I have been using John Myles White's ProjectTemplate R package http://projecttemplate.net/ for ages
- I really like the ease with which I can get a new project up and running
- and the ease with which I can pick up an old project and start adding new work
Quote from John’s first post
My inspiration for this approach comes from the rails command from
Ruby on Rails, which initializes a new Rails project with the proper
skeletal structure automatically. Also taken from Rails is
ProjectTemplate’s approach of preferring convention over
configuration: the automatic data and library loading as well as the
automatic testing work out of the box because assumptions are made
about the directory structure and naming conventions that will be used.
http://www.johnmyleswhite.com/notebook/2010/08/26/projecttemplate/
- I don't know anything about RoR but this philosophy works really well for my R programming too
R Code
if (!require(ProjectTemplate)) install.packages("ProjectTemplate"); require(ProjectTemplate)
setwd("~/projects")
create.project("my-project")
setwd("my-project")
dir()
## [1] "cache" "config" "data" "diagnostics" "doc"
## [6] "graphs" "lib" "logs" "munge" "profiling"
## [11] "README" "reports" "src" "tests" "TODO"
##### these are very sensible default directories to create a modular
##### analysis workflow. See the project homepage for descriptions
# now all you need to do whenever you start a new day
load.project()
# and your workspace will be recreated and any new data automagically analysed in
# the manner you want
Advanced usage of ProjectTemplate
ProjectTemplate Demo
Table of Contents
- 1 The Compendium concept
- 2 The R code that produced this report
- 3 Initialise R environment
- 4 ProjectTemplate
- 5 Why?
- 6 The Reichian load, clean, func, do approach
- 7 The Peng NMMAPSlite approach
- 8 Init the project
- 9 dir()
- 10 The reports directory
- 11 Do the analysis
- 12 Get the ProjectTemplate tutorial data
- 13 Tools
- 14 Load the analysis data
- 15 Check the analysis data
- 16 Develop munge code
- 17 To munge or not to munge?
- 18 Cache
- 19 Plot first and second letter counts
- 20 Do generate plots
- 21 First letter
- 22 Second letter
- 23 Report results
- 24 Produce final report
- 25 Personalised project management directories
1 The Compendium concept
My goal is to develop data analysis projects along the lines of the Compendium concept of Gentleman and Temple Lang (2007) \cite{Gentleman2007}. Compendia are dynamic documents containing text, code and data. Transformations are applied to the compendium to view its various aspects:
- Code Extraction (Tangle): source code
- Export (Weave): LaTeX, HTML, etc
- Code Evaluation
I'm also following the Orgmode technique of Schulte et al (2012) \cite{Schulte}.
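To make the tangle/weave idea concrete, here is a minimal Orgmode source block (a sketch using standard Org Babel header arguments, not taken from Schulte's paper): the :tangle header extracts the R code to a file (tangle), :exports both weaves the code and its results into the exported report, and evaluating the block with C-c C-c runs the code in place.

#+begin_src R :tangle src/example.R :exports both
  # a trivial chunk, just to show the mechanics
  summary(rnorm(100))
#+end_src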
2 The R code that produced this report
I support the philosophy of Reproducible Research http://www.sciencemag.org/content/334/6060/1226.full, and where possible I provide data and code in the statistical software R that will allow analyses to be reproduced. This document is prepared automatically from the associated Emacs Orgmode file. If you do not have access to the Orgmode file please contact me.
cat('
#######################################################################
## The R code is free software; please cite this paper as the source.
## Copyright 2012, Ivan C Hanigan <ivan.hanigan@gmail.com>
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2 of the License, or
## (at your option) any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
## Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
## Boston, MA 02110-1301, USA
#######################################################################
')
3 Initialise R environment
####
# MAKE SURE YOU HAVE THE CORE LIBS
if (!require(ProjectTemplate)) install.packages('ProjectTemplate', repos = 'http://cran.csiro.au'); require(ProjectTemplate)
if (!require(lubridate)) install.packages('lubridate', repos = 'http://cran.csiro.au'); require(lubridate)
if (!require(reshape)) install.packages('reshape', repos = 'http://cran.csiro.au'); require(reshape)
if (!require(plyr)) install.packages('plyr', repos = 'http://cran.csiro.au'); require(plyr)
if (!require(ggplot2)) install.packages('ggplot2', repos = 'http://cran.csiro.au'); require(ggplot2)
if (!require(mgcv)) install.packages('mgcv', repos = 'http://cran.csiro.au'); require(mgcv); require(splines)
if (!require(NMMAPSlite)) install.packages('NMMAPSlite', repos = 'http://cran.csiro.au'); require(NMMAPSlite)
rootdir <- getwd()
4 ProjectTemplate
This is a simple demo of the R package \emph{ProjectTemplate} http://projecttemplate.net/, which is aimed at standardising the structure and general development of data analysis projects in R. A primary aim is to allow analysts to quickly get a project loaded up and ready to:
- reproduce or
- create new data analyses.
5 Why?
It has been recognised on the R blogosphere that ProjectTemplate
- is ``meant to handle very complex research projects'' (http://bryer.org/2012/maker-an-r-package-for-managing-document-building-and-versioning) and
- is considered amongst the best approaches to the workflow for doing data analysis with R (http://blog.revolutionanalytics.com/2010/10/a-workflow-for-r.html)
6 The Reichian load, clean, func, do approach
The already mentioned blog post http://blog.revolutionanalytics.com/2010/10/a-workflow-for-r.html also links to another `best' approach, the:
- \emph{Reichian load, clean, func, do} approach http://stackoverflow.com/a/1434424, by Josh Reich.
I've also followed the tutorial and used the data from the package website http://projecttemplate.net/getting_started.html to prepare this demo.
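For readers who haven't seen it, the idea is to split an analysis into four small scripts. A minimal sketch of the pattern (the file names come from Reich's Stack Overflow answer; the contents, including the data/raw.csv input file, are my own illustrative assumptions):

# load.R -- read the raw data (data/raw.csv is a hypothetical input)
df <- read.csv('data/raw.csv')

# clean.R -- tidy the data, e.g. drop incomplete rows
df <- df[complete.cases(df), ]

# func.R -- define the functions the analysis will use
summarise_x <- function(d) mean(d$x)

# do.R -- run the actual analysis
source('load.R'); source('clean.R'); source('func.R')
summarise_x(df)

ProjectTemplate's data, munge and src directories map quite naturally onto these four roles.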
7 The Peng NMMAPSlite approach
The other approach I followed was that of Roger Peng from Johns Hopkins and his NMMAPSlite R package \cite{Peng2004}, especially the function
readCity(name, collapseAge = FALSE, asDataFrame = TRUE)
Arguments
- name character, abbreviated name of a city
- collapseAge logical, should age categories be collapsed?
- asDataFrame logical, should a data frame be returned?
Description: Provides remote access to daily mortality, weather, and air pollution data from the National Morbidity, Mortality, and Air Pollution Study for 108 U.S. cities (1987–2000); data are obtained from the Internet-based Health and Air Pollution Surveillance System (iHAPSS)
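A quick sketch of how I call it (this assumes the iHAPSS server is reachable; 'ny' is the abbreviated name for New York, and initDB() sets up the local cache directory as described in the package documentation):

####
# sketch: fetch one city's NMMAPS data
require(NMMAPSlite)
initDB('NMMAPS')  # local directory where downloaded data are cached
ny <- readCity('ny', collapseAge = TRUE, asDataFrame = TRUE)
str(ny)           # daily mortality, weather and pollution series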
8 Init the project
First we want to initialise the project directory.
####
# init
require('ProjectTemplate')
create.project('analysis', minimal = TRUE)
9 dir()
####
# init dir
dir('analysis')
## [1] "cache"  "config" "data"   "munge"  "README" "src"
10 The reports directory
I've added the reports directory manually and asked the package author whether it is generic enough to be included in the defaults for
minimal = TRUE
I believe it may be, as the \emph{Getting Started} guidebook states:
``It's meant to contain the sort of written descriptions of the results of your analyses that you'd \textbf{publish in a scientific paper}.''
``With that report written …, we've gone through \textbf{the simplest sort of analysis you might run with ProjectTemplate}.''
####
# init reports
dir.create('analysis/reports')
11 Do the analysis
We now do the analysis using the \emph{load, clean, func, do} pattern.
####
# this is the start of the analysis;
# assumes the init.r file has been run
if (file.exists('analysis')) setwd('analysis')
Sys.Date()  # keep a track of the dates the analysis is rerun
getwd()     # we may want to keep a reference to the directory the
            # project is in so we can track the history
12 Get the ProjectTemplate tutorial data
Get the data from http://projecttemplate.net/letters.csv.bz2 (I downloaded it on 13-4-2012) and put it in the data directory for automatic loading.
####
# analysis get tutorial data
download.file('http://projecttemplate.net/letters.csv.bz2',
              destfile = 'data/letters.csv.bz2', mode = 'wb')
13 Tools
Edit the \emph{config/global.dcf} file to make sure that the load_libraries setting is turned on.
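For reference, the relevant lines of my config/global.dcf looked something like this (the field names are from the version of ProjectTemplate I used; the available settings may differ between versions, so check your own file):

data_loading: on
load_libraries: on
libraries: reshape, plyr, ggplot2
munging: on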
14 Load the analysis data
(This is the \emph{load} step.)
####
# analysis load
require(ProjectTemplate)
load.project()
15 Check the analysis data
(This is the \emph{clean} step.)
tail(letters)
| zyryan     | z | y |
| zythem     | z | y |
| zythia     | z | y |
| zythum     | z | y |
| zyzomys    | z | y |
| zyzzogeton | z | y |
16 Develop munge code
(This is the \emph{load with processing}, or munge, step.)
Edit the \emph{munge/01-A.R} script so that it contains the following two lines of code:
# For our current analysis, we're interested in the total
# number of occurrences of each letter in the first and
# second letter positions and not in the words themselves.
# compute aggregates
first.letter.counts <- ddply(letters, c('FirstLetter'), nrow)
second.letter.counts <- ddply(letters, c('SecondLetter'), nrow)
Now if we run with
load.project()
all munging will happen automatically. However…
17 To munge or not to munge?
As you'll see on the website, once the data munging is complete and the outputs are cached, load.project() will keep recomputing the same work over and over. The author suggests we manually edit our configuration file.
# edit the config file and turn munge on
# load.project()
# edit the config file and turn munge off

# or, my preference:
source('munge/01-A.R')
# which can be included in our first analysis script, but subsequent
# analysis scripts can just call load.project() without touching the
# config file
18 Cache
Once munging is complete we cache the results:
cache('first.letter.counts')
cache('second.letter.counts')
# and keep an eye on the implications for our config file to avoid
# re-calculating these next time we load.project()
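On a later day a fresh R session can then skip the munging entirely (a sketch; it assumes munging has been switched off in config/global.dcf as discussed above):

####
# new R session, munging turned off in config/global.dcf
require('ProjectTemplate')
load.project()                # the aggregates come straight back from cache/
exists('first.letter.counts') # TRUE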
(The remaining sections are the \emph{do} step.)
19 Plot first and second letter counts
Produce some simple density plots to see the shape of the first and second letter counts.
- Create \emph{src/generate_plots.R}. Use the src directory to store any analyses that you run.
- The convention is that every analysis script starts with load.project() and then goes on to do something original with the data.
20 Do generate plots
Write the first analysis script into a file in \textbf{src}
require('ProjectTemplate')
load.project()
plot1 <- ggplot(first.letter.counts, aes(x = V1)) + geom_density()
ggsave(file.path('reports', 'plot1.pdf'), plot1)
plot2 <- ggplot(second.letter.counts, aes(x = V1)) + geom_density()
ggsave(file.path('reports', 'plot2.pdf'), plot2)
# note: ggsave() opens and closes its own device, so no dev.off() is needed
And now run it (I do this from a main `overview' script).
source('src/generate_plots.R')
21 First letter
[Figure: reports/plot1.pdf, density of first letter counts]
22 Second letter
[Figure: reports/plot2.pdf, density of second letter counts]
23 Report results
We see that both the first and second letter distributions are very skewed. To make a note of this for posterity, we can write up our discovery in a text file stored in the reports directory.
\documentclass[a4paper]{article}
\title{Letters analysis}
\author{Ivan Hanigan}
\begin{document}
\maketitle
blah blah blah
\end{document}
24 Produce final report
# now run LaTeX on the file in reports/letters.tex
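One way to do that from within R (a sketch; it assumes a LaTeX installation with pdflatex on the PATH, and uses tools::texi2pdf, which ships with base R):

####
# produce final report
# note: texi2pdf writes the PDF into the current working directory
tools::texi2pdf(file.path('reports', 'letters.tex'), clean = TRUE)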
25 Personalised project management directories
####
# init additional directories for project management
analysisTemplate()
dir()
##  [1] "admin"                   "analysis"
##  [3] "data"                    "document"
##  [5] "init.r"                  "metadata"
##  [7] "ProjectTemplateDemo.org" "references"
##  [9] "tools"                   "versions"
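Note that analysisTemplate() is a personal helper function of mine, not part of ProjectTemplate. A minimal sketch of what such a function might look like (a hypothetical reconstruction; the directory names are taken from the listing above):

# hypothetical sketch of a personal analysisTemplate() helper;
# it creates extra top-level project-management directories
# around the ProjectTemplate 'analysis' directory
analysisTemplate <- function(dirs = c('admin', 'data', 'document',
                                      'metadata', 'references',
                                      'tools', 'versions')) {
  for (d in dirs) {
    if (!file.exists(d)) dir.create(d)
  }
}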