This post introduces Josh Reich’s LCFD framework, a simple and effective data management framework for analysis projects, named for its load, clean, func and do scripts. It was first described in this Stack Overflow answer: http://stackoverflow.com/a/1434424, and is encoded in the makeProject R package: http://cran.r-project.org/web/packages/makeProject/makeProject.pdf.
Literature Review Approach
This series of three posts summarises some of the most useful advice I have found, based on my experience of implementing it in my own work.
This is the second post in the series, which covers some evidence-based best-practice approaches I have reviewed. I have read many website articles and blog posts on a variety of approaches to organising digital assets in a reproducible research pipeline. The material I’ve gathered in my ongoing search and opportunistic reading has been recommended by practitioners, which provides some weight of evidence. In addition, I have implemented some aspects of these techniques, and the reproducibility of my own work has improved greatly.
Digital Assets Management for Reproducible Research
The digital assets in a reproducible research pipeline include:
- Publication material (documents, figures, tables, literature)
- Data (raw measurements, data provided, data derived)
- Code (pre-processing, analysis and presentation)
How to use the makeProject package
- The makeProject R package is designed to create a folder and some R scripts that are useful for generic workflow tasks.
- The theory is very similar to the approach described in the previous post about Scott Long’s batch script: wfsetupsingle.bat https://ivanhanigan.github.com/2015/09/reproducible-research-and-managing-digital-assets
Code:
# choose your project dir
setwd("~/projects")
library(makeProject)
makeProject("makeProjectDemo")
# returns
# "Creating Directories ...
# Creating Code Files ...
# Complete ..."
matrix(dir("makeProjectDemo"))
#[1,] "code"
#[2,] "data"
#[3,] "DESCRIPTION"
#[4,] "main.R"
- This has set up some simple and sensible tools for a data analysis.
- Let’s have a look at the main.R script. This is the one file that is used to run all the modules of the project, found in the R scripts in the code folder.
Code:
# Project: makeProjectDemo
# Author: Your Name
# Maintainer: Who to complain to <yourfault@somewhere.net>
# This is the main file for the project
# It should do very little except call the other files
### Set the working directory
setwd("/home/ivan_hanigan/projects/makeProjectDemo")
### Set any global variables here
####################
####################
### Run the code
source("code/load.R")
source("code/clean.R")
source("code/func.R")
source("code/do.R")
I think that is very self-explanatory, but it does need some demonstration. The next instalment in this three-part series will describe the ProjectTemplate approach. After that I will demonstrate ways that each of the three approaches can be used.
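In the meantime, here is a minimal sketch of how the four scripts called by main.R might divide the work. It uses base R only and builds a throwaway project in a temporary directory rather than calling makeProject, so the folder name lcfdDemo, the toy data in raw.csv and the helper summarise_value are all hypothetical and invented for illustration. The contents of each script are my assumption about a sensible split, not what the package generates.

```r
# Hypothetical stand-in for a makeProject-style layout, built in a temp
# directory so nothing is written to your real projects folder.
proj <- file.path(tempdir(), "lcfdDemo")
dir.create(file.path(proj, "code"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(proj, "data"), showWarnings = FALSE)

# Toy raw data (invented): one missing value for the clean step to handle.
write.csv(data.frame(id = 1:4, value = c(10, NA, 30, 40)),
          file.path(proj, "data", "raw.csv"), row.names = FALSE)

# load.R: read the raw data and do nothing else.
writeLines('raw <- read.csv("data/raw.csv")',
           file.path(proj, "code", "load.R"))

# clean.R: tidy the loaded data (here, drop rows with a missing value).
writeLines('clean <- raw[!is.na(raw$value), ]',
           file.path(proj, "code", "clean.R"))

# func.R: define the functions the analysis needs, without running them.
writeLines('summarise_value <- function(d) mean(d$value)',
           file.path(proj, "code", "func.R"))

# do.R: run the analysis using the cleaned data and the functions.
writeLines('result <- summarise_value(clean)',
           file.path(proj, "code", "do.R"))

# Equivalent of main.R: set the working directory, then source each
# module in load, clean, func, do order.
old_wd <- setwd(proj)
for (step in c("load", "clean", "func", "do")) {
  source(file.path("code", paste0(step, ".R")))
}
setwd(old_wd)

print(result)
```

The point of the split is that each file has one job, so when the raw data changes only load.R and clean.R need attention, and main.R stays a short list of source() calls.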