Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed under CC-BY.


Tracking a data analysis pipeline

I have just uploaded a new version of the Windows build of my ‘disentangle’ package. The blurb from the draft vignette is below.

Introduction

It can be much easier to understand a complicated data analysis pipeline conceptually than to implement that pipeline effectively. This report outlines the use of the ‘disentangle’ R package, available from http://ivanhanigan.github.io/projects.html. The package contains functions developed to help data analysts map out all aspects of their work when planning and conducting complicated data analyses using the pipeline concept. The design and analysis of a study often involve many steps, and putting these together as a data analysis pipeline addresses the challenge of reproducibility (Peng 2006). The credibility of a data analysis requires that every step can be scrutinised (Leek 2015).

Motivating scientific questions

The type of data analysis that is the focus of this work is more complicated than simply loading data that are already cleaned, fitting some models and reporting some output. The projects these tools are aimed at typically involve attempts to control for a large number of inter-relationships and associations between variables. It is especially problematic that the scientists must select these variables from a multitude of possible variables and a plethora of possible data sources, during a long process of data collection, cleaning, exploration and decision making in preparation for data analysis. There is also a multitude of steps and decision points in the process of model building and model checking. Statistical models involving many entangled environmental and social variables can easily produce spurious associations that may be mistakenly interpreted as causation. Projects that the author has been involved in include explorations of hypotheses about health effects from droughts, bushfire smoke, heat-waves and dust-storms, which produced novel findings and informed controversial debates about the implications of climate change. Adequately conveying the methods and results of this research was difficult, and this motivated the work on effective use of reproducible research techniques and data analysis pipelines.

Posted in  disentangle

