The scientific questions motivating my work concern the health effects of environmental changes. These include droughts, bushfires, woodsmoke, dust storms, heat waves and local environmental conditions. The research needed to disentangle the health effects of environmental changes from social factors. Some of the findings were novel and unexpected. Adequately documenting the methods was difficult because the data processing and analysis involved many steps. Reproducible research pipelines address the problem of documenting data analyses by distributing data and code with publications.
Reproducibility is needed to improve the credibility of research. It is often asserted in the literature that much research is not easy to reproduce. However, it is not clear how reproducible research pipelines can be implemented effectively. This thesis asks how such pipelines can be effectively implemented in epidemiology. It describes methods for reproducible research pipelines and demonstrates several applications of these methods in environmental epidemiology.
Environmental epidemiology requires us to study multifactorial pathogenesis, because all diseases have multiple causal factors. To understand the many factors affecting health, epidemiologists must disentangle the strands of a web of causal influences. Isolating individual factors is difficult and risks being overly reductionist, as these determinants can interact in complex ways. Environmental epidemiologists often narrow the focus to a single environmental cause and health effect. A simple example is bushfire smoke and its direct effects on cardio-respiratory disease. A more complex example is drought and suicide, where the effects are indirect and the focus is on a chain of intermediary causal factors. These questions are usually explored in the context of many other factors that describe human biological variables and the socio-economic milieu.
Although evidence from experimental studies is given greater weight than evidence from observational studies, experiments are difficult in environmental health, so analysis of observational data is often used instead. The principal problem facing observational studies is the large number of inter-relationships between variables, which can confound or modify effects. Including these confounders and effect modifiers is vital to a valid analysis and a meaningful interpretation. It is problematic that scientists must select variables from a multitude of possibilities found in the literature and gather them from a plethora of possible data sources. There is a long process of hypothesising, study design, data collection, cleaning, exploration, decision making, preparation, data analysis, model building and model checking. This process has been described as a vast ‘garden of forking paths’: at each step the analyst makes a decision, but could have made a different one. These issues can result in mere correlation being interpreted as causation.
Adequately documenting the methods and results of data analysis helps safeguard against such mistakes. This thesis proposes that reproducible research pipelines address the problem of adequate documentation of data analysis, because they make it easy to check the methods, challenge the assumptions and verify the results in new analyses. Traditional research moves through the steps of hypothesis and design, measured data, analytic data, computational results (for figures, tables and numerical results), and reports (text and formatted manuscript). Reproducible research pipelines extend traditional research by encoding these steps in a computer ‘scripting’ language and distributing the data and code with publications.
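To make the idea of encoding the steps in a scripting language concrete, the following is a minimal sketch only, not the implementation used in this thesis. It assumes Python purely for illustration. The toy records, the variable names (`smoke`, `admissions`) and all function names are hypothetical, and the values are invented placeholders rather than real data. Each function corresponds to one transition between the steps named above (measured data, analytic data, computational results, report), and a hash of each intermediate product lets a reader who reruns the distributed code and data check that they obtain the same outputs.

```python
import hashlib
import json
import statistics

# Hypothetical "measured data": invented placeholder values, not real observations.
MEASURED = [
    {"day": 1, "smoke": 12.0, "admissions": 30},
    {"day": 2, "smoke": None, "admissions": 28},  # missing exposure value
    {"day": 3, "smoke": 55.0, "admissions": 41},
    {"day": 4, "smoke": 20.0, "admissions": 33},
]


def make_analytic(measured):
    """Measured data -> analytic data: drop records with a missing exposure."""
    return [row for row in measured if row["smoke"] is not None]


def compute_results(analytic):
    """Analytic data -> computational results (simple descriptive summaries)."""
    return {
        "n": len(analytic),
        "mean_smoke": statistics.mean(row["smoke"] for row in analytic),
        "mean_admissions": statistics.mean(row["admissions"] for row in analytic),
    }


def write_report(results):
    """Computational results -> report text."""
    return (
        f"Analysed {results['n']} days; "
        f"mean smoke {results['mean_smoke']:.1f}, "
        f"mean admissions {results['mean_admissions']:.1f}."
    )


def fingerprint(obj):
    """Hash an intermediate product so that a rerun can be compared with it."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def run_pipeline(measured):
    """Run every step in order and keep each intermediate product."""
    analytic = make_analytic(measured)
    results = compute_results(analytic)
    report = write_report(results)
    return {
        "analytic": analytic,
        "results": results,
        "report": report,
        "fingerprints": {
            "analytic": fingerprint(analytic),
            "results": fingerprint(results),
        },
    }


if __name__ == "__main__":
    output = run_pipeline(MEASURED)
    print(output["report"])
    print(output["fingerprints"])
```

Because every decision (here, the exclusion of records with missing exposure) is written down as code, a reader can change that decision and rerun the whole pipeline to see how the results respond.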