* /home/
** /overview.org
- summary data_inventory
- DMP (data management plan)
** /worklog.org
- YYYY-MM-DD
* /projects/
** /project1_data_analysis_project_health_research
*** /dataset1_merged_health_outcomes_and_exposures
- index.org
- git (local private, gitignore all subfolders)
- workplan
- worklog
- workflow
- main.Rmd
**** /data1_provided
**** /data2_derived
- workflow script
**** /code
**** /results/ (this has all the pathways explored)
- README.md
- git (public GitHub)
/YYYY-MM-DD-shortname (e.g. EDA, prelim, model-selection, sensitivity)
/main.Rmd
/code/
/data/
**** /report/
/manuscript.Rmd
- main results recomputed in production/publication quality
- supporting_information (can also refer to github/results)
/figures_and_tables/
- png
- csv
***** /journal_submission/
- cover letter
- approval signatures
- submitted manuscript
***** /journal_revision/
- response.org
** /project2_data_analysis_project_exposure_assessment
- index.org
- git
*** /dataset2.1_monitored_data
- workplan
- worklog
- workflow
**** /data1_provided
**** /data2_derived
- stored here, or
- a web2py CRUD app, or
- GeoServer
/data1_and_data2_backups
/reports/
- manuscript.Rmd -> publish alongside the data (mechanism to be decided)
/tools (R package)
- git/master -> GitHub
**** /dataset2.2_GIS_layers
** /methods_or_literature_review_project
* /tools/
/web2py
/applications
/data_inventory
- holdings
- prospective
/database_crud
/disentangle (R package)
/pipeline_templates
** /data/
/postgis_hanigan
/postgis_anu_gislibrary
/geoserver_anu_gislibrary
** /references/
- Mendeley
- bib
- annotated PDFs
** /KeplerData/workflows/MyWorkflows/
*** /data_analysis_workflow_using_kepler (implemented as an R package)
**** /inst/doc/A01_load.R
*** /data_analysis_workflow_using_kepler (implemented as an R LCFD workflow)
- main.Rmd (raw R version)
- main.xml (the Kepler workflow)
**** /data/
- file1.csv
- file2.csv
**** /code/
- load.R
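As a concrete footnote to the schematic: the skeleton of one dataset folder can be scaffolded with a few lines of base R. This is a minimal sketch; the folder names follow the outline above, while `scaffold_dataset` itself and the `.org` extensions on the planning files are my own assumptions.

```r
## Minimal sketch: scaffold one dataset folder following the outline above.
## scaffold_dataset() is illustrative only, not part of any package.
scaffold_dataset <- function(root) {
  dirs <- c("data1_provided", "data2_derived", "code", "results", "report")
  for (d in dirs) {
    dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  }
  # planning/tracking files; the .org extensions are an assumption
  files <- c("index.org", "workplan.org", "worklog.org", "workflow.org",
             "main.Rmd")
  file.create(file.path(root, files))
  invisible(root)
}

scaffold_dataset("~/projects/project1/dataset1_merged_health_outcomes_and_exposures")
```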
---
name: my-framework-of-scientific-workflow-and-integration-software-for-holistic-data-analysis
layout: post
title: My framework of scientific workflow and integration software for holistic data analysis
date: 2015-12-22
categories:
- data management
- swish
---
Scientific workflow and integration software for holistic data analysis (SWISH) is a title I have given to describe the area of my research that focuses on the tools and techniques of reproducible data analysis.
Reproducibility is the ability to recompute the results of a data analysis from the original data. Analyses can be reproducible with varying degrees of difficulty: an analysis might be reproducible in principle, yet require thousands of hours of work to piece together the datasets, transformations, manipulations, calculations and interpretations of computational results. The primary challenge, then, is to make analyses that are easy to reproduce.
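To make "easy to reproduce" concrete, the limiting case is a single script that goes from the original data to the reported results with no manual steps in between. A minimal sketch, with file and variable names that are illustrative only:

```r
## Minimal sketch of an easily reproduced analysis: original data in,
## results out, randomness fixed, computing environment recorded.
set.seed(123)
dat <- read.csv("data1_provided/file1.csv")          # illustrative path
fit <- lm(outcome ~ exposure, data = dat)            # illustrative variables
write.csv(coef(summary(fit)), "results/model_coefficients.csv")
writeLines(capture.output(sessionInfo()), "results/sessionInfo.txt")
```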
To make analyses easy to reproduce, a guiding principle is that analysts should implement 'pipelines' of method steps and tools, as in the sketch below. These pipelines should follow standardised, evidence-based conventions that emerge from many analysts approaching problems in a similar way, rather than being configured to suit each analyst's individual or domain-specific preferences.
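In R, such a convention can be as simple as a 'main' driver that runs the standard step scripts in a fixed order. The sketch below uses the LCFD layout (load, clean, func, do) that appears elsewhere in this post; only load.R is shown in the schematic, so the other step names are the conventional complements rather than anything prescribed here.

```r
## main.R -- minimal sketch of a conventional pipeline driver.
## Each step is a standalone script in code/; the fixed order is the convention.
steps <- c("code/load.R", "code/clean.R", "code/func.R", "code/do.R")
for (s in steps) {
  message("running step: ", s)
  source(s)
}
```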
Planning and implementing a pipeline
It can be much easier to conceptualise a complicated data analysis method than to implement it as a reproducible research pipeline. The most effective way to implement a pipeline is to methodically track each step taken, the data inputs it needs and all of the outputs it produces. Done in a disciplined way, this lets the analyst or another person easily 'audit' the procedure and access the details of whichever part of the pipeline they need to scrutinise.
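One way to make that tracking mechanical rather than a matter of discipline alone is to wrap each step in a function that logs its script, inputs and outputs. The sketch below appends file paths and MD5 checksums to a plain-text audit log; `run_step`, the log format and the file names are my own illustration, not an established API.

```r
## Sketch of auditable step execution: each step declares its input and
## output files, and their MD5 checksums are appended to an audit log.
library(tools)  # for md5sum()

run_step <- function(script, inputs, outputs, log = "audit.log") {
  source(script)
  stamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  for (f in c(inputs, outputs)) {
    cat(sprintf("%s\t%s\t%s\t%s\n", stamp, script, f, md5sum(f)),
        file = log, append = TRUE)
  }
}

# illustrative file names
run_step("code/load.R",
         inputs  = "data1_provided/file1.csv",
         outputs = "data2_derived/analytical_table.csv")
```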
Toward a standardised data analysis pipeline framework
In my own work I have tried a wide variety of configurations, based on things I have read and discussions I have had. Coming to the end of my PhD project, I have reflected on the framework I have arrived at, and present it below as a schematic overview.