Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed under CC-BY. Find out more here.


My Framework Of Scientific Workflow And Integration Software For Holistic Data Analysis

*   /home/
**    /overview.org 
           - summary data_inventory
           - DMP
**    /worklog.org    
           - YYYY-MM-DD
*   /projects/
**    /project1_data_analysis_project_health_research
***       /dataset1_merged_health_outcomes_and_exposures
             - index.org
             - git (local private, gitignore all subfolders)
             - workplan
             - worklog
             - workflow
             - main.Rmd
****         /data1_provided
****         /data2_derived
*****            - workflow script
****         /code
****         /results/  (this has all the pathways explored)
*****           - README.md
                 - git (public GitHub)
                 /YYYY-MM-DD-shortname (e.g. EDA, prelim, model-selection, sensitivity)
                     /main.Rmd
                     /code/
                     /data/
****         /report/
                   /manuscript.Rmd
                     - main results recomputed in production/publication quality
                     - supporting_information (but also can refer to github/results)
                 /figures_and_tables/
                     - png
                     - csv
*****           /journal_submission/
                     - cover letter
                     - approval signatures
                     - submitted manuscript
*****           /journal_revision/
                     - response.org
**    /project2_data_analysis_project_exposure_assessment
           - index.org
           - git
***       /dataset2.1_monitored_data
              - workplan
              - worklog
              - workflow
****         /data1_provided
****         /data2_derived 
                 - stored here or
                 - web2py crud or
                 - geoserver
              /data1_and_data2_backups
              /reports/
                 - manuscript.Rmd -> publish with the data somehow
              /tools (R package)
                  - git/master -> GitHub
****      /dataset2.2_GIS_layers 
**    /methods_or_literature_review_project
*  /tools/
         /web2py
             /applications
                 /data_inventory
                     - holdings
                     - prospective
                 /database_crud
          /disentangle (R package)
          /pipeline_templates
**   /data/
         /postgis_hanigan
         /postgis_anu_gislibrary
         /geoserver_anu_gislibrary
**   /references/
         - mendeley
         - bib
         - PDFs annotated
**   /KeplerData/workflows/MyWorkflows/
***      /data_analysis_workflow_using_kepler (implemented as an R package)
****         /inst/doc/A01_load.R
***      /data_analysis_workflow_using_kepler (implemented as an R LCFD workflow)
             - main.Rmd (raw R version)
             - main.xml (this is kepler)
****         /data/
                 - file1.csv
                 - file2.csv
****         /code/
                 - load.R
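
The layout above can be scaffolded with a few lines of R. The sketch below is only illustrative: the helper name create_project_skeleton() and the exact set of folders and files it creates are assumptions made for the example, not part of any existing package.

```r
## Sketch: scaffold one dataset folder of the layout outlined above.
## The function name and folder list are illustrative assumptions.
create_project_skeleton <- function(root, dataset) {
  dirs <- file.path(root, dataset,
                    c("data1_provided", "data2_derived", "code",
                      "results", "report/figures_and_tables"))
  for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
  ## index, workplan, worklog and the master Rmd live at the dataset level
  file.create(file.path(root, dataset,
                        c("index.org", "workplan.org", "worklog.org", "main.Rmd")))
  invisible(dirs)
}

## Usage (hypothetical paths):
## create_project_skeleton("~/projects/project1_data_analysis_project_health_research",
##                         "dataset1_merged_health_outcomes_and_exposures")
```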

name: my-framework-of-scientific-workflow-and-integration-software-for-holistic-data-analysis
layout: post
title: My framework of scientific workflow and integration software for holistic data analysis
date: 2015-12-22
categories:

  • data management
  • swish

Scientific workflow and integration software for holistic data analysis (SWISH) is the title I have given to the area of my research that focuses on the tools and techniques of reproducible data analysis.

Reproducibility is the ability to recompute the results of a data analysis with the original data. Analyses can be reproducible with varying degrees of difficulty: an analysis might be reproducible in principle yet require thousands of hours of work to piece together the datasets, transformations, manipulations, calculations and interpretations of computational results. The primary challenge for reproducible data analysis is therefore to make analyses that are easy to reproduce.

To achieve this, a guiding principle is that analysts should implement ‘pipelines’ of method steps and tools. These pipelines should use standardised, evidence-based methods built on conventions that emerge when many analysts approach problems in a similar way, rather than each analyst configuring a pipeline to suit individual or domain-specific preferences.
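
As a concrete illustration, the directory layout above mentions an R LCFD workflow; a minimal sketch of that convention (load, clean, func, do) is shown below. The individual step scripts are assumed placeholders for whatever a given analysis needs.

```r
## Sketch of an LCFD-style controller script (e.g. the code run by main.Rmd).
## Each step lives in its own script under code/ and is sourced in a fixed
## order, so the whole pipeline can be recomputed from the provided data.
source("code/load.R")   # load: read data1_provided into R objects
source("code/clean.R")  # clean: derive analysis-ready data (data2_derived)
source("code/func.R")   # func: define the analysis functions
source("code/do.R")     # do: run the analysis and write results/
```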

Planning and implementing a pipeline

It can be much easier to conceptualise a complicated data analysis method than to implement it as a reproducible research pipeline. The most effective way to implement a pipeline is to methodically track each step taken, the data inputs it needs and the outputs it produces. Done in a disciplined way, this lets the analyst, or anyone else, ‘audit’ the procedure easily and drill into whichever details of the pipeline they need to scrutinise.
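
One lightweight way to do this tracking is to append a row to a worklog for every step, recording its inputs and outputs. The sketch below assumes a simple CSV log; the function name log_step() and the log format are illustrative, not a prescribed part of the framework.

```r
## Sketch: append one row per pipeline step so the procedure can be
## audited later. The function name and log format are assumptions.
log_step <- function(step, inputs, outputs, logfile = "worklog.csv") {
  entry <- data.frame(
    date    = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    step    = step,
    inputs  = paste(inputs, collapse = "; "),
    outputs = paste(outputs, collapse = "; "),
    stringsAsFactors = FALSE
  )
  write.table(entry, logfile, sep = ",", row.names = FALSE,
              col.names = !file.exists(logfile),
              append = file.exists(logfile))
}

## Usage (hypothetical files):
## log_step("clean",
##          inputs  = "data1_provided/file1.csv",
##          outputs = "data2_derived/file1_clean.csv")
```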

Toward a standardised data analysis pipeline framework

In my own work I have tried a variety of configurations based on things I have read and discussions I have had. Coming to the end of my PhD project, I have reflected on the framework I have arrived at and present it as the schematic directory layout shown at the top of this post.
