Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed under CC-BY. Find out more here.


Notes from Dr Climate on data reference syntax (DRS) models for file organisation and naming

<computer>/<project>/<organisation>/<collection>/<facility>/<data-type>/<site-code>/<year>/

The data type has a sub-DRS of its own, which tells us that the data
represents the 1-hourly average surface current for a single month
(October 2012), and that it is archived on a regularly spaced spatial
grid and has not been quality controlled.

Just in case the file gets separated from this informative directory
structure, much of the information is repeated in the file name
itself, along with some more detailed information about the start and
end time of the data, and the last time the file was modified:

<project>_<facility>_V_<time-start>_<site-code>_FV00_<data-type>_<time-end>_<modified>.nc.gz

At first glance this level of detail seems like overkill...

Since the data are so well labelled,
locating all monthly timescale ACORN data from the Turquoise Coast and
Rottnest Shelf sites (which represents hundreds of files) would be as
simple as typing the following at the command line:

$ ls */ACORN/monthly_*/{TURQ,ROT}/*/*.nc
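The same labelling pays off programmatically. As a minimal sketch (not the official IMOS tooling), the fields can be recovered from a file name by splitting on underscores, assuming the template above; the example file name and the field labels are my own illustration:

```python
# Sketch: recover DRS fields from a name following the template
# <project>_<facility>_V_<time-start>_<site-code>_FV00_<data-type>_<time-end>_<modified>.nc.gz
# ("V" and "FV00" are literal tokens in the template as shown).

FIELDS = ["project", "facility", "V", "time-start", "site-code",
          "FV00", "data-type", "time-end", "modified"]

def parse_drs_filename(name):
    """Map each underscore-separated token to its DRS field."""
    stem = name
    for suffix in (".gz", ".nc"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
    tokens = stem.split("_")
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(tokens)}")
    return dict(zip(FIELDS, tokens))

# Hypothetical file name constructed to match the template:
info = parse_drs_filename(
    "IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_"
    "monthly-avg_20121031T230000Z_20130101T000000Z.nc.gz"
)
print(info["site-code"])  # TURQ
```

Because every field sits at a fixed position, the same split works for any file in the collection, which is exactly what makes bulk operations like the `ls` glob above reliable.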

Damien’s personalised DRS

Basic data files

<var>_<dataset>_<level>_<time>_<spatial>.nc

Sub-categories:

  • <time>: <tstep>-<aggregation>-<season>
  • <spatial>: <grid>-<region>-<bounds>-<np>

Where:

  • <tstep>: daily, monthly
  • <aggregation>: 030day-runmean, anom-wrt-1979-2011, anom-wrt-all
  • <season>: JJA, MJJASO
  • <grid>: native or something like y181x360, which describes the number of latitude (181) and longitude (360) points (in this case it is a 1 by 1 degree horizontal grid).
  • <region>: Region names are defined in netcdf_io.py
  • <bounds>: e.g. lon225E335E-lat10S10N or mermax, zonal-anom
  • <np>: North pole location, e.g. np20N260E

Examples include:
psl_Merra_surface_daily_y181x360.nc
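Composing a name from this template is mechanical: top-level components are joined with underscores, sub-categories with hyphens. A small sketch (the helper name is my own, not Damien's):

```python
# Sketch: build a file name from Damien's template
# <var>_<dataset>_<level>_<time>_<spatial>.nc,
# where <time> and <spatial> are hyphen-joined sub-categories.

def drs_name(var, dataset, level, time_parts, spatial_parts):
    """Join components with underscores; sub-categories with hyphens."""
    time = "-".join(time_parts)
    spatial = "-".join(spatial_parts)
    return f"{var}_{dataset}_{level}_{time}_{spatial}.nc"

print(drs_name("psl", "Merra", "surface", ["daily"], ["y181x360"]))
# psl_Merra_surface_daily_y181x360.nc
```

Using underscores only between top-level components keeps the name unambiguous to split: `name.split("_")` always yields exactly five fields.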

More complex file names

<inside>_<filters>_<prev-var>_<dataset>_<level>_<time>_<spatial>.nc

Sub-categories:

  • <inside>: The variable inside the file. e.g. tas-composite, datelist
  • <filters>: e.g. samgt90pct (gt and lt are used for greater and less than, pct for percentile)
  • <prev-var>: if it’s not obvious what variable <inside> was created from, include the previous variable/s

Examples:
tas-composite_pwigt90pct_ERAInterim_500hPa_030day-runmean-anom-wrt-all_native-sh.png

Principles of Tidy Data

In the words of Hadley Wickham, the ordering of data should follow some generic principles:

'A good ordering makes it easier to scan the raw values. One way of
organizing variables is by their role in the analysis: are values
fixed by the design of the data collection, or are they measured
during the course of the experiment? Fixed variables describe the
experimental design and are known in advance. Computer scientists
often call fixed variables dimensions, and statisticians usually
denote them with subscripts on random variables. Measured variables
are what we actually measure in the study. Fixed variables should come
first, followed by measured variables, each ordered so that related
variables are contiguous. Rows can then be ordered by the first
variable, breaking ties with the second and subsequent (fixed)
variables.'
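Wickham's rule can be sketched in plain Python: fixed (design) variables lead the column order, measured variables follow, and rows sort by the fixed variables with ties broken left to right. The column names and values below are invented for illustration:

```python
# Sketch of tidy-data ordering: fixed variables first, then measured,
# rows sorted by the fixed variables in order.

fixed = ["site", "year"]         # known in advance (the "dimensions")
measured = ["temp", "rainfall"]  # recorded during the study

rows = [
    {"site": "ROT",  "year": 2012, "temp": 19.1, "rainfall": 40.2},
    {"site": "TURQ", "year": 2011, "temp": 21.3, "rainfall": 12.5},
    {"site": "ROT",  "year": 2011, "temp": 18.4, "rainfall": 33.0},
]

columns = fixed + measured  # fixed variables lead, measured follow
# Sorting by a tuple of the fixed values breaks ties with the second
# and subsequent fixed variables, as the quote describes.
rows.sort(key=lambda r: tuple(r[k] for k in fixed))

for r in rows:
    print([r[c] for c in columns])
```

After sorting, all the ROT rows are contiguous and ordered by year, so scanning related values requires no searching.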

An exemplar

In my last project, the protocol we developed (for an ecology and biodiversity database) had a naming convention that relied heavily on a fixed sequence of information to order the names of folders, subfolders and files. The sequence is:

  1. The project name (and optional sub-project name)
  2. Data type (such as experimental unit, observational unit, and/or measurement methods)
  3. Geographic location (locality name, State, Country)
  4. Temporal frequency and coverage (such as annual or seasonal tranches).

The concepts of slow-moving dimensions and fast-moving variables

The concepts of dimensions and variables can be useful here, especially for deciding on filenames. Dimensions are fixed or change slowly, while variables change more quickly. By 'change' I mean that there are more distinct values of them across the collection. For example, the project name is 'fixed': it does not change across the files. The sub-project name does change, just slowly (there may be 2-3 different sub-projects within a project). Then there may be a set of data types, and these 'change' more quickly than the sub-project name. The geographic and temporal variables might change quickest of all.

A general rule for the order of things can therefore be stated. The fixed and slowly changing variables should come first (those things that don't change, or don't change much), followed by the more fluid variables (those that change more across the project). List elements can then be ordered so that groups of similar things are always contiguous and vary sequentially within clusters.
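This slow-to-fast ordering falls out naturally from nested iteration: `itertools.product` varies its rightmost input fastest, so listing slowly changing parts first keeps related files contiguous in a directory listing. A sketch with invented names:

```python
# Sketch: generate file names with slowly changing parts first.
# product() advances the rightmost sequence fastest, so files sharing
# a sub-project and data type end up adjacent.
from itertools import product

projects = ["projA"]            # fixed across the collection
subprojects = ["sub1", "sub2"]  # changes slowly
datatypes = ["obs", "expt"]     # changes faster
sites = ["ROT", "TURQ"]         # changes fastest

names = ["_".join(parts) + ".csv"
         for parts in product(projects, subprojects, datatypes, sites)]
for n in names:
    print(n)
```

The first names generated are `projA_sub1_obs_ROT.csv` and `projA_sub1_obs_TURQ.csv`: everything under `sub1` stays grouped together, exactly the contiguity the rule aims for.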

So the only thing I disagree with Damien about is his decision to put the spatial information after the time information:

<var>_<dataset>_<level>_<time>_<spatial>.nc

This is because I think the geography is more stable than the time period for a data collection, and since most of my studies look at changes in variables measured at a location over time, I generally want to compare the same spot at multiple times. There are pros and cons to each approach: if the analyst wants to map a variable measured at several locations at a single point in time, then having the data arranged by time first and location second may make that job simpler.

I also notice however that the IMOS syntax puts the site spatial location before the year.

Posted in  disentangle
