Disentangle Things by Ivan Hanigan

Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licenced with CC-BY.

using-additional-header-rows-for-metadata

Table of Contents

Comment on eMast recommendations
  - Introduction
  - R code

Comment on eMast recommendations

ivan.hanigan@anu.edu.au

Introduction

I was lucky to be forwarded a copy of the document “DRAFT Best practices for collecting, processing and collating plant trait data” (V0.0).
What I like most about this is the statement on page nine under “3. Best practice collection techniques” that “Datasets should be maintained following two simple practices of formatting and cataloguing”.
I think the recommendations are very sensible, but I have the following comment regarding the proposed file structure shown in the table.

I have not used the second row for the units before; instead I have encoded this information in a separate metadata file that I keep with the main data file.
The second row does seem attractive,
BUT because it can be interpreted as the first row of data after the header row of column names, extra code is needed before the file can be imported into statistics packages.
While this is easily handled by writing code that treats this row separately, it seems a bit risky to expect ordinary data users to do so.
I wonder if the authors intended the units to go in the column NAME, rather than in a row of the column as the statement currently reads (“the units of measurement and an expanded definition of the data recorded in each column”) and the table shows?
For example, the R code given in the Appendix, “R script to aggregate unprocessed trait data into summary statistics ready for the EMP DATABASE”, reads each file with a plain read.csv call, so the units row is imported as if it were the first row of data.
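To make the risk concrete, here is a minimal sketch in R. The two-column file contents are made up for illustration (they are not taken from the eMast document), and the column names `Height` and `Air.Temp` are assumptions. A plain `read.csv` takes the units row as the first observation, so the measured columns arrive as text; reading the names and the data in two passes is one way around this.

```r
# Hypothetical file contents: column names, then a units row, then data
txt <- "Height,Air.Temp\nm,oC\n15,25.56\n16,24.10\n"

# Naive import: the units row becomes the first row of data
naive <- read.csv(text = txt, stringsAsFactors = FALSE)
str(naive)  # both columns are character because of the units row

# Two-pass import: take the names from row 1, then skip names and units
col_names <- names(read.csv(text = txt, nrows = 1))
fixed <- read.csv(text = txt, skip = 2, header = FALSE, col.names = col_names)
str(fixed)  # both columns now hold numbers
mean(fixed$Height)
```

The second pass is exactly the kind of extra code an ordinary data user would have to know to write.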

R code

The eMast document provides an interesting appendix of R code.
The following is an attempt to make a vignette that will run with the example data provided.

Construct some fake data


# xtable is needed for the HTML tables printed below
library(xtable)
# the second line of the text is the units row proposed in the eMast document
dat <- read.csv(textConnection("Date    ,Latitude,Longitude,Genus,Species,Tree No.,Meas. No.,Photosynthesis,Air Temp.,Height\n         ,      oS,       oE,          ,       ,   , ,umol m-2 s-1, oC, m\n1/10/2004,-43.4444,140.1453 ,Eucalyptus,Saligna,  1,1,15.043,25.56,15\n1/10/2004,-43.4444,140.1453 ,Eucalyptus,Saligna,  1,2,15.998,25.56,15\n1/10/2004,-43.4444,140.1453 ,Eucalyptus,Saligna,  1,3,15.584,25.56,15\n"))
# write a couple of fake data files
for (i in 1:2) {
    write.csv(dat, paste("Book", i, ".csv", sep = ""), row.names = FALSE)
}
# show me the data
print(xtable(dat), type = "html")

	Date	Latitude	Longitude	Genus	Species	Tree.No.	Meas..No.	Photosynthesis	Air.Temp.	Height
1		oS	oE					umol m-2 s-1	oC	m
2	1/10/2004	-43.4444	140.1453	Eucalyptus	Saligna	1	1	15.043	25.56	15
3	1/10/2004	-43.4444	140.1453	Eucalyptus	Saligna	1	2	15.998	25.56	15
4	1/10/2004	-43.4444	140.1453	Eucalyptus	Saligna	1	3	15.584	25.56	15

Run the aggregation of trait data program provided in eMast doc


library(stringr)
# This function summarises the data: pass it the parameter of interest and
# the quantities by which it varies.
do.sumy <- function(pars, lab, dat) {
    epars <- as.formula(pars)
    mu <- aggregate(epars, data = dat, mean)
    md <- aggregate(epars, data = dat, median)
    se <- aggregate(epars, data = dat, sd)  # standard deviation, reported as Std
    mn <- aggregate(epars, data = dat, min)
    mx <- aggregate(epars, data = dat, max)
    NN <- aggregate(epars, data = dat, length)
    # the grouping variables are the '+'-separated terms of the formula; drop
    # those leading columns from every result except the first
    drp0 <- unlist(strsplit(pars, "\\+"))
    drp <- -length(drp0):-1
    df1 <- cbind(mu, md[, drp], se[, drp], mn[, drp], mx[, drp], NN[, drp])
    df2 <- cbind(df1, Parameter = lab)
    # strip the response (e.g. 'Height~') from the first term to get the names
    hd1 <- gsub("^\\w.*.~", "", drp0)
    names(df2) <- c(hd1, "Mean", "Median", "Std", "Min", "Max", "N", "Parameter")
    return(df2)
}

# Not required: setwd('~/where/is/the/data/?')
# pattern is a regular expression, not a shell glob
files <- list.files(pattern = "\\.csv$")
# files
data <- lapply(files, read.csv, header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)
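As a rough illustration of what `do.sumy` builds on, the sketch below calls `aggregate()` with a formula on a small made-up data frame (the values are assumptions for the example only). `aggregate()` returns the grouping columns first and the summarised value last, which is why `do.sumy` can drop the leading columns from all but the first result before binding them together.

```r
# Made-up trait data: two taxa, two height measurements each
toy <- data.frame(
    Genus   = c("Eucalyptus", "Eucalyptus", "Acacia", "Acacia"),
    Species = c("saligna", "saligna", "mearnsii", "mearnsii"),
    Height  = c(15, 17, 8, 10),
    stringsAsFactors = FALSE
)

# One row per Genus/Species combination; grouping columns come first
mu <- aggregate(Height ~ Genus + Species, data = toy, mean)
NN <- aggregate(Height ~ Genus + Species, data = toy, length)
names(mu)

# Negative indices drop the two grouping columns, leaving only the counts
summary_tab <- cbind(mu, N = NN[, -2:-1])
names(summary_tab)[3] <- "Mean"
summary_tab
```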

Minor modifications to make it work

I needed to write the following to read the csv with metadata row


#### Notes
#### str(data)
#### As expected, columns are sometimes imported as character because of the
#### second row, so this needs fixing. Also, because the data object is a list
#### of data.frames, it is easier to run the example if we just create
#### individual data frames.
read_metadata_csv <- function(filename) {
    # skip the column-name row: the units row is consumed as a throwaway
    # header, so the measurements keep their numeric types
    dat <- read.csv(filename, skip = 1)
    # the real column names, and the units held in the row beneath them
    unit.defs <- read.csv(filename, nrows = 1, stringsAsFactors = FALSE)
    col.defs <- names(unit.defs)
    # keep the units with the data as an attribute
    attr(dat, "unit.defs") <- unit.defs[1, ]
    names(dat) <- col.defs
    return(dat)
}
# files
data <- read_metadata_csv(files[1])
# str(data)
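A short usage sketch, following the same idea as `read_metadata_csv`: the function below is a hypothetical variant (`read_units_csv` is my own name, not from the eMast document) that skips both header rows explicitly, and the file is a temporary one with made-up values written just for this example. It shows how the units travel with the data frame as an attribute and can be looked up later.

```r
# Hypothetical variant of the reader: names from row 1, units from row 2
read_units_csv <- function(filename) {
    header <- read.csv(filename, nrows = 1, stringsAsFactors = FALSE)
    dat <- read.csv(filename, skip = 2, header = FALSE,
                    col.names = names(header), stringsAsFactors = FALSE)
    attr(dat, "unit.defs") <- header[1, ]
    dat
}

# Write a small made-up file with a units row to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("Genus,Photosynthesis,Height",
             ",umol m-2 s-1,m",
             "Eucalyptus,15.043,15",
             "Eucalyptus,15.998,15"), tmp)

trait <- read_units_csv(tmp)
str(trait)                        # measurement columns hold numbers
attr(trait, "unit.defs")$Height   # look up the unit for one column
```

One caveat with this design: a custom attribute may not survive every data-frame operation, so the units are best checked straight after import.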

Now finish off with the example code

#### NOTE this is not relevant here so commented out
#### If the Genus and Species is not separate, then split it up
#### split.spp <- strsplit( data$Spp.Name, ' ' )
#### data$Genus <- sapply( split.spp, '[[', 1 )
#### data$Species <- sapply( split.spp, '[[', 2 )



# Get the summary stats for each interesting trait
Height <- do.sumy("Height~Photosynthesis+Genus+Species", "Height", data)
# show me the data
print(xtable(Height), type = "html")

	Photosynthesis	Genus	Species	Mean	Median	Min	Max	N	Parameter
1	15.04	Eucalyptus	Saligna	15.00	15	15	15	1	Height
2	15.58	Eucalyptus	Saligna	15.00	15	15	15	1	Height
3	16.00	Eucalyptus	Saligna	15.00	15	15	15	1	Height
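Each group in the table has N = 1 because Photosynthesis, itself a measured value, is used as a grouping variable in the formula. As a sketch only (grouping by `Genus + Species` alone is my assumption for illustration, not something the eMast script does), the same three example measurements collapse into a single group:

```r
# The three example measurements from the fake data
ex <- data.frame(
    Genus          = "Eucalyptus",
    Species        = "Saligna",
    Photosynthesis = c(15.043, 15.998, 15.584),
    Height         = c(15, 15, 15),
    stringsAsFactors = FALSE
)

# Grouping by the measured value gives one row per measurement
by_meas <- aggregate(Height ~ Photosynthesis + Genus + Species, data = ex, length)
by_meas$Height  # the count in each group

# Grouping by taxon only gives a single group holding all the measurements
by_taxon <- aggregate(Photosynthesis ~ Genus + Species, data = ex, length)
by_taxon$Photosynthesis
```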

#### not available
#### Vcmax <- do.sumy( 'Vcmax25~CO2.treatment+Genus+Species', 'VCMAXM25', data )
#### Jmax <- do.sumy( 'Jmax25~CO2.treatment+Genus+Species', 'JMAXM25', data )
#### VjVn <- do.sumy( 'J.V~CO2.treatment+Genus+Species', 'VJVN', data )
#### Tleaf <- do.sumy( 'Tleaf_avg~CO2.treatment+Genus+Species', 'TLEAF', data )
#### all.Y <- rbind( Vcmax, Jmax, VjVn, Tleaf )

## # Re-arrange the columns to the correct order
## exp.dat1 <- subset( all.Y, select=c('Genus','Species','Parameter','Mean','Median','Std','Min','Max','N','CO2.treatment') )

## # save it somewhere
## write.table( exp.dat1, 'choose/folder/to/save/to.csv', sep=',', row.names=F, col.names=T, na='' )
