Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licenced with CC-BY. Find out more Here.


cwt-lter-data-submission-template-critique

  • I recently reviewed a tool for collecting metadata about Long Term Ecological Research (LTER) data
  • it is just an Excel spreadsheet called cwt_data_subm_template_2013.xls
  • You can download a copy here www.coweeta.uga.edu/resources/forms/cwt_data_subm_template_2013.xls
  • LTER is the U.S. Long-Term Ecological Research (LTER) network
  • I made the following notes; this is not intended to be a nasty critique
  • The following are a few Frank and Fearless comments that I’ll be using to compare the pros and cons of a variety of data documentation approaches

Critique

  • I opened it first on Windows and saw comments on cells with instructions
  • I opened it next on Linux with LibreOffice and the comments were gone
  • it opened at the last tab (split in two for no reason?)
  • noticed recommended name “GCE site” = Site, otherwise “permanent plot” = Plot?
  • GCE = Georgia Coastal Ecosystems LTER program
  • flipping to the first tab, point 4 suggests there is some export functionality I cannot see (a VBA script?)
  • cell 11 a NOTE: When submitting updated metadata or re-using templates please highlight fields with modified contents in yellow
  • and use glitter pen??? (See this great post called excel-is-not-your-lab-notebook!)
  • personnel tab OK
  • instrumentation: “variable measured” is free text. That is OK, but it invites inconsistent entries for the same thing, e.g. “max temp”, “temperature maxima”, “maximum temperature (c)”, “maximum temperature in 24 hours after 9am local time in degrees” etc
  • too wide: the last column was off my wide screen! I also noticed wasted real estate in column A
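
To illustrate the free text problem noted above, here is a minimal sketch in R of mapping entries onto a controlled vocabulary. The vocabulary, the function name standardise_variable and the example entries are all made up for illustration; nothing like this exists in the template.

```r
# Hypothetical controlled vocabulary: each accepted term with the free-text variants it covers.
vocab <- list(
  "maximum temperature" = c("max temp", "temperature maxima", "maximum temperature (c)")
)

# Return the controlled term for each entry, or NA if the entry is not recognised.
standardise_variable <- function(x, vocab) {
  cleaned <- tolower(trimws(x))
  # Build a named lookup vector: names are the variants, values are the controlled terms.
  lookup <- unlist(lapply(names(vocab), function(term) {
    variants <- c(term, vocab[[term]])
    setNames(rep(term, length(variants)), variants)
  }))
  unname(lookup[cleaned])
}

entered <- c("Max temp", "temperature maxima", "daily rain")
standardise_variable(entered, vocab)
```

Entries that come back as NA are the ones a data manager would need to follow up with the submitter.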

Moving on to the tabular data sheet

  • I don’t like this: “– Paste or enter your data values into the ‘Values’ section (white cells), starting with the indicated cell”
  • this is an invitation for clerical error! Too many “copy-and-paste” actions will inevitably introduce errors

I do like the extra metadata fields:

  • Column Name:
  • Description:
  • Units:
  • Data type:
  • Variable type:
  • Number type:
  • Precision:
  • Code values:
  • Calculations:
  • QC: Minimum Valid:
  • QC: Minimum Expected:
  • QC: Maximum Expected:
  • QC: Maximum Valid:
  • QC: Custom:
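
As a rough idea of how the four QC limit fields could be put to work once they are recorded, here is a small R sketch. The limits, the flag labels and the function name qc_flag are my own assumptions for illustration, not something the template defines.

```r
# Hypothetical QC metadata for one column, mirroring the template's four QC limit fields.
qc <- list(min_valid = -90, min_expected = -10, max_expected = 45, max_valid = 60)

# Flag each value as "ok", "questionable", "invalid" or "missing".
qc_flag <- function(x, qc) {
  flag <- rep("ok", length(x))
  # Outside the expected range but still possible.
  flag[which(x < qc$min_expected | x > qc$max_expected)] <- "questionable"
  # Outside the valid range: overrides "questionable".
  flag[which(x < qc$min_valid | x > qc$max_valid)] <- "invalid"
  flag[is.na(x)] <- "missing"
  flag
}

max_temp <- c(21.5, 47.2, -120, NA, 30)
data.frame(max_temp, flag = qc_flag(max_temp, qc))
```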

Other issues:

The template also says: “– Fill in missing values in the table with NaN (not a number), including text fields, and do not skip columns”

  • but what about missing values imbued with other meanings (NA = not observed, censored etc)?
  • it asks users to format digit rounding in Excel?? oh no
  • users of old Excel versions may still be restricted to 65,536 rows by 256 columns.
  • the non-tabular sheet is OK
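
One way around the single NaN code, sketched here in R, is to keep the value and the reason it is missing in separate columns. The raw codes and the reason labels below are invented for the example.

```r
# Hypothetical raw column in which missing values carry different meanings.
raw <- c("12.1", "NA", "<0.5", "13.4", "not observed")

# Keep the numeric value and the reason for missingness in separate columns.
value <- suppressWarnings(as.numeric(raw))
reason <- ifelse(!is.na(value), "observed",
                 ifelse(grepl("^<", raw), "censored", "not observed"))
data.frame(raw, value, reason)
```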

Posted in  Data Documentation


morpho-has-an-issue-with-zero-length-strings-as-missing-data

We encountered a strange issue in Morpho 1.8 when ingesting a CSV that uses zero length strings for missing data. An example of what this looks like is given below, using the mulgara example dataset I’ve shown before. I have added some rows with missing data as zero length strings (these are the “,,” straight after the fictional Treat value in the first eight rows).

R Code:

datatext <- 'Treat, Before, After1, After2
a,,-9999,9999
a,,-9999,9999
a,,-9999,9999
a,,-9999,9999
a,,-9999,9999
a,,-9999,9999
a,,-9999,9999
a,,-9999,9999
0,  2.833213344,    1.609437912,    2.48490665
0,  1.791759469,    2.197224577,    2.079441542
0,  3.044522438,    2.708050201,    3.135494216
0,  2.772588722,    1.791759469,    2.197224577
0,  1.098612289,    1.609437912,    2.63905733
1,  2.944438979,    0.693147181,    1.791759469
1,  2.564949357,    0.693147181,    1.791759469
1,  2.564949357,    1.609437912,    1.609437912
1,  0.693147181,    1.098612289,    1.098612289
1,  1.609437912,    0,      1.098612289'
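# read the text into a data frame; empty numeric fields become NA
analyte <- read.csv(textConnection(datatext))
# write it back out with missing values as zero length strings (the directory must already exist)
write.csv(analyte, "inst/extdata/morpho-bug-empty-na.csv", row.names = FALSE, na = "", quote = FALSE)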

  • Unfortunately for this blog entry, the example above doesn’t exhibit the problem!
  • you can see what happens by selecting the “treat consecutive delimiters as one” option
  • showing that the data from the 3rd and 4th cols shifts to the right
  • this is what happened to us, but we were not able to modify this behaviour by changing that option!
  • The problem only occurred when we had a big file (200+ columns)
  • the problem occurred at col 23, shown in the image below

morpho-bug-empty-na3.png

  • this column is ftyp[12], which has 62 rows of missing values and then some text codes.
  • the next (24th) column is aerial[10], and it is this one that is then read incorrectly by Morpho. The importer reads blanks for the first few rows and then *, even though in reality the column holds * and then the date 1/06/2006 (shown in the image below)

morpho-bug-empty-na2.1.png

  • and finally col 25, fire[13], now appears as * and the date 1/06/2006, but it should be a different number of *s and the date 2/04/2007 further down

morpho-bug-empty-na2.png
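
Before blaming an importer for shifted columns, it can be worth confirming that the CSV itself is rectangular. This is a minimal sketch using base R's count.fields() on a small stand-in file (not our real 200+ column data):

```r
# Small stand-in for the real file: empty fields mark missing values.
csv_text <- c("Treat,Before,After1,After2",
              "a,,-9999,9999",
              "0,2.83,1.61,2.48")
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# count.fields() reports the number of comma-separated fields on every line.
n_fields <- count.fields(tmp, sep = ",")
table(n_fields)
```

If every line reports the same number of fields, the file is consistent and the shift is being introduced on import.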

Workaround

  • to deal with this, in this case we replaced all zero length strings with “NA”
  • this is probably a good idea in most cases, except when NA is not an appropriate value; in those cases we replaced them with the word “blank”
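
The replacement described above can be done when the file is written out. A minimal sketch in base R, using a tiny stand-in dataset rather than our real file:

```r
# Stand-in for the real data: empty fields are read as missing values.
csv_text <- "Treat,Before,Code\na,,\n0,2.83,x"
dat <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

# Write missing values as the explicit string "NA" rather than a zero length string.
out <- tempfile(fileext = ".csv")
write.csv(dat, out, row.names = FALSE, na = "NA", quote = FALSE)
readLines(out)
```

Passing na = "blank" instead would write the word “blank” for the cases where NA is not an appropriate value.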

Conclusions

  • This issue is probably rare and might be fixed in the newer version of Morpho
  • We don’t use the newer Morpho because we are still using the old Metacat
  • I thought it worth reporting here



workaround-for-installing-morpho-on-a-windows-network

Introduction

Morpho is open source software designed to host all kinds of ecological data. It is used to describe data collections. We publish to Metacat using Morpho. For technical reasons we run our own server with an older version of the Metacat software, so we have to run an old version of Morpho (v1.8).

Installation and configuration

TERN specific instructions are replicated across:

Morpho issue on Windows in our network environment

The ANU Fenner School came across this issue. Our local IT department discovered that, when Morpho (v1.8) is installed on Windows in our network environment, it creates a hidden directory called “.morpho” in a path next to the user’s Desktop. This is actually on a network share with a very small quota.

As Morpho writes data there it can easily exceed the quota, causing the software to stop working. All users of Morpho on such a network should check the location of the “.morpho” directory, and if it sits on a share with such a restricted quota then alternative arrangements need to be made.
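
A quick way to see how much space the directory is using is sketched below in R. The location tried here (the home directory) is only an assumption; on our network the directory sat next to the Desktop, so adjust the path to wherever “.morpho” actually is. The helper name dir_size_mb is mine.

```r
# Total size in megabytes of all files under a directory, including hidden ones.
dir_size_mb <- function(path) {
  files <- list.files(path, recursive = TRUE, full.names = TRUE, all.files = TRUE)
  sum(file.info(files)$size, na.rm = TRUE) / 1024^2
}

# Assumed location: adjust to wherever ".morpho" actually sits on your system.
morpho_dir <- file.path(path.expand("~"), ".morpho")
if (dir.exists(morpho_dir)) {
  cat(normalizePath(morpho_dir), ":", round(dir_size_mb(morpho_dir), 1), "MB\n")
}
```

Comparing that figure against the quota on the share gives an early warning before Morpho stops working.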

At the ANU Fenner School we opted to set up local user accounts (the “.morpho” directory is then on the C drive), and the local account is accessible from the main user’s account. The Morpho files are therefore stored on the C drive, which of course exposes them to data loss if the hard disk fails.

We feel that as long as the draft packages are saved to the networked Metacat (with the maximum access restrictions in force for draft work), the draft data package is not lost if the local hard disk dies: it can be accessed again from a new Morpho install and the Metacat server.

Issues remain about the long-term solution to this problem, but this will work in the short term.

We think you can also create a junction (symlink) so that .morpho/ points to a different location (perhaps other network storage with bigger capacity and automated backup).



using-r-eml-to-input-large-numbers-of-variables-part-2

In my previous post I showed a workaround for inputting a larger number of variables than Morpho could handle. I have since found out an interesting thing about the R EML package data.set approach: the unit.defs depend on the type of each column in the data frame (R knows if you try to trick it). Rather than describe all the levels of the factor variables, I wanted simple descriptions for numeric or character columns. If a column is numeric the definition should be number, and if it is character the definition can be the name of the column. Because I am just trying to sidestep Morpho failing to input the 200 plus variables, I only need to create a data frame whose columns are either numeric or character (not factor).

For an example:

R Code:

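library(EML)  # needs the version of the EML package that provides data.set() and eml_write()
fname <- dir("myData", full.names = TRUE, pattern = "csv")
fname
fname <- "myData/paper_data.csv"
dat <- read.csv(fname, stringsAsFactors = FALSE)
# the variable names include special characters such as % and [ so need to do some extra work
names_dat <- read.csv(fname, nrows = 1, header = FALSE, stringsAsFactors = FALSE)
# unlist() turns the one-row data frame of names into a plain character vector
names(dat) <- unlist(names_dat)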
head(dat)
nrow(dat)
# the first 46 variables are nominal, thereafter they are all numeric
str(dat[,1:46])
str(dat[,47:49])
# but these numerics are character for some reason
table(dat[ , 47])
# it is because the raw data had * instead of NA...  I could use na.strings when I call read.table above, or deal with it here
# I'll just convert them to numerics now
dat1 <- dat[,1:46]
dat2 <- dat[,47:ncol(dat)]
dat2[dat2 == "*"] <- NA
for(i in 1:ncol(dat2)){
  dat2[,i]  <- as.numeric(dat2[ , i])
}
str(dat2)
# good that has set them up properly, also make sure all the text are character type
for(i in 1:ncol(dat1)){
  dat1[,i]  <- as.character(dat1[ , i])
}
# recombine
dat <- cbind(dat1,dat2)
str(dat)
# then I add to it the definition for the constructed variables
unit_defs <- list()
for(i in 1:46){
    unit_defs[[i]] <- names(dat[i])
}
for(i in 47:ncol(dat)){
    unit_defs[[i]] <- "number"
}
unit_defs[1:20]
col_defs <- names(dat)
col_defs
#names(dat)
dat <- data.set(dat,
               col.defs = col_defs,
               unit.defs = unit_defs
               )
str(dat)

eml_config(creator="Ivan Charles Hanigan <ivan.hanigan@gmail.com>")
dir()
oldwd <- getwd()
setwd("myData")
eml_write(dat, file = "test_eml.xml", title = "test_eml")
setwd(oldwd)
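
As mentioned in the comments above, the * codes could instead be handled when the file is read. Here is a minimal sketch of that alternative using only base R, on a small made-up file. It assumes * is the only missing data code, and it uses check.names = FALSE to keep the special characters in the column names.

```r
# Stand-in for the real file: awkward column names and "*" marking missing values.
csv_text <- c("site,cover[%],height[m]",
              "A,12.5,*",
              "B,*,3.1")
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# na.strings turns "*" into NA on read, so the measurement columns arrive as numeric.
dat <- read.csv(tmp, na.strings = c("*", "NA"), check.names = FALSE,
                stringsAsFactors = FALSE)
sapply(dat, class)
```

This would avoid both the second read.csv call for the names and the conversion loops, at the cost of hiding the * codes from view.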

Use Morpho to tidy up the Results

  • The result can be imported to Morpho and then edited from there
  • or you can use a code editor like emacs or notepad++ to do bulk find-and-replace operations to fix up issues

Conclusions

  • Morpho failed with 200 plus variables
  • the R EML package succeeded in writing the file
  • I needed to pay careful attention to the type of the variables in the data.frame before running the R EML functions
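
A few cheap checks before calling the EML functions can catch type and length problems early. This is only a sketch with a stand-in data frame; in the code above the real objects are dat, col_defs and unit_defs (where unit_defs is a list rather than a vector).

```r
# Stand-in data frame and definitions; in the post these are dat, col_defs and unit_defs.
dat <- data.frame(site = c("A", "B"), V1 = c(1.2, 3.4), stringsAsFactors = FALSE)
col_defs <- names(dat)
# "number" for numeric columns, otherwise the column's own name.
unit_defs <- ifelse(sapply(dat, is.numeric), "number", names(dat))

# One definition per column, and no factor columns left over.
stopifnot(length(col_defs) == ncol(dat),
          length(unit_defs) == ncol(dat),
          !any(sapply(dat, is.factor)))
unit_defs
```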



Tweaking R Eml Package Outputs With Morpho And Failing That With Emacs

I realised that the work I did yesterday had an error in it. I’d created a list of unit.defs with one fewer entry than I needed, so the variable V100 was given the definition NA! I can fix this by opening that package in Morpho, selecting the column and then, under the Data menu, choosing Edit Column Documentation. There we can change all the definitions. Then I started thinking about changing the type from count to ArealDensity, numberPerMeterSquared. To change all these new variables I don’t want to go through the GUI 95 times! Instead, change the first one and then look at the EML:

EML Code:

<attribute id="1398382425052">
  <attributeName>V99</attributeName>
  <attributeDefinition>count stuff</attributeDefinition>
  <measurementScale>
    <ratio>
      <unit><standardUnit>number</standardUnit></unit>
      <numericDomain><numberType>real</numberType></numericDomain>
    </ratio>
  </measurementScale>
</attribute>
<attribute id="1398382385863">
  <attributeName>V100</attributeName>
  <attributeDefinition>A random variable</attributeDefinition>
  <measurementScale>
    <ratio>
      <unit><standardUnit>numberPerMeterSquared</standardUnit></unit>
      <numericDomain><numberType>real</numberType></numericDomain>
    </ratio>
  </measurementScale>
</attribute>

So it looks like I can do a find and replace in Emacs. Go to the “~/.morpho” database, make a copy of the appropriate EML file (its number was shown in Morpho when the package was saved) and rename the copy so that the minor version is incremented by one. Open this copy and run the find and replace.

Code:

<standardUnit>number<
to
<standardUnit>numberPerMeterSquared<

This was a much quicker way to redefine all of these. I saved the package to our Metacat network and it shows up as I wanted it to.

morpho-wide3.png
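
The same find and replace could be scripted instead of done by hand in an editor. This is a minimal sketch in base R that works on a temporary stand-in file; I did the real edit in Emacs, and the file contents here are made up from the EML fragment above.

```r
# Stand-in for a copy of the EML file; never edit the only copy.
eml_file <- tempfile(fileext = ".xml")
writeLines(c("<unit><standardUnit>number</standardUnit></unit>",
             "<unit><standardUnit>numberPerMeterSquared</standardUnit></unit>"),
           eml_file)

eml <- readLines(eml_file)
# fixed = TRUE treats the pattern as literal text; the trailing "<" stops it
# matching units that merely start with "number".
eml <- gsub("<standardUnit>number<", "<standardUnit>numberPerMeterSquared<",
            eml, fixed = TRUE)
writeLines(eml, eml_file)
readLines(eml_file)
```

Note that this changes every attribute whose unit is number, so it only suits cases where all of them really should be redefined.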

Conclusions

  • Morpho and a code editor can be used in conjunction to edit EML quite well
  • I suspect edits to the EML directly with a code editor are pretty dangerous
  • For example, if you make the wrong change then that EML will likely not be valid and Morpho will complain.
  • But being able to quickly modify things like the unit definitions of 100 variables, rather than opening every one in the GUI, seems worth the risk to me.
