Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licenced with CC-BY. Find out more Here.

ONS-SCD.png

Using Reml To Input Large Number Of Column Descriptions

We recently hit an issue when using Morpho to enter metadata for a large number of variables (~200). The GUI form for entering definitions and units steps through each variable, but at about 60 or 70 it starts to slow down. By the time we got to 160 it was taking more than 5 minutes to change to the next variable. To safeguard against losing work, we kept hitting “save for later”, but this got slower and appeared to freeze at the last minute… Not sure if that last save worked at all.

So I've come back once more to the ROpenSci EML package, which is looking like a really useful way to build metadata elements automatically, with Morpho being used to provide augmentation and finesse the documents.

The first thing I tried was the constructed Column Definitions and Unit Definitions example from the README:

R Code:

#require(devtools)
#install_github("EML", "ROpenSci")
require(EML)
 
# The example from orig doco
dat = data.set(river = c("SAC",  "SAC",   "AM"),
               spp   = c("king",  "king", "ccho"),
               stg   = c("smolt", "parr", "smolt"),
               ct    = c(293L,    410L,    210L),
               col.defs = c("River site used for collection",
                            "Species common name",
                            "Life Stage",
                            "count of live fish in traps"),
               unit.defs = list(c(SAC = "The Sacramento River",
                                  AM = "The American River"),
                                c(king = "King Salmon",
                                  ccho = "Coho Salmon"),
                                c(parr = "third life stage",
                                  smolt = "fourth life stage"),
                                "number"))
str(dat)
eml_config(creator="Carl Boettiger <cboettig@gmail.com>")
eml_write(dat, file = "inst/doc/EML_example.xml")
# Now you can import this to Morpho and have a look.
# Note that Morpho 1.8 wants to change the EML version from 2.1.1 to 2.1.0,
# and it can't show the data yet.
# Save and close, and note which number Morpho assigned.
# What we need to do next is write the data as a file to the Morpho database.
dat <- data.frame(dat)
morpho_db <- "~/.morpho/profiles/hanigan/data/hanigan/"
# next free number, based on the existing file names
maxid <- 1 + max(as.numeric(dir(morpho_db)))
filename <- file.path(morpho_db, maxid)
# what is the number?
filename
write.csv(dat, filename, row.names = FALSE, quote = FALSE)
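
The `1 + max(as.numeric(dir(...)))` step above treats file names such as "22.1" as decimal numbers. A slightly more defensive variant is sketched below. It assumes the files in the Morpho data directory are named `<id>.<revision>`; `next_morpho_id` is a hypothetical helper name, not part of Morpho or the EML package.

```r
# Hypothetical helper: find the next free package id in a Morpho data directory.
# Assumes files are named "<id>.<revision>", e.g. "22.1" or "25.2".
next_morpho_id <- function(morpho_db) {
  files <- dir(morpho_db)
  # keep only names that look like "<digits>.<digits>"
  files <- grep("^[0-9]+\\.[0-9]+$", files, value = TRUE)
  if (length(files) == 0) return(1L)
  # the part before the dot is the package id
  ids <- as.integer(sub("\\..*$", "", files))
  max(ids) + 1L
}

# Demonstration on a throwaway directory rather than the real ~/.morpho tree
demo_db <- file.path(tempdir(), "morpho_demo")
dir.create(demo_db, showWarnings = FALSE)
invisible(file.create(file.path(demo_db, c("22.1", "22.2", "25.2"))))
next_morpho_id(demo_db)
```

Whether a new data file should then start at revision 1 (for example "26.1") is an assumption to check against your own Morpho profile.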

So now, to finish, the EML that Morpho has created (22.1 in my case) just needs the reference to the dataTable added.

EML Code:

...
</dataFormat>
<distribution scope="document">
  <online>
    <url function="download">ecogrid://knb/hanigan.22.1</url>
  </online>
  <access authSystem="knb" order="denyFirst">
    <allow>
      <principal>uid=datalibrarian,o=unaffiliated,dc=ecoinformatics,dc=org</principal>
      <permission>all</permission>
    </allow>
    <allow>
      <principal>uid=hanigan,o=unaffiliated,dc=ecoinformatics,dc=org</principal>
      <permission>read</permission>
    </allow>
  </access>
</distribution>
</physical>
...

Which seems to have worked when we open it up again:

moprho-wide1.png
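
If the same distribution block has to be pasted into several packages, it can be generated as text instead of typed by hand. The following is only a sketch in base R: `distribution_fragment` is a hypothetical helper, and the element layout simply mirrors the fragment shown above.

```r
# Hypothetical helper: build the <distribution> fragment shown above as text.
# 'scope' and 'id' together form the ecogrid URL, e.g. hanigan.22.1.
# 'principals' is a named character vector: names are principals, values are permissions.
distribution_fragment <- function(scope, id, principals) {
  allow <- sprintf(
    "    <allow>\n      <principal>%s</principal>\n      <permission>%s</permission>\n    </allow>",
    names(principals), unname(principals))
  paste(c(
    '<distribution scope="document">',
    "  <online>",
    sprintf('    <url function="download">ecogrid://knb/%s.%s</url>', scope, id),
    "  </online>",
    '  <access authSystem="knb" order="denyFirst">',
    allow,
    "  </access>",
    "</distribution>"), collapse = "\n")
}

frag <- distribution_fragment(
  scope = "hanigan", id = "22.1",
  principals = c(
    "uid=datalibrarian,o=unaffiliated,dc=ecoinformatics,dc=org" = "all",
    "uid=hanigan,o=unaffiliated,dc=ecoinformatics,dc=org" = "read"))
cat(frag, "\n")
```

The printed text still has to be pasted into the EML by hand, inside the physical element of the dataTable.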

So now let's try a large number of variables:

R Code:

# add lots of cols
for(i in 5:100){
  dat[,i] <-  sample(rnorm(100,1,2), 3)
}
str(dat)
##  $ V95  : num  1.5708 -0.0936 2.2324
##  $ V96  : num  1.79 5.4 1.62
##  $ V97  : num  -1.141 0.653 5.365
##  $ V98  : num  1.738 -1.046 -0.135
##  $ V99  : num  3.6 -0.738 -1.877
##   [list output truncated]

# firstly I make a list of the unit definitions for the example
unit.defs <- list(c(SAC = "The Sacramento River",
                   AM = "The American River"),
                 c(king = "King Salmon",
                   ccho = "Coho Salmon"),
                 c(parr = "third life stage",
                   smolt = "fourth life stage"))
# then I add to it the definition for the constructed variables
unit.defs[[3]]
for(i in 4:100){
  unit.defs[[i]] <- "number"
}
unit.defs

# and this can be passed to the data.set constructor
# (4 original columns + 96 generated ones, i.e. columns 5 to 100)
dat <- data.set(dat,
                col.defs = c("River site used for collection",
                             "Species common name",
                             "Life Stage",
                             "count of live fish in traps",
                             rep("count stuff", 96)),
                unit.defs = unit.defs
                )
str(dat)
 
eml_config(creator="Ivan Charles Hanigan <ivan.hanigan@gmail.com>")
eml_write(dat, file = "inst/doc/EML_example_wide.xml")
# import to Morpho, save and close
# create the dataset for Morpho's database
dat <- data.frame(dat)
morpho_db <- "~/.morpho/profiles/hanigan/data/hanigan/"
maxid <- 1 + max(as.numeric(dir(morpho_db)))
filename <- file.path(morpho_db, maxid)
# what is the number?
filename
write.csv(dat, filename, row.names = FALSE, quote = FALSE)
# now add this into the EML Morpho has created (25.2 in my case)

Which now seems to have attached the variable definitions and dataTable adequately.

morpho-wide2.png
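
For a real dataset with ~200 variables the definitions would not be generated with `rep()`. One option is to keep them in a lookup table and derive `col.defs` and `unit.defs` from it. The sketch below uses base R only; the table layout and the helper name `build_defs` are assumptions for illustration, not part of the EML package.

```r
# Hypothetical lookup table: one row per variable, typed once in a
# spreadsheet or CSV rather than through the Morpho wizard.
vars <- data.frame(
  name       = c("river", "spp", "stg", "ct", paste0("V", 5:100)),
  definition = c("River site used for collection",
                 "Species common name",
                 "Life Stage",
                 "count of live fish in traps",
                 rep("count stuff", 96)),
  unit       = c(NA, NA, NA, "number", rep("number", 96)),
  stringsAsFactors = FALSE
)

# Code definitions for the categorical variables, as in the example above
code_defs <- list(
  river = c(SAC = "The Sacramento River", AM = "The American River"),
  spp   = c(king = "King Salmon", ccho = "Coho Salmon"),
  stg   = c(parr = "third life stage", smolt = "fourth life stage"))

build_defs <- function(vars, code_defs, data_names) {
  # fail early if the table and the data disagree on names or order
  stopifnot(identical(vars$name, data_names))
  unit.defs <- lapply(seq_len(nrow(vars)), function(i) {
    nm <- vars$name[i]
    if (nm %in% names(code_defs)) code_defs[[nm]] else vars$unit[i]
  })
  list(col.defs = vars$definition, unit.defs = unit.defs)
}

# In practice data_names would be names(dat)
data_names <- c("river", "spp", "stg", "ct", paste0("V", 5:100))
defs <- build_defs(vars, code_defs, data_names)
length(defs$col.defs)   # 100
length(defs$unit.defs)  # 100
```

The two elements of `defs` could then be passed as `col.defs` and `unit.defs` to the `data.set()` call used in this post. The name check catches a mismatch between the number of definitions and the number of columns before any EML is written.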

name: using-reml-to-input-large-number-of-column-descriptions
layout: post
title: Using Reml To Input Large Number Of Column Descriptions
date: 2014-04-24
categories:

  • Data Documentation


Posted in  Data Documentation


Linking EML Packages By Umbrella Project Info

In EML the optional “project” module provides an overall description of the larger-scale project or research context with which that dataset is associated. For the examples in our work the “project” will most often be an LTER (Long Term Ecological Research Network) site that directed the research. Accordingly, the “title” here consists of the name of the LTER site. The “personnel” group contains the same elements as “creator” and “contact”, with the addition of a mandatory “role” element, and it is used to identify the lead PI and/or information manager on the site. Other optional elements in the “project” module include “abstract”, “funding”, “studyAreaDescription”, and “designDescription”, each of which can be used to provide a richer textual description of the LTER site responsible for the research project being documented. If used, the “abstract” includes basic information about the LTER site, such as its general history and administration, while “studyAreaDescription” is more of a physical description of the area where the site is located. This description may also include the “coverage” module, which is fully discussed on page XX of this handbook, or the “citation” module, covered on page XX. The “funding” tag is textual and self-explanatory, but “designDescription” is best used for a description of the site’s database information and availability.

Morpho doesn’t give all these options, so you need to go to the EML file found in the “~/.morpho/…” directory and edit it with a text editor. I think the order of the tags here might make a difference, so I always put the “abstract” tag after the “/personnel” tag. I also think you might need this “para” tag:

XML Code:

<abstract> 
  <para>Prof McMichael set up this group to develop new methods of researching Environmental (especially Climate Change) and Health
  </para>
</abstract>

So this gives the overarching project a valid reference, but how do we provide links for interested readers to find out more? First, we can include a link to the project homepage from the eml/dataset/abstract node. We can also provide URLs in a machine-readable way by inserting an “additionalLinks” node at the bottom of the EML:

XML Code:

<additionalMetadata>
  <metadata>
    <additionalLinks>
      <url name="The name of the homepage">http://...</url>
    </additionalLinks>
  </metadata>
</additionalMetadata>
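
Pasting this block by hand is fine for one package, but for several packages it can be scripted. The following is a minimal sketch in base R that treats the EML as plain text. It assumes the root element is closed with `</eml:eml>`; the helper names and the example.org URL are placeholders for illustration.

```r
# Escape the characters that are not allowed inside XML attributes or text.
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub('"', "&quot;", x, fixed = TRUE)
}

# Hypothetical helper: build an <additionalMetadata> block from named URLs.
# 'links' is a named character vector: names are link titles, values are URLs.
additional_links <- function(links) {
  urls <- sprintf('      <url name="%s">%s</url>',
                  xml_escape(names(links)), xml_escape(unname(links)))
  paste(c("<additionalMetadata>", "  <metadata>", "    <additionalLinks>",
          urls,
          "    </additionalLinks>", "  </metadata>", "</additionalMetadata>"),
        collapse = "\n")
}

# Insert the block just before the closing root tag of an EML document.
add_links_to_eml <- function(eml_text, links) {
  stopifnot(grepl("</eml:eml>", eml_text, fixed = TRUE))
  sub("</eml:eml>", paste0(additional_links(links), "\n</eml:eml>"),
      eml_text, fixed = TRUE)
}

# Toy document standing in for a real Morpho EML file
eml <- "<eml:eml>\n  <dataset/>\n</eml:eml>"
out <- add_links_to_eml(eml, c("The name of the homepage" = "http://example.org/"))
cat(out, "\n")
```

For a real file the text could be read with `paste(readLines(path), collapse = "\n")` and written back with `writeLines()`. Working on a copy is safer than editing the file in the Morpho directory directly.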

Posted in  Data Documentation


Using Morpho for Cataloguing Personal Research Data

1 2014-04-20-using-morpho-orgmode


1.1 Introduction

The collection of scientific data is undertaken at an individual level by everybody in their own way. The layout of data collections that I have seen is incredibly varied; spread across multiple files and folders which can be difficult to navigate or search through. In some cases these collections are incomprehensible to all but the individual themselves. Given that a lot of projects are collaborative in nature and require extensive sharing, it is important that scientists maintain their data collection by some form of system that allows easy data extraction and use in other projects. Therefore, the maintenance of a personal catalogue of datasets is an important activity for scientists.

By cataloguing I mean that a file or database is kept that stores all the information about the names of the datasets (and any other files the data may be spread across), where the datasets are located, any references (papers) that were developed from it and finally important information regarding the conditions it was formed under.

While this may seem laborious, it keeps track of all the data that one has collected over time and gives one a reference system to find a dataset of interest when sharing with collaborators. Datasets can be saved in any filing system the scientist chooses, but with the help of their personal data catalogue they will always know the status of their data collection.

1.2 Cataloguing Personal Research Data with Morpho

[Metacat](https://knb.ecoinformatics.org/knb/docs/intro.html) is an online repository for data and metadata. It is a great resource for the publication of data, but not very useful for an individual scientist to use on their personal computer. However, [Morpho](https://knb.ecoinformatics.org/#tools/morpho), the metadata editor used by Metacat, may be used locally by a researcher to catalogue their collection (and ultimately this will make publishing elements of the collection easier). Morpho uses the Ecological Metadata Language (EML) to author metadata with a graphical user interface wizard.

I am using Morpho 1.8 because my group uses an older Metacat server.

1.3 How Morpho Works

When you install Morpho it creates a directory where you can run the program from, and another hidden directory called ".morpho" for its database of all your metadata (and optionally any data you import to it). Below is an image of mine, with a couple of test records I played around with (the XMLs/HTMLs) and a dataset I imported (the text file).

  • ~/.morpho/profiles/hanigan/data/hanigan/

morphodir1.png

Every time a modification is made to the metadata a new XML is saved here, with the major number being the ID of the package and an incremented minor number reflecting the change.

The GUI is tedious.

1.4 Adding a dataset from my collection

I already have a good amount of metadata that I generated when I published [the drought data](http://dx.doi.org/10.4225/13/50BBFD7E6727A).

1.5 the drought dataset:

Hanigan, Ivan (2012): Monthly drought data for Australia 1890-2008 using the Hutchinson Drought Index. Australian National University Data Commons. DOI: 10.4225/13/50BBFD7E6727A.


1.6 Step One: define the project that I will keep locally

1.7 Contextual Metadata

1.8 Abstract

I originally wrote the abstract as the description for a RIF-CS metadata object to publish for the ANU library.

I got the following instructions from a Librarian: The "informative abstract" method.

  • The abstract should be descriptive of the data, not the research.
  • Briefly outline the relevant project or study and describe the contents of the data package.
  • Include geographic location, the primary objectives of the study, what data was collected (species or phenomena), the year range the data was collected in, and collection frequency if applicable.
  • Describe methodology techniques or approaches only to the degree necessary for comprehension – don’t go into any detail.
  • Cite references and/or links to any publications that are related to the data package.
  • Single paragraph
  • 200-250 words
  • Use active voice and past tense.
  • Use short complete sentences.
  • Express terms in both their abbreviated and spelled out form for search retrieval purposes.

1.9 Australian FOR codes

ANZSRC-FOR Codes: Australian and New Zealand Standard Research Classification – Fields of Research codes allow R&D activity to be categorised according to the methodology used in the R&D, rather than the activity of the unit performing the R&D or the purpose of the R&D. http://www.abs.gov.au/Ausstats/abs@.nsf/Latestproducts/4AE1B46AE2048A28CA25741800044242?opendocument

1.10 GCMD Keywords

Olsen, L.M., G. Major, K. Shein, J. Scialdone, S. Ritz, T. Stevens, M. Morahan, A. Aleman, R. Vogel, S. Leicester, H. Weir, M. Meaux, S. Grebas, C.Solomon, M. Holland, T. Northcutt, R. A. Restrepo, R. Bilodeau, 2013. NASA/Global Change Master Directory (GCMD) Earth Science Keywords. Version 8.0.0.0.0 http://gcmd.nasa.gov/learn/keyword_list.html

1.11 Geographic coverage

> require(devtools)
> install_github("disentangle", "ivanhanigan")
> require(disentangle)
> morphoboundingbox(d)
                  X1                 X2                 X3
1               <NA> 10.1356954574585 S               <NA>
2 112.907211303711 E               <NA> 158.960372924805 E
3               <NA> 54.7538909912109 S               <NA>
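
The output above is laid out like a compass: north on top, west and east in the middle row, south at the bottom. A base R sketch that produces the same layout from longitude and latitude vectors is given below. `bounding_box_grid` is a hypothetical stand-in written for this note, not the function from the disentangle package, and the coordinates are rounded illustrative values.

```r
# Hypothetical helper: arrange the extent of a set of coordinates in a
# 3 x 3 compass layout (north top, west/east middle, south bottom).
bounding_box_grid <- function(lon, lat) {
  fmt_lat <- function(x) paste(abs(x), if (x < 0) "S" else "N")
  fmt_lon <- function(x) paste(abs(x), if (x < 0) "W" else "E")
  data.frame(
    X1 = c(NA, fmt_lon(min(lon)), NA),
    X2 = c(fmt_lat(max(lat)), NA, fmt_lat(min(lat))),
    X3 = c(NA, fmt_lon(max(lon)), NA),
    stringsAsFactors = FALSE
  )
}

# Illustrative corner points only, rounded from the output above
lon <- c(112.9, 158.96)
lat <- c(-10.14, -54.75)
bounding_box_grid(lon, lat)
```

The four values can then be typed into the geographic coverage form in Morpho.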

1.12 Save the metadata

  • the metadata is now ready to save to my .morpho catalogue
  • without importing any data

morphoimg2.png

  • this appears as a new XML

morphoimg3.png

  • which looks like this

morphoimg4.png

1.13 Additional Metadata

As this is metadata only about the dataset, it is inappropriate to refer to related publications etc. in these elements. Luckily EML has the additionalMetadata and additionalLinks fields. Just open the XML and paste the following at the bottom.

<additionalMetadata>
  <metadata>
    <additionalLinks>
      <url name="Hanigan, IC, Butler, CD, Kokic, PN, Hutchinson, MF. Suicide and Drought in New South Wales, Australia, 1970-2007. Proceedings of the National Academy of Science USA 2012, vol. 109 no. 35 13950-13955, doi: 10.1073/pnas.1112965109">http://dx.doi.org/10.1073/pnas.1112965109</url>
    </additionalLinks>
  </metadata>
</additionalMetadata>

You can see this if you then open it up again in morpho and then under the documentation menu go to Add/Edit Documentation.

morphoimg5.png


Posted in  Data Documentation


A Workaround For Inserting Species Names To Morpho

Morpho is a pretty minimal editor for EML really. It gives you a set of generically useful data entry forms but sometimes a specific task is better achieved through edits made directly to the XML document. An example of this is inserting a large number of species names to the taxonomic coverage module. The form to include these requires individual data entry for each species.

morpho-taxo.png

Which looks like this when published

morpho-taxo2.png

Go to the morpho catalogue (found at ~/.morpho/profiles/hanigan/data/hanigan/) and take a look at how the XML is constructed.

XML Code:

<taxonomicCoverage>
  <taxonomicClassification>
    <taxonRankName>Genus</taxonRankName>
    <taxonRankValue>Dasycercus</taxonRankValue>
    <taxonomicClassification>
      <taxonRankName>Species</taxonRankName>
      <taxonRankValue>cristicauda</taxonRankValue>
      <commonName>Mulgara</commonName>
    </taxonomicClassification>
  </taxonomicClassification>
  <taxonomicClassification>
    <taxonRankName>Genus</taxonRankName>
    <taxonRankValue>Homo</taxonRankValue>
    <taxonomicClassification>
      <taxonRankName>Species</taxonRankName>
      <taxonRankValue>sapiens</taxonRankValue>
      <commonName>Modern Human</commonName>
    </taxonomicClassification>
  </taxonomicClassification>
</taxonomicCoverage>

So how to insert a large number of these?

One could use the taxon import feature to import the details from a file if there is a large list. Morpho’s taxon import feature does not correctly import more than one column of taxon data, so if you have Genus and Species to enter you will actually need to combine Genus and Species into a binomial Species rank in one column (beware that if the data are sourced from a datafile that uses the underscore to separate the words then these will not be correctly imported).

Once you have formatted your input list, import it as a new data table and go to the taxonomic coverage form under the documentation menu. Click on the option to “import taxon information from data table” and select the appropriate column, selecting ‘species’ for the class. This will populate the taxonomicCoverage module in the EML. You can now remove that data table from the package to be tidy.
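
Preparing the single binomial column can be done in R before the import. The lines below are a small sketch with made-up input vectors. They show combining separate genus and species columns, and cleaning names that use underscores.

```r
# Hypothetical input: genus and species held in separate columns
genus   <- c("Dasycercus", "Homo")
epithet <- c("cristicauda", "sapiens")
binomial <- paste(genus, epithet)

# Hypothetical input: binomials that use an underscore as the separator
taxa <- c("Abelmoschus_moschatus", "Abrus_precatorius", "Homo sapiens")
# Replace underscores with single spaces and trim stray whitespace
species <- trimws(gsub("_+", " ", taxa))

binomial
species
```

Either vector can be written out with `write.csv()` as a one-column table for Morpho to import.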

This is what it looks like if you combine genus and species in the single column.

morpho-taxo3.png

And here is the XML

XML Code:

<taxonomicCoverage scope="document">
  <taxonomicClassification>
    <taxonRankName>Species</taxonRankName>
    <taxonRankValue>Abelmoschus moschatus</taxonRankValue>
  </taxonomicClassification>
  <taxonomicClassification>
    <taxonRankName>Species</taxonRankName>
    <taxonRankValue>Abrus pector</taxonRankValue>
  </taxonomicClassification>
  <taxonomicClassification>
    <taxonRankName>Species</taxonRankName>
    <taxonRankValue>Abrus precatorius</taxonRankValue>
  </taxonomicClassification>
  ...
</taxonomicCoverage>

Morpho has problems subsequently editing a very long list

We found that if a very large amount of taxonomic information is entered into Morpho, we had issues modifying it. When you click on Documentation > Taxonomic Coverage to try to edit it, nothing happens. Morpho crashes when trying to open the Taxonomic Coverage form because the list is long enough to cause an “Out of Memory” error with the default configuration of Java heap space. It is a Morpho bug. The workaround is to edit the XML file manually.
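
For the manual route, the taxonomicCoverage block can be generated from a table of names and then pasted into the XML. The sketch below uses base R only and mirrors the nested Genus/Species structure shown earlier in this post; the helper names are hypothetical and the output should still be checked against what Morpho itself writes.

```r
# Escape characters that are not allowed inside XML text.
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# Hypothetical helper: one nested Genus/Species block per taxon.
taxon_xml <- function(genus, species, common = NA) {
  # commonName is optional, so it is only written where a value exists
  common_line <- ifelse(is.na(common), "",
    sprintf("\n      <commonName>%s</commonName>", xml_escape(common)))
  sprintf(paste0(
    "  <taxonomicClassification>\n",
    "    <taxonRankName>Genus</taxonRankName>\n",
    "    <taxonRankValue>%s</taxonRankValue>\n",
    "    <taxonomicClassification>\n",
    "      <taxonRankName>Species</taxonRankName>\n",
    "      <taxonRankValue>%s</taxonRankValue>%s\n",
    "    </taxonomicClassification>\n",
    "  </taxonomicClassification>"),
    xml_escape(genus), xml_escape(species), common_line)
}

# Wrap all taxa in a single <taxonomicCoverage> element.
taxonomic_coverage <- function(taxa) {
  body <- taxon_xml(taxa$genus, taxa$species, taxa$common)
  paste(c("<taxonomicCoverage>", body, "</taxonomicCoverage>"), collapse = "\n")
}

taxa <- data.frame(genus   = c("Dasycercus", "Homo"),
                   species = c("cristicauda", "sapiens"),
                   common  = c("Mulgara", NA),
                   stringsAsFactors = FALSE)
cat(taxonomic_coverage(taxa), "\n")
```

Because the block is built outside Morpho, the length of the species list is no longer limited by the Java heap space of the GUI.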

Posted in  Data Documentation


What is this Open Notebook? And Why Am I Doing It?

I just revised the content of the “About My Notebook” page and thought it was also relevant to post as an entry.

Welcome to my Open Notebook

This is the public face of my Open Notebook, in which I keep all the details of the data, code and documents related to my research. This is an Open Notebook with Selected Content - Delayed and aligns with the principles of the Open Notebook Science (ONS) movement. The private side of my Open Notebook (the closed bit) is private either because it includes unpublished work that I wish to keep embargoed until after publication, or because it is all the gory, messy details of the day-to-day business of writing and rewriting code and prose to analyse data and make sense of the data I am analysing. These elements of the notebook do not look like standalone journal entries. I store my personal archive either on GitHub for the public parts (thanks to its superior integration with Jekyll websites through gh-pages for each repository) or on BitBucket for the private bits (thanks to its free unlimited private repositories).

Categories

The different categories can be thought of as separate lab notebooks. My projects are connected by being placed into one of these categories.

What is Open Notebook Science? And Why am I doing it?

In 2005 Jean-Claude Bradley launched a web-based initiative called UsefulChem and named his new technique Open Notebook Science (ONS). He described it as a way of doing science in which you make all your research freely available to the public in real time. The proposed benefits include greater impact on the public good and enhanced ability to connect with like-minded collaborators. Proposed risks of ONS practice include being scooped by competitors or falling foul of Journal rules regarding prior publication and licencing of Intellectual Property. To mitigate the proposed risks the concept of ONS was broadened to allow research to be made public after a delay.

In 2010 Carl Boettiger initiated an experiment “to see if any of the purported benefits or supposed risks were well-founded.” After three years of his experiment Boettiger reported that his “evidence suggests that the practice of open notebook science can facilitate both the performance and dissemination of research while remaining compatible and even synergistic with academic publishing.”

This promising result has inspired me to follow these practices in my own part-time PhD and my full-time work as Data Manager at a University (to the extent I am allowed to by the rules of the University and the willingness of my boss to share our results).

Posted in  overview