Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licenced with CC-BY. Find out more Here.

ONS-SCD.png

Using the R EML software to mitigate risks in Morpho and Metacat data publishing

Introduction

  • Over the last few months I have used software called Metacat as a Data Portal and Repository. Metacat is server software which has been developed by the Knowledge Network for Biocomplexity (KNB).
  • Metacat conforms to the Ecological Metadata Language (EML) Standard (https://knb.ecoinformatics.org/#external//emlparser/docs/index.html).
  • KNB also develop another software package called Morpho to be used by Ecologists to document their data (https://knb.ecoinformatics.org/#tools/morpho).
  • Morpho can be used to send the data and metadata documents to be published on a Metacat portal.
  • KNB’s software is used internationally by the Data Observation Network for Earth (DataONE) nodes, the United States Long Term Ecological Research (US LTER) network and the International Long Term Ecological Research (ILTER) network.
  • Additionally, the Australian Long Term Ecological Research Network Data Portal (www.ltern.org.au/knb/), Australian SuperSites Network and Australian Centre for Ecological Analysis and Synthesis used the same underlying technology to publish data packages.
  • The Metacat system is great for a data repository but unfortunately (in my experience) the Morpho software package has repeatedly hampered data processing and increased risks of inadvertently publishing data with errors.
  • My colleagues and I work around these problems with a range of different ‘fixes’.
  • Fortunately there is an alternative to Morpho in the R statistical software environment called the R-EML package (https://github.com/ropensci/EML). This provides a library of functions used in the R language to generate and parse EML files.
  • This new workflow mitigates some of the risks of the Morpho software by ensuring the data-related steps of the workflow are conducted in the R environment for statistical computing.
  • However, some issues remain, in that this requires a fairly specialised computing environment with various Linux libraries configured appropriately.

Results

  • I generate EML metadata using REML in the workflow shown in the figure below.

altext

Posted in  morpho data documentation


tuftes-gantt-alternative-for-detail-within-context

  • Towards the end of 2014 I found that the Gantt chart produced by TaskJuggler made it a struggle to achieve any decent task management (it is fine for higher-level overviews though).
  • I had been following the approach described at this link
  • I decided to code up an alternative based on the theory explained on this link

Project Management Graphics (or Gantt Charts), by Edward Tufte

Computer screens are generally too small for an overview of big
serious projects. Horizontal and vertical scrolling are necessary to
see more than about 40 horizontal time lines for a reasonable period
of time. Thus, for large projects, print out the sequence on a big
roll of paper and put it up on a wall.
 
The chart might be retrospective as well as prospective. That is, the
chart should show actual dates of achieved goals, evidence which will
continuously reinforce a reality principle on the mythical future
dates of goal achievement.
 
Most of the Gantt charts are analytically thin, too simple, and lack
substantive detail. The charts should be more intense. At a minimum,
the charts should be annotated--for example, with to-do lists at
particular points on the grid. Costs might also be included in
appropriate cells of the table.
 
About half the charts show their thin data in heavy grid prisons. For
these charts the main visual statement is the administrative grid
prison, not the actual tasks contained by the grid. No explicitly
expressed grid is necessary--or use the ghost-grid graph
paper. Degrid!

The Results:

I used the example for a fictional Journal Paper submission from my favourite reference for anything to do with Project Management:

Aragon, T., Mier, H. M., Payauys, T., & Siador, C. (2012). 
Project Management for Health Professionals.   [http://www.academia.edu/1746564/Project_Management_for_Health_Professionals](http://www.academia.edu/1746564/Project_Management_for_Health_Professionals)    

With the following results (PS SVG format allows you to zoom in).

alttext2

The code:

library(disentangle)
library(sqldf)
library(lubridate)

datin  <- read.csv(
textConnection("
container_task_title  , task_id                      , allocated , fte , blocker               ,       start_date , effort , status , notes 
01 Start              , Start                        , ivan      ,   1 , NA                    ,       2015-03-15 ,     1d , DONE   , NA    
02 Update Lit Review  , Repeat MEDLINE search        , ivan      ,   1 , Start                 ,       2015-03-16 ,     5d , DONE   , NA    
02 Update Lit Review  , Retrieve articles            , ivan      ,   1 , Repeat MEDLINE search ,               NA ,     5d , DONE   , NA    
02 Update Lit Review  , Read articles                , ivan      ,   1 ,                       ,       2015-03-26 ,    11d , DONE   ,       
02 Update Lit Review  , Summarize articles           , ivan      ,   1 ,                       ,       2015-04-06 ,     9d , TODO   ,       
03 Write Draft        , Write introduction           , ivan      ,   1 ,                       ,       2015-04-09 ,     6d , TODO   ,       
03 Write Draft        , Write methods                , ivan      ,   1 , Start                 ,                  ,    15d , TODO   ,       
03 Write Draft        , Write results                , ivan      ,   1 ,                       ,       2015-03-30 ,    10d , TODO   ,       
03 Write Draft        , Write discussion             , ivan      ,   1 ,                       ,       2015-04-15 ,    10d , TODO   ,       
04 Internal Review    , Send to co-author for review , ivan      ,   1 , Write discussion      ,                  ,     2d , TODO   ,        
04 Internal Review    , Revise draft 1               , ivan      ,   1 ,                       ,       2015-04-19 ,    10d , TODO   ,       
05 Peer Review        , Submit article 1             , ivan      ,   1 , Revise draft 1        ,                  ,     5d , TODO   ,       
06 Revise and Resubmit, Revise draft 2               , ivan      ,   1 ,                       ,       2015-04-30 ,    10d , TODO   ,       
06 Revise and Resubmit, Submit article 2             , ivan      ,   1 , Revise draft 2        ,                  ,     5d , TODO   ,       
07 End                , Accepted                     , ivan      ,   1 ,                       ,       2015-05-15 ,     1d , TODO   ,       
"),
stringsAsFactors = FALSE, strip.white = TRUE)
# or 
# datin <- get_gantt_data("gantt_todo", test_data = T) # need to
# adjust min_context_xrange to 2015-01-01 or something
datin$start_date  <- as.Date(datin$start_date)
str(datin)
datin

dat_out <- gantt_data_prep(dat_in = datin)
str(dat_out)
dat_out
svg("tests/gantt_tufte_test.svg",height=10,width=8)
gantt_tufte(dat_out, focal_date = "2015-04-13", time_box = 3*7,
            min_context_xrange = "2015-03-16",
            cex_context_ylab = 0.65, cex_context_xlab = .7,
            cex_detail_ylab = 0.9,  cex_detail_xlab = .4,
            show_today = FALSE)
dev.off()
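
The data above leaves start_date empty for tasks that name a blocker. Below is a minimal sketch of how such dates could be filled in, assuming a blocked task starts when its blocker ends and that effort counts calendar days. resolve_starts is a hypothetical helper written for this note, not the gantt_data_prep implementation from disentangle.

```r
# Three tasks copied from the example data; the third has no start_date.
tasks <- data.frame(
  task_id    = c("Start", "Repeat MEDLINE search", "Retrieve articles"),
  blocker    = c(NA, "Start", "Repeat MEDLINE search"),
  start_date = as.Date(c("2015-03-15", "2015-03-16", NA)),
  effort     = c("1d", "5d", "5d"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill a missing start_date with the blocker's
# start_date plus the blocker's effort (ASSUMPTION: calendar days).
resolve_starts <- function(tasks) {
  days <- as.numeric(sub("d$", "", tasks$effort))
  repeat {
    missing   <- which(is.na(tasks$start_date) & !is.na(tasks$blocker))
    idx       <- match(tasks$blocker[missing], tasks$task_id)
    new_start <- tasks$start_date[idx] + days[idx]
    fillable  <- !is.na(new_start)
    # stop when no further dates can be derived
    if (!any(fillable)) break
    tasks$start_date[missing[fillable]] <- new_start[fillable]
  }
  tasks
}

resolved <- resolve_starts(tasks)
resolved
```

The loop repeats so that chains of blockers resolve in order. A task whose blocker never gets a date is simply left as NA.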

Posted in  project management


climate-grids-and-thredds-server-experimenting

UPDATE

THIS POST IS DEPRECATED. SEE /2015/07/climate-grids-and-thredds-server-experimenting-update

Abstract

  • Climate grids have become readily available via netCDF and THREDDS

Methods

  • Use the R package ncdf, and read each file by changing the URL passed to the ncdf reader.
  • I wanted to convert the matrix to a spatial raster object but struggled with the ‘flipped’ orientation of the input I got.
  • Found this link [[http://stackoverflow.com/a/137]] which says “The reason is that the NetCDF interface you are using is very low-level, and all you have done is read out the variable without any of its dimension information. The orientation of the grid is really arbitrary, and the coordinate information needs to be understood in a particular context” AND SO:
  • first transpose the matrix
  • r<-raster(t(vals), …
  • then in another test of a DIFFERENT grid product I had to use the flip tool
  • d <- flip(r, direction = "y")
  • That flipped around “y”, keeping the georeferencing from the original context.
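
To see why the transpose (and sometimes the flip) is needed, here is a toy sketch in base R only. It assumes a variable read out as a [lon, lat] matrix with latitude stored south to north. The values and coordinates are made up for illustration, and real products may store latitude in either order.

```r
# Toy grid as a low-level netCDF read might return it: rows = lon, cols = lat.
vals <- matrix(1:6, nrow = 3,
               dimnames = list(lon = c("113", "114", "115"),
                               lat = c("-44", "-43")))

# Step 1: transpose so that rows are latitude and columns are longitude.
grid <- t(vals)

# Step 2: ASSUMPTION for this toy grid: latitude runs south to north, so
# row 1 is the southern edge. Reverse the rows so north is at the top,
# which is what flipping a raster around "y" does to the cell values.
north_up <- grid[nrow(grid):1, , drop = FALSE]
north_up
```

If a product already stores latitude north to south, step 1 alone is enough, which would explain why only some grids needed the flip.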

Results

eMAST Grids test

# ref http://www.emast.org.au/observations/climate/
#install.packages("ncdf", type = "source", configure.args="--with-netcdf-include=/usr/include")
require(ncdf)
## Loading required package: ncdf
#install.packages("raster")
require(raster)
## Loading required package: raster
## Loading required package: sp
# install.packages("rgdal")
require(rgdal)
## Loading required package: rgdal
## rgdal: version: 0.9-1, (SVN revision 518)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.9.2, released 2012/10/08
## Path to GDAL shared files: /usr/share/gdal
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
## Path to PROJ.4 shared files: (autodetected)
# if extracting for points shapefile
#shp <- readOGR(dsn="test.shp", layer='test')
#plot(shp, add = T)

# a loop through days, see comment sections that print for debugging
strt <-'2012-01-01'
end <- '2012-01-04'
dates <- seq(as.Date(strt),as.Date(end),1)          
dates
## [1] "2012-01-01" "2012-01-02" "2012-01-03" "2012-01-04"
# if extracting to shp then set up an output dataframe to collect
#dat_out <- as.data.frame(matrix(nrow = 0, ncol = 4))
# else just plots
par(mfrow = c(2,2))
for(i in 1:length(dates)){
#  i=1
  date_i <- dates[i]
  infile <- sprintf("http://dapds00.nci.org.au/thredds/dodsC/rr9/Climate/eMAST/ANUClimate/0_01deg/v1m0_aus/day/land/tmin/e_01/2012/eMAST_ANUClimate_day_tmin_v1m0_%s.nc", gsub("-", "", date_i))

  nc <- open.ncdf(infile)
  vals <- get.var.ncdf(nc, varid="air_temperature")
  nc.att <- nc$var$air_temperature
  xmin <- min(nc.att$dim[[1]]$vals)
  xmax <- max(nc.att$dim[[1]]$vals)
  ymin <- min(nc.att$dim[[2]]$vals)
  ymax <- max(nc.att$dim[[2]]$vals)

  print(c(xmin,xmax))
  print(c(ymin,ymax))

  r <- raster(t(vals),
              xmn=xmin, xmx=xmax,
              ymn=ymin, ymx=ymax)
  #str(r)
  plot(r)
  title(date_i)
  #image(r)
  #e <- extract(r, shp, df=T)
  #str(e) 
  #e1 <- shp@data
  #e1$date <- date_i
  #e1$values <- e[,2]
  #dat_out <- rbind(dat_out, as.data.frame(e1))
}
## [1] 112.905 153.995
## [1] -43.735  -9.005
## [1] 112.905 153.995
## [1] -43.735  -9.005
## [1] 112.905 153.995
## [1] -43.735  -9.005
## [1] 112.905 153.995
## [1] -43.735  -9.005

plot of chunk unnamed-chunk-1

#dat_out
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=C                 LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rgdal_0.9-1   raster_2.3-12 sp_1.0-16     ncdf_1.6.8    knitr_1.8    
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5  formatR_1.0     grid_3.0.1      lattice_0.20-15
## [5] stringr_0.6.2   tools_3.0.1

Posted in  extreme weather events


hotwire-a-morpho-package-using-r-and-xml-editor

Introduction

  • at my workplace we have had a significant number of issues trying to use Morpho to publish ecological data
  • I have been interested in the R EML package as an alternative and got this advice from the authors
  • “I’m really happy to see you combining these tools and making best use of each of their features (authoring for Morpho, automation for REML). We should think more about how to make this combination seamless.” https://github.com/ropensci/EML/issues/93
  • That dream still seems a long way off (and I suspect will see a re-write of morpho in R’s “shiny apps” language)
  • but here is an attempt to circumvent some of our main issues, which seem to be related to using Morpho to do anything with data tables
  • (morpho seems fine when just dealing with the documentation parts of EML)

Create a Minimal Morpho EML

  • I am going to use Morpho to create a new data package and just step through all the steps, adding TBA to all the options
  • another option was to create this using the REML example
  • but then there is uncertainty if it will be accepted easily by Morpho when imported
  • seriously, just TBA and only for the required fields (these are in red)
  • take note of the docid number, and save locally
  • now we can import this as our template. let's save it as a reference file

Code:

cp ~/.morpho/profiles/hanigan/data/hanigan/XX.1 ~/tools/morpho_template_eml.xml

Import to morpho

  • this skeletal eml can now be imported in the File menu / import
  • this is the ‘hotwired’ starting point of our morpho packages
  • I want to do this for others too. so change morpho profile on your computer and import the xml
  • gives a new package with a new morpho generated docid number
  • add the package title, with enough identification in the title to allow quick reference
  • SET THE ACCESS TO NOT PUBLIC BEFORE SAVING

generate the EML Skeleton using the R EML package

Code:

library(devtools)
# install_github() expects the "username/repo" form
install_github("ivanhanigan/disentangle")
library(disentangle)
library(EML)
# dat (the data.frame to document), outdir (output directory) and outfile
# (the CSV file name) must already be defined before running this
unit_defs <- reml_boilerplate(dat, titl = "myFile")
# you just got a quick and dirty unit_defs, these need to be made proper in morpho
# we can get the col names easily
col_defs <- names(dat)
# then create a dataset with metadata
ds <- data.set(dat,
               col.defs = col_defs,
               unit.defs = unit_defs
               )
# now write EML metadata file
eml_config(creator="TBA <fakeaddress@gmail.com>")
wd <- getwd()
setwd(outdir)
eml_write(ds,
          file = sub("\\.csv$", "_eml_skeleton.xml", outfile),
          title = sub("\\.csv$", "", outfile)
          )
# rename the data_table_* CSV file written next to the metadata
csv_written <- dir(pattern="^data_table_")
file.rename(csv_written, outfile)
setwd(wd)

now attach the data

  • to avoid risk of morpho getting confused it is probably safe to let it start the new dataTable tags

morpho-hotwire-1.png

  • say it is a simple CSV
  • give the file name as the title and TBA for all other required fields; for attributes add one and say it is datetime (to be quick)

morpho-hotwire-2.png

  • note the docid (41.2) and download file id (ecogrid://knb/datalibrarian2.42.1)
  • save locally and close

now replace the metadata info

  • use an XML editor to go to the eml skeleton we created in morpho and also to the one created in R
  • in the morpho generated XML find the attributeList tag within the dataTable group, mine looked like

Code:

<attributeList><attribute id="1408536821935">...

  • and we know that the stuff we want to replace finishes with the closing tag </attributeList>
  • I think the ID is irrelevant, except Morpho won’t allow multiple of the same id number within an EML
  • find the same section in the REML generated file and use it to replace the Morpho one

morpho-attributelist.png

And paste:

morpho-dataformat.png

Also do the <dataFormat></dataFormat> section
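
The manual copy and paste can also be sketched in R. This is a text-level sketch using base R only, assuming each file has a single dataTable and so a single <attributeList> block. swap_attribute_list is a hypothetical helper, and the two XML strings are cut-down stand-ins rather than real Morpho or REML output.

```r
# Hypothetical helper: replace the first <attributeList>...</attributeList>
# block in the Morpho EML text with the first one from the R-generated EML.
swap_attribute_list <- function(morpho_xml, reml_xml) {
  # (?s) lets "." match newlines; ".*?" stops at the first closing tag
  pattern <- "(?s)<attributeList.*?</attributeList>"
  donor <- regmatches(reml_xml, regexpr(pattern, reml_xml, perl = TRUE))
  if (length(donor) == 0) stop("no <attributeList> in the R-generated EML")
  target <- regexpr(pattern, morpho_xml, perl = TRUE)
  if (target == -1) stop("no <attributeList> in the Morpho EML")
  regmatches(morpho_xml, target) <- donor
  morpho_xml
}

# Cut-down stand-ins for the two files (illustrative only)
morpho_xml <- paste0(
  '<dataTable><attributeList><attribute id="1408536821935">',
  '<attributeName>TBA</attributeName></attribute></attributeList>',
  '<dataFormat/></dataTable>')
reml_xml <- paste0(
  '<dataTable><attributeList><attribute id="a1">',
  '<attributeName>site</attributeName></attribute></attributeList>',
  '</dataTable>')

hotwired <- swap_attribute_list(morpho_xml, reml_xml)
```

For real files the text could be read with paste(readLines(path), collapse = "\n") and written back with writeLines. It is still worth opening the result in an XML editor to check it, since a regular expression knows nothing about XML structure and the attribute ids must stay unique within the EML.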

add another dataset

  • assuming you already ran reml_boilerplate for this new file
  • use morpho menus to add another data table
  • save locally and close morpho
  • open with XML editor and the REML EML Skeleton
  • transfer dataformat and attributelist stuff
  • go to documentation and make minor mod (space in abstract?), save locally

share it with colleagues

  • go to documentation menu > access info, set/verify that public = no, and add the user
  • save to metacat
  • now go to their (windows?) machine and log in as the same morpho profile,
  • log in to the metacat and download the new package for work

Posted in  Data Documentation


morpho-doesnt-respect-quote-encapsulated-strings-in-csv

  • We discovered this known Morpho bug when we tried to import textual data into Morpho
  • It happens even if you double check that you have selected the correct format for the data (for example, delimited or fixed width) and the correct delimiter.
  • see this bug report: Bug #4636 Morpho should ignore commas inside double-quoted fields (CSV import)
  • the unwanted behaviour is triggered by a comma inside a cell: Morpho doesn’t respect quote-encapsulated strings with commas in them, so it treats the comma as a field delimiter.
  • we do not think we can fix Morpho’s bug, and we cannot upgrade Morpho because of our Metacat version. We just need to find a workaround for it.
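
One possible workaround, sketched in base R: read the CSV with R (which does respect quoted fields) and write it back out tab-delimited, so that commas inside text cells no longer look like delimiters. This assumes that choosing tab as the delimiter in the Morpho import avoids the problem, which has not been verified against our Morpho version. The example data are made up.

```r
# Made-up CSV with a comma inside a double-quoted field
csv_text <- 'site,notes\nA,"wet, sandy soil"\nB,dry'
dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Tabs inside the data would break a tab-delimited file, so check first
has_tab <- any(vapply(dat,
                      function(col) any(grepl("\t", col, fixed = TRUE)),
                      logical(1)))
stopifnot(!has_tab)

# Write tab-delimited text without quotes; commas are now ordinary characters
out_file <- tempfile(fileext = ".txt")
write.table(dat, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
tab_lines <- readLines(out_file)
tab_lines
```

The same check-then-rewrite step could be added to the R side of the workflow before the data table is attached in Morpho.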

morpho-quote-encapsulated.png

Posted in  Data Documentation