Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed under CC-BY. Find out more Here.


dc-uploader-and-ANU-DataCommons

In this post I use the tool produced at the ANU by the DataCommons team. It requires Python 3.

What does it do?

The script only creates new collection records. The functionality to edit records didn’t make it into the script as the expectation is that automated ingests will only require creation of new datasets to which files will be uploaded.

Users can feel free to tweak the collection parameter file to their liking in the development environment until happy with the results.

Create the metadata.txt

You need to get the Python scripts and the conf file from the ANU DataCommons team. Store these somewhere handy and move to that directory.

Change anudc.conf: to test out the scripts by creating some sample records, uncomment the “host” field in the file that points to dc7-dev2.anu.edu.au:8443, and comment out the one that points to datacommons.anu.edu.au:8443.

The dev and prod servers issue different tokens, so for security reasons you cannot use the same token on both. Storing your username and password in plain text is not recommended and should only be done for debugging. In my case I also had to change the owner group to ‘5’ when creating records in dev; in prod it is ‘6’.
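
As a rough illustration, the relevant part of anudc.conf ends up looking something like this when testing against dev (the exact layout and any other keys come with the scripts from the ANU team, so treat this as a sketch only):

# illustrative only - switch the host to the dev server for testing
# host = datacommons.anu.edu.au:8443
host = dc7-dev2.anu.edu.au:8443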

You can look in the “Keys.txt” file for the full list of values that can be specified in this metadata.txt file.

Code:

setwd("~/tools/dcupload")
sink("metadata.txt")
cat("
# This file, referred to as a collection parameter file, consists of
# data in key=value pairs. This data is sent to the ANU Data Commons
# to create a collection, establish relations with other records,
# and/or upload files to those collections.
 
# The metadata section consists of metadata for use in creation (not
# for modification) of record metadata in ANU Data Commons. The
# following fields are required for the creation of a record. The file
# Keys.txt contains the full list of values that can be specified in
# this file. Based on the information below, a collection record of
# subtype dataset named 'Civil Status, Gender and Activity Sector'
# will be created, owned by owner group 5 (the dev owner group).
[metadata]
type = Collection
subType = dataset
ownerGroup = 5
# 6 on production, 5 on dev
name = Civil Status, Gender and Activity Sector
briefDesc = An example, fictional dataset for Decision Tree Models
citationCreator = Ritschard, G. (2006). Computing and using the deviance with classification trees. In Compstat 2006 - Proceedings in Computational Statistics 17th Symposium Held in Rome, Italy, 2006.
email = ivan.hanigan@anu.edu.au
anzforSubject = 1601
 
# The relations section allows you to specify the relation this record
# has with other records in the system.  Currently relations with NLA
# identifiers is not supported.
[relations]
isOutputOf = anudc:123
 
# This section contains a line of the form 'pid = anudc:123' once a
# record has been created so executing the uploader script with the
# same collection parameter file doesn't create a new record with the
# same metadata.
[pid]
")
sink()

# run the dcload
system("python3 dcuploader.py -c metadata.txt")

What happened?

  • Looking in the metadata.txt file, it now has a pid like “pid = test:3527” (see the snippet below).
  • And we have created a new record in our account on the DataCommons server.
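
So the end of metadata.txt should now look something like this (the pid value is illustrative):

[pid]
pid = test:3527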

Go to the website

Now go to the dev site and you can continue editing the record manually in the browser.

Or, once we have ironed out the wrinkles, you could go straight to the production server at This Link.

Uploading the data

The dataset itself gets sent using a Java applet while you are manually editing the record in the browser.

Notes

  • After the records get created, the script tries to relate the record to other records as you’ve specified in the collection parameter file in the relations section. If you’re creating a record in dev2, you cannot relate it to a record in production because that record doesn’t exist in dev2. Remember that IDs for records in dev environments have the prefix “test:” while those in production have “anudc:”.

  • Also, when you ran the script against production the created records were linked to the record with the ID anudc:123. I have now removed those relations. You might want to change that value in your metadata.txt file so the links are established to records that the newly created records can actually be related to. Or, for testing purposes, simply delete the entire [relations] section.

Posted in  Data Documentation


reml-and-rfigshare-part-2

In the last post I explored the functionality of reml. This time I will try to send data to figshare.

  • First follow These Instructions to get rfigshare set up. In particular, store your figshare credentials in ~/.Rprofile.
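
For example, my ~/.Rprofile contains something like the lines below, with dummy values. The option names are as I recall them from the rfigshare setup instructions, so double-check them against These Instructions:

# figshare OAuth credentials (option names per the rfigshare instructions; verify there)
options(FigshareKey          = "your_consumer_key",
        FigsharePrivateKey   = "your_consumer_secret",
        FigshareToken        = "your_token",
        FigsharePrivateToken = "your_token_secret")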

Code: reml-and-rfigshare-part-2

# func
require(devtools)
install_github("reml", "ropensci")
require(reml)
install_github("rfigshare", "ropensci")
require(rfigshare)
install_github("disentangle", "ivanhanigan")
require(disentangle)
# load
fpath <- system.file(file.path("extdata", "civst_gend_sector_eml.xml"),
                     package = "disentangle")
setwd(dirname(fpath))
fname <- basename(fpath)  # the EML file that eml_publish() sends below
obj <- eml_read(fpath)
# clean
obj
# do

## STEP 1: find one of the preset categories
# available. We can ask the API for
# a list of all the categories:
list <- fs_category_list()
list[grep("Survey", list)]

## STEP 2: PUBLISH TO FIGSHARE
id <- eml_publish(fname,
                  description="Example EML
                    A fictional dataset",
                  categories = "Survey results",
                  tags = "EML",
                  destination="figshare"
                  )
# there are several warnings
# but go to figshare and it has sent the metadata and data OK

# make public using either the figshare web interface, the
# rfigshare package (using fs_make_public(id)) or just by adding
# the argument visibility = TRUE to the above eml_publish
fs_make_public(id)

Now these data are on figshare

Now that I have published the data, they are visible and have a DOI.

Posted in  Data Documentation


data-documentation-case-study-reml-and-rfigshare

Case Study: reml-and-rfigshare

First we will look at the work of the rOpenSci team and the reml package. In the vignette they show how to publish data to figshare using the rfigshare package. figshare is a site where scientists can share datasets, figures and code. Its goals are to encourage researchers to share negative results and to make reproducible research user-friendly. It also uses a tagging system for scientific research discovery. They give you unlimited public space and 1 GB of private space.

Start by getting the reml package.

Code:

# func
require(devtools)
install_github("reml", "ropensci")
require(reml)
?eml_write

This is the top-level API function for writing EML. The help page is a bit sparse; see This Link for more. For example: “for convenience, dat could simply be a data.frame and reml will launch its metadata wizard to assist in constructing the metadata based on the data.frame provided. While this may be helpful starting out, regular users will find it faster to define the columns and units directly in the format above.”

Now load up the test data for classification trees that I described in This Post.

Code:

install_github("disentangle", "ivanhanigan") # for the data
                                             # described in prev post

# load
fpath <- system.file(file.path("extdata", "civst_gend_sector.csv"),
                     package = "disentangle"
                     )
civst_gend_sector <- read.csv(fpath)

# clean
str(civst_gend_sector)

# do
eml_write(civst_gend_sector,
          creator = "Ivan Hanigan <ivanhanigan@gmail.com>")

# Starts up the wizard, a section is shown below.  The wizard
# prompts in the console and the user writes the answer.

# Enter description for column 'civil_status':
#  marriage status
# column civil_status appears to contain categorical data.
#  
# Categories are divorced/widowed, married, single
#  Please define each of the categories at the prompt
# define 'divorced/widowed':
# was once married
# define 'married':
# still married
# define 'single':
# never married

# TODO I don't really know what activity_sector is.  I assumed
# school because Categories are primary, secondary, tertiary.

# this created "metadata.xml" and "metadata.csv"
file.remove(c("metadata.xml","metadata.csv"))

This was a very minimal data documentation effort; a bit more detail would be better. Because I would otherwise need to re-type all of that in the wizard, I will take the advice of the help file that “regular users will find it faster to define the columns and units directly in the format”.

Code:

ds <- data.set(civst_gend_sector,
               col.defs = c("Marriage status", "sex", "education", "counts"),
               unit.defs = list(c("was once married","still married","never married"),
                   c("women", "men"),
                   c("primary school","secondary school","tertiary school"),
                   c("persons"))
               )
ds
# this prints the dataset and the metadata
# now run the EML function
eml_write(ds, 
          title = "civst_gend_sector",  
          description = "An example, fictional dataset for Decision Tree Models",
          creator = "Ivan Hanigan <ivanhanigan@gmail.com>",
          file = "inst/extdata/civst_gend_sector_eml.xml"
          )
# this created the xml and csv without asking anything
# but returned a
## Warning message:
## In `[<-.data.frame`(`*tmp*`, , value = list(civil_status = c(2L,  :
##   Setting class(x) to NULL;   result will no longer be an S4 object

# TODO investigate this?

# now we can access the local EML
obj <- eml_read("inst/extdata/civst_gend_sector_eml.xml")
obj 
str(dataTable(obj))
# returns an error
## Error in plyr::compact(lapply(slotNames(from), function(s) if (!isEmpty(slot(from,  (from attribute.R#300) : 
##   subscript out of bounds

Conclusions

So this looks like a useful tool. Next steps are to:

  • look at sending these data to figshare
  • describe a really really REALLY simple workflow (3 lines? create metadata, eml_write, push to figshare); a rough sketch of what that might look like follows.
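
As a starting point, here is a minimal, untested sketch of that workflow, reusing the column and unit definitions from above together with the eml_publish() arguments explored in the reml-and-rfigshare-part-2 post; treat it as an outline rather than a verified recipe.

# minimal sketch (untested): document the data, write EML, publish to figshare
require(reml); require(rfigshare); require(disentangle)
civst_gend_sector <- read.csv(
    system.file(file.path("extdata", "civst_gend_sector.csv"),
                package = "disentangle"))
ds <- data.set(civst_gend_sector,
               col.defs = c("Marriage status", "sex", "education", "counts"),
               unit.defs = list(c("was once married", "still married", "never married"),
                                c("women", "men"),
                                c("primary school", "secondary school", "tertiary school"),
                                c("persons")))
eml_write(ds, title = "civst_gend_sector",
          creator = "Ivan Hanigan <ivanhanigan@gmail.com>",
          file = "civst_gend_sector_eml.xml")
eml_publish("civst_gend_sector_eml.xml",
            description = "An example, fictional dataset for Decision Tree Models",
            categories = "Survey results", tags = "EML",
            destination = "figshare", visibility = TRUE)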

Posted in  Data Documentation


two-main-types-of-data-documentation-workflow

This post introduces a new series of blog posts in which I want to experiment with a few tools for data documentation, which I’ll present as Case Studies. This series of posts will be pitched at a mixed audience of data librarians and data analysts.

Data documentation occurs on a spectrum from simple notes through to elaborate systems. I’ve been working on a conceptual framework for how the actual process can be done, in two distinct ways:

  • Graphical User Interface (GUI) solutions
  • Programmatic (Scripted/Automagic) solutions

I think the GUI tools are in general pretty user-friendly and useful for simple projects with only a small number of datasets, but they have a major drawback when it comes to the challenge of heterogeneous data integration. I think the problem is expressed nicely In This Post By Carl Boettiger in reference to Morpho:

  • “looks like a rather useful if tedious tool for generating EML files. Unfortunately, without the ability to script inputs or automatically detect existing data structures, we are forced through the rather arduous process of adding all metadata annotation each time….”
  • “…A package could also provide utilities to generate EML from R objects, leveraging the metadata implicit in R objects that is not present in a CSV (in which there is no built-in notion of whether a column is numeric or character string, what missing value characters it uses, or really if it is consistent at all. Avoiding manual specification of these things makes the metadata annotation less tedious as well.”

Centralised Repository, Distributed Users

A key aspect of current approaches is the existence of a centralised data management system. All the examples I consider include at least a metadata catalogue, and some also include a data repository. An additional feature sometimes exists for managing users’ permissions.

The relationship between users and centralised services is a really complicated space, but essentially consists of the ability for users to create the documentation and push it (perhaps along with the data) to the metadata catalogue and/or repository. So given these assumptions I propose the following types of arrangement:

  • user sends metadata to metadata catalogue
  • user sends metadata and data to metadata catalogue and data repository
  • user sends metadata and data and permissions information to metadata catalogue and data repository and permissions system.

The Case Studies I’ve identified that I want to explore are listed below; names follow the format ‘client tool’-and-‘data repository or metadata catalogue’-and-optionally-‘permissions system’:

Programmatic solutions

  • reml-and-rfigshare
  • reml-and-knb (when/if this becomes available)
  • make_ddixml-and-ddiindex-and-orapus
  • r2ddi-ddiindex
  • dc-uploader-and-ANU-DataCommons
  • dc-uploader-and-RDA

Graphical User Interface solutions

  • morpho-and-knb-metacat
  • nesstar-publisher-and-nesstar-and-whatever-Steve-calls-the-ADA-permissions-system
  • xmet-and-Australian-Spatial-Data-Directory
  • sdmx-editor-and-sdmx-registry

Posted in  Data Documentation


wickhams-tidy-tools-only-get-you-90-pct-the-way

Hadley Wickham’s tidy tools

In this video, at 8 minutes 50 seconds, he says “these four tools do 90% of the job”:

  • subset,
  • transform,
  • summarise, and
  • arrange

TODO: I noticed on the website for an RStudio course that transform has been replaced by mutate as one of the “four basic verbs of data manipulation”.

Tidy Data from Drew Conway on Vimeo.

So I thought: what’s the other 10%? Here are a few contenders from my own work (a quick sketch of a couple of them follows the list):

  • merge
  • reshape::cast and reshape::melt
  • unlist
  • t() transpose
  • sprintf or paste
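
The sketch below illustrates a couple of these with the built-in airquality data used in the examples that follow; it assumes the reshape package is installed.

require(reshape)  # for melt() and cast()

# merge: join a lookup table of month names onto airquality
months <- data.frame(Month = 5:9,
                     MonthName = c("May", "Jun", "Jul", "Aug", "Sep"))
aq <- merge(airquality, months, by = "Month")

# melt/cast: reshape to long format, then cast a summary table of means
molten <- melt(aq[, c("MonthName", "Ozone", "Temp")],
               id.vars = "MonthName", na.rm = TRUE)
cast(molten, MonthName ~ variable, mean)

# t() and sprintf round out the set
t(head(aq, 3))
sprintf("Mean Temp in %s was %.1f F", "May",
        mean(aq$Temp[aq$Month == 5]))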

R-subset

# Filter rows by criteria
subset(airquality, Temp > 90, select = c(Ozone, Temp))

## NB This is a convenience function intended for use interactively.  For
## programming it is better to use the standard subsetting functions like
## ‘[’, and in particular the non-standard evaluation of argument
## ‘subset’ can have unanticipated consequences.

with(airquality,
     airquality[Temp > 90, c("Ozone", "Temp")]
     )

# OR

airquality[airquality$Temp > 90, c("Ozone", "Temp")]

R-transform

# New columns that are functions of other columns
df <- transform(airquality,
                new = -Ozone,
                Temp2 = (Temp-32)/1.8
                )
head(df)

R-mutate

require(plyr)
# same thing as transform
df <- mutate(airquality, new = -Ozone, Temp = (Temp - 32) / 1.8)    
# Things transform can't do
df <- mutate(airquality, Temp = (Temp - 32) / 1.8, OzT = Ozone / Temp)

# mutate is rather faster than transform
system.time(transform(baseball, avg_ab = ab / g))
system.time(mutate(baseball, avg_ab = ab / g))

R-summarise

# New data.frame where columns are functions of existing columns
require(plyr)    
df <- ddply(.data = airquality,
            .variables = "Month",
            .fun = summarise,
            tmax = max(Temp),
            tav = mean(Temp),
            ndays = length(unique(Day))
            )
head(df)

Passing variables to ddply for summary

# Notice how the name of the variable Temp doesn't need quotes?
# this means that you need to hard code the names
# But if you want to pass variables to this inside a function we need a
# different approach.

summarise_df  <- function(x, by, var1, var2, var3)
  {
    data_out <- ddply(x,
                      by,
                      function(df) return(
                        c(
                          tmax = max(df[,var1]),
                          tav = mean(df[,var2]),
                          ndays = length(unique(df[,var3]))
                          )
                        )
                      )
    return(data_out)
  }

df2 <- summarise_df(x = airquality, by = "Month",
                   var1 = "Temp", var2 = "Temp", var3 = "Day"
                   )

head(df2)
all.equal(df,df2)
# TRUE

Another alternative, if we want to pass the name of the dataset as a string too:

summarise_df2  <- function(x, by, var1, var2, var3)
  {
    data_out <- eval(
      parse(
        text =
        sprintf(
          "ddply(.data = %s,
            .variables = '%s',
            .fun = summarise,
            tmax = max(%s),
            tav = mean(%s),
            ndays = length(unique(%s))
            )", x, by, var1, var2, var3
          )
        )
      )
    return(data_out)
  }

df3 <- summarise_df2(x = "airquality", by = "Month",
                     var1 = "Temp", var2 = "Temp", var3 = "Day"
                     )
head(df3)
all.equal(df, df3)
# TRUE

R-arrange

# Re-order the rows of a data.frame
df <- arrange(airquality, Temp, Ozone)
head(df)

Posted in  research methods