Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed CC-BY. Find out more here.


morpho-and-rfigshare

In this Case Study I will use Morpho and compare it directly with reml.

Step 1: Set up Morpho

  • Follow the instructions at the ASN SuperSite website, installing Morpho 1.8 rather than the latest version, because the latest has technical issues that stop it from setting permissions.
  • Configure Morpho. (I will follow the ASN SuperSite instructions, as a future Case Study will use their KNB Metacat service.)
  • Do not configure it to connect to the Metacat repository yet; a password will need to be assigned by the ASN data manager.

Step 2: Look at the reml-created metadata using Morpho

  • Morpho offers to open existing sets for modification.

Code: get location of my example dataset

require(disentangle)
fpath <- system.file(file.path("extdata", "civst_gend_sector.csv"), package="disentangle")
fpath
dirname(fpath)
# [1] "/home/ivan_hanigan/Rlibs/disentangle/extdata"
  • Morpho > File > Import: civst_gend_sector_eml.xml
  • (not the figshare_civst_gend_sector_eml.xml that was created when sending to figshare)
  • Error encountered: "could not open metadata, open empty data package". Morpho offered to upgrade the package (otherwise it is unable to edit), which I accepted.
  • "unable to display data, empty data package will be shown"
  • top menu > Documentation > Add/Edit Documentation

Step 3: Create new datasets with Morpho

Posted in  Data Documentation


dc-uploader-and-ANU-DataCommons

In this post I use the tool produced at the ANU by the DataCommons team. It requires Python 3.

What does it do?

The script only creates new collection records. The functionality to edit records didn’t make it into the script as the expectation is that automated ingests will only require creation of new datasets to which files will be uploaded.

Users can feel free to tweak the collection parameter file to their liking in the development environment until happy with the results.

Create the metadata.txt

You need to get the python scripts and conf file from the ANU DataCommons team. Store these somewhere handy and move to that directory.

Change the anudc.conf: to test out the scripts by creating some sample records, uncomment the "host" field in the file that points to dc7-dev2.anu.edu.au:8443, and comment out the one that points to datacommons.anu.edu.au:8443.
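As a sketch (the exact layout of anudc.conf may differ from this), the relevant part of the file would look something like the following after switching to the dev host:

```ini
# anudc.conf -- illustrative fragment only
# dev server: uncommented while testing sample records
host = dc7-dev2.anu.edu.au:8443
# production server: commented out while testing
#host = datacommons.anu.edu.au:8443
```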

Note that the dev and prod servers issue different tokens; for security reasons you cannot use the same token on both. Storing your username and password in plain text is not recommended and should be used only for debugging purposes. In my case I also had to change the owner group to '5' when creating records in dev; in prod it is '6'.

You can look in the "Keys.txt" file, which contains the full list of values that can be specified in this metadata.txt file.

Code:

setwd("~/tools/dcupload")
sink("metadata.txt")
cat("
# This file, referred to as a collection parameter file, consists of
# data in key=value pairs. This data is sent to the ANU Data Commons
# to create a collection, establish relations with other records,
# and/or upload files to those collections.
 
# The metadata section consists of metadata for use in creation (not
# for modification) of record metadata in ANU Data Commons. The
# following fields are required for the creation of a record. The file
# Keys.txt contains the full list of values that can be specified in
# this file. Based on this information below, a collection record of
# type dataset with the title "Test Collection 6/05/2013" will be
# created owned by Meteorology and Health group.
[metadata]
type = Collection
subType = dataset
ownerGroup = 5
# 6 on production, 5 on dev
name = Civil Status, Gender and Activity Sector
briefDesc = An example, fictional dataset for Decision Tree Models
citationCreator = Ritschard, G. (2006). Computing and using the deviance with classification trees. In Compstat 2006 - Proceedings in Computational Statistics 17th Symposium Held in Rome, Italy, 2006.
email = ivan.hanigan@anu.edu.au
anzforSubject = 1601
 
# The relations section allows you to specify the relation this record
# has with other records in the system.  Currently relations with NLA
# identifiers is not supported.
[relations]
isOutputOf = anudc:123
 
# This section contains a line of the form 'pid = anudc:123' once a
# record has been created so executing the uploader script with the
# same collection parameter file doesn't create a new record with the
# same metadata.
[pid]
")
sink()

# run the dcload
system("python3 dcuploader.py -c metadata.txt")

What happened?

  • Looking in the metadata.txt file, it now contains a pid such as "pid = test:3527".
  • And we have created a new record in our account on the DataCommons server.
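A quick way to confirm the pid was recorded is to grep the collection parameter file. The snippet below fakes a minimal file so it is self-contained; in practice you would grep your real metadata.txt, and the pid value is just the one from this run:

```shell
# simulate the [pid] section that the uploader appends after a successful run
printf '[pid]\npid = test:3527\n' > metadata_demo.txt
# confirm a pid was recorded
grep '^pid' metadata_demo.txt
```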

Go to the website

Now go to the dev site, where you can continue editing the record manually in the browser.

Or, once the wrinkles are ironed out, you can go straight to the production server at This Link.

Uploading the data

The dataset itself is sent by a Java applet in the browser while you are manually editing the record.

Notes

  • After the records get created, the script tries to relate the record to other records as you’ve specified in the collection parameter file in the relations section. If you’re creating a record in dev2, you cannot relate it to a record in production because that record doesn’t exist in dev2. Remember that IDs for records in dev environments have the prefix “test:” while those in production have “anudc:”.

  • Also, when you ran the script against production, the created records were linked to the record with the ID anudc:123. I have now removed those relations. You might want to change that value in your metadata.txt file so that links are established to records your new records can actually be related to. Or, for testing purposes, simply delete the entire [relations] section.

Posted in  Data Documentation


reml-and-rfigshare-part-2

In the last post I explored the functionality of reml. This time I will try to send data to figshare.

  • First follow These Instructions to get rfigshare set up. In particular, store your figshare credentials in ~/.Rprofile.
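For reference, the credentials block in ~/.Rprofile looks something like this. The four option names are those the rfigshare package reads for authentication; the key values shown are placeholders, not real credentials:

```r
# ~/.Rprofile -- figshare credentials picked up by rfigshare's fs_auth()
options(FigshareKey = "XXXXXXXXXXXXXXXXXXXX",
        FigsharePrivateKey = "XXXXXXXXXXXXXXXXXXXX",
        FigshareToken = "XXXXXXXXXXXXXXXXXXXX",
        FigsharePrivateToken = "XXXXXXXXXXXXXXXXXXXX")
```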

Code: reml-and-rfigshare-part-2

# func
require(devtools)
install_github("reml", "ropensci")
require(reml)
install_github("rfigshare", "ropensci")
require(rfigshare)
install_github("disentangle", "ivanhanigan")
require(disentangle)
# load
fpath <- system.file(file.path("extdata","civst_gend_sector_eml.xml"), package = "disentangle")
setwd(dirname(fpath))
obj <- eml_read(fpath)
# clean
obj
# do

## STEP 1: find one of the preset categories
# available. We can ask the API for
# a list of all the categories:
list <- fs_category_list()
list[grep("Survey", list)]

## STEP 2: PUBLISH TO FIGSHARE
id <- eml_publish(fpath,
                  description="Example EML
                    A fictional dataset",
                  categories = "Survey results",
                  tags = "EML",
                  destination="figshare"
                  )
# there are several warnings
# but go to figshare and it has sent the metadata and data OK

# make public using either the figshare web interface, the
# rfigshare package (using fs_make_public(id)) or just by adding
# the argument visibility = TRUE to the above eml_publish
fs_make_public(id)

Now these data are on figshare

Now that I have published the data, they are visible and have a DOI.

Posted in  Data Documentation


data-documentation-case-study-reml-and-rfigshare

Case Study: reml-and-rfigshare

First we will look at the work of the ROpenSci team and the reml package. In the vignette they show how to publish data to figshare using the rfigshare package. figshare is a site where scientists can share datasets/figures/code. Its goals are to encourage researchers to share negative results and to make reproducible research efforts user-friendly. It also uses a tagging system for scientific research discovery, and it gives you unlimited public space and 1GB of private space.

Start by getting the reml package.

Code:

# func
require(devtools)
install_github("reml", "ropensci")
require(reml)
?eml_write

This is the top-level API function for writing EML. The help page is a bit sparse; see This Link for more. For example: "for convenience, dat could simply be a data.frame and reml will launch it's metadata wizard to assist in constructing the metadata based on the data.frame provided. While this may be helpful starting out, regular users will find it faster to define the columns and units directly in the format above."

Now load up the test data for classification trees that I described in This Post.

Code:

install_github("disentangle", "ivanhanigan") # for the data
                                             # described in prev post

# load
fpath <- system.file(file.path("extdata", "civst_gend_sector.csv"),
                     package = "disentangle"
                     )
civst_gend_sector <- read.csv(fpath)

# clean
str(civst_gend_sector)

# do
eml_write(civst_gend_sector,
          creator = "Ivan Hanigan <ivanhanigan@gmail.com>")


# Starts up the wizard, a section is shown below.  The wizard
# prompts in the console and the user writes the answer.

# Enter description for column 'civil_status':
#  marriage status
# column civil_status appears to contain categorical data.
#  
# Categories are divorced/widowed, married, single
#  Please define each of the categories at the prompt
# define 'divorced/widowed':
# was once married
# define 'married':
# still married
# define 'single':
# never married

# TODO I don't really know what activity_sector is.  I assumed
# school because Categories are primary, secondary, tertiary.

# this created "metadata.xml" and "metadata.csv"
file.remove(c("metadata.xml","metadata.csv"))

This was a very minimal data documentation effort; a bit more detail would be better. Because I would now need to re-type all of that in the wizard, I will take the advice of the help file that "regular users will find it faster to define the columns and units directly in the format".

Code:

ds <- data.set(civst_gend_sector,
               col.defs = c("Marriage status", "sex", "education", "counts"),
               unit.defs = list(c("was once married","still married","never married"),
                   c("women", "men"),
                   c("primary school","secondary school","tertiary school"),
                   c("persons"))
               )
ds
# this prints the dataset and the metadata
# now run the EML function
eml_write(ds, 
          title = "civst_gend_sector",  
          description = "An example, fictional dataset for Decision Tree Models",
          creator = "Ivan Hanigan <ivanhanigan@gmail.com>",
          file = "inst/extdata/civst_gend_sector_eml.xml"
          )
# this created the xml and csv without asking anything
# but returned a
## Warning message:
## In `[<-.data.frame`(`*tmp*`, , value = list(civil_status = c(2L,  :
##   Setting class(x) to NULL;   result will no longer be an S4 object

# TODO investigate this?

# now we can access the local EML
obj <- eml_read("inst/extdata/civst_gend_sector_eml.xml")
obj 
str(dataTable(obj))
# returns an error
## Error in plyr::compact(lapply(slotNames(from), function(s) if (!isEmpty(slot(from,  (from attribute.R#300) : 
##   subscript out of bounds

Conclusions

So this looks like a useful tool. Next steps are to:

  • look at sending these data to figshare
  • describe a really really REALLY simple workflow (3 lines? create metadata, eml_write, push to figshare)
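As a first sketch of what that minimal workflow might look like, assuming the data.frame and the figshare credentials are already in place (this reuses the argument names from the code above and is untested here, so treat it as a draft, not a recipe):

```r
# minimal sketch: document, write EML, push to figshare
# (needs figshare credentials and a network connection)
ds <- data.set(civst_gend_sector,
               col.defs = c("Marriage status", "sex", "education", "counts"))
eml_write(ds, title = "civst_gend_sector",
          creator = "Ivan Hanigan <ivanhanigan@gmail.com>",
          file = "civst_gend_sector_eml.xml")
id <- eml_publish("civst_gend_sector_eml.xml",
                  description = "An example, fictional dataset",
                  categories = "Survey results", tags = "EML",
                  destination = "figshare")
```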

Posted in  Data Documentation


two-main-types-of-data-documentation-workflow

This post introduces a new series of blog posts in which I want to experiment with a few tools for data documentation, presented as Case Studies. The series is pitched at a mixed audience of data librarians and data analysts.

Data documentation occurs on a spectrum from simple notes through to elaborate systems. I've been working on a conceptual framework in which the actual process can be done in two distinct ways:

  • Graphical User Interface (GUI) solutions
  • Programmatic (Scripted/Automagic) solutions

I think the GUI tools are in general pretty user-friendly and useful for simple projects with only a small number of datasets, but they have a major drawback for the challenge of heterogeneous data integration. I think the problem is expressed nicely in This Post by Carl Boettiger in reference to Morpho:

  • “looks like a rather useful if tedious tool for generating EML files. Unfortunately, without the ability to script inputs or automatically detect existing data structures, we are forced through the rather arduous process of adding all metadata annotation each time….”
  • “…A package could also provide utilities to generate EML from R objects, leveraging the metadata implicit in R objects that is not present in a CSV (in which there is no built-in notion of whether a column is numeric or character string, what missing value characters it uses, or really if it is consistent at all. Avoiding manual specification of these things makes the metadata annotation less tedious as well.”

Centralised Repository, Distributed Users

A key aspect of current approaches is the existence of a centralised data management system. All the examples I consider include at least a metadata catalogue, and some also include a data repository. An additional feature sometimes exists for managing users' permissions.

The relationship between users and centralised services is a really complicated space, but essentially consists of the ability for users to create the documentation and push it (perhaps along with the data) to the metadata catalogue and/or repository. So given these assumptions I propose the following types of arrangement:

  • user sends metadata to metadata catalogue
  • user sends metadata and data to metadata catalogue and data repository
  • user sends metadata and data and permissions information to metadata catalogue and data repository and permissions system.

The Case Studies I’ve identified that I want to explore are listed below, names follow the format ‘client tool’-and-‘data repository or metadata catalogue’-and-optionally-‘permissions system’:

Programmatic solutions

  • reml-and-rfigshare
  • reml-and-knb (when/if this becomes available)
  • make_ddixml-and-ddiindex-and-orapus
  • r2ddi-ddiindex
  • dc-uploader-and-ANU-DataCommons
  • dc-uploader-and-RDA

Graphical User Interface solutions

  • morpho-and-knb-metacat
  • nesstar-publisher-and-nesstar-and-whatever-Steve-calls-the-ADA-permissions-system
  • xmet-and-Australian-Spatial-Data-Directory
  • sdmx-editor-and-sdmx-registry

Posted in  Data Documentation