Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed under CC-BY. Find out more here.


guest-post-by-marco-fahmi-why-morpho

I asked my colleague Marco Fahmi to post this as a guest post. It came my way as part of an email exchange in which another colleague had a question about recording previous work in an ecological field study:

ideally in such a way that we would end up with a complete metadata
profile of previous work carried out. I would also like to see the
establishment of a system that could also be used by the group going
forward to keep track of information and data produced at the
site. I hear you have been involved with establishing a metadata
reporting system. How does this effort currently stand; is it online?
I was also wondering if you would be amenable to sharing what you
have with us, with the hope that I could use this as a model for our
own system.

Our colleague Sheila responded first that:

All metadata are created using a standard template document and then transferred into a software package called Morpho. Metadata are then uploaded with the data to the Australian Supersite Network (ASN) Portal http://www.tern-supersites.net.au/knb/. Any other useful documents are uploaded to the website http://www.tern-supersites.net.au, either under the specific supersite tab on the left-hand menu or under the Publications - Resources for SuperSite Users tab.

Marco says:

Morpho is an open source piece of software designed to host all kinds of ecological data. More information about how we use it at ASN can be found here: http://www.tern-supersites.net.au/index.php/data/repository-tutorial

Morpho should be enough for an individual researcher to organise and describe their personal data collection. If you want to share the data with colleagues or publish them online, then you will also need Metacat. There is a worldwide Metacat server available: all you need to do is request an account and connect to it via Morpho. Alternatively you can set up your own, but then you will need your own server and the tech know-how to configure it (and maintain it).

For technical reasons, we have our own server running an older version of the Metacat software. You are welcome to use it if you wish (Sheila can issue you an account to log in and upload). We are also happy to provide assistance if you want to set up a standalone server for DRO. (Like any other piece of infrastructure, someone will need to look after the server after it is set up, so that's probably a decision that will need to be considered carefully.)

My comment:

"Morpho should be enough for an individual researcher to organise
and describe their personal data collection"

I agree, but would emphasise the should and then add a but… Ultimately I’d like to see something as easy and intuitive as iTunes is for music or as Endnote and Mendeley are for bibliographies, but…

Posted in  research methods


a-sharp-looking-orgmode-latex-export-header

  • I got this header for a nice-looking report from Bull, G. (2011). Example Sweave Document. SharpStatistics.co.uk.
  • This has a bunch of useful parameters, but I just really like the header and footer on pages 2 onward.
  • The original was a Sweave file. I really like Sweave, but orgmode allows other languages as well as R to be interwoven into the script.
  • An alternative to Sweave is knitr, which is still on my todo list, but this works well at the moment.
  • I also like how you can quickly change this to a Beamer presentation style.
  • Once this is in your file, use C-c C-e d to export and compile the PDF
  • This example is available at this link

Emacs orgmode Code: Put this into your .org file

#+TITLE: Sharp Report Template
#+AUTHOR: Ivan Hanigan
#+email: ivan.hanigan@anu.edu.au
#+LaTeX_CLASS: article
#+LaTeX_CLASS_OPTIONS: [a4paper]
#+LaTeX_HEADER: \usepackage{amssymb,amsmath}
#+LaTeX_HEADER: \usepackage{fancyhdr} %For headers and footers
#+LaTeX_HEADER: \pagestyle{fancy} %For headers and footers
#+LaTeX_HEADER: \usepackage{lastpage} %For getting page x of y
#+LaTeX_HEADER: \usepackage{float} %Allows the figures to be positioned and formatted nicely
#+LaTeX_HEADER: \floatstyle{boxed} %using this
#+LaTeX_HEADER: \restylefloat{figure} %and this command
#+LaTeX_HEADER: \usepackage{url} %Formatting of URLs
#+LaTeX_HEADER: \lhead{ivanhanigan.github.com}
#+LaTeX_HEADER: \chead{}
#+LaTeX_HEADER: \rhead{\today}
#+LaTeX_HEADER: \lfoot{Draft}
#+LaTeX_HEADER: \cfoot{}
#+LaTeX_HEADER: \rfoot{\thepage\ of \pageref{LastPage}}
#+LATEX: \tableofcontents

* Introduction
This is a sharp looking report template I got from an R blogger \cite{Bull2011}.

The pages after the first page have a nice looking header, footer and page number.
\clearpage

* Section 1
In the Org file you can see some hidden R code that computes a linear regression and returns the results shown in Table \ref{ATable}.
\input{ATable.tex}
\clearpage
*** COMMENT some-code
#+name:some-code
#+begin_src R :session *R* :tangle no :exports none :eval yes
#### name:some-code ####
x<-rnorm(100,10,5)
y<-rnorm(100,20,15)
fit <- lm(y~x)
library(xtable)
sink("ATable.tex")
xtable(fit, caption="Example Table",digits=4,table.placement="H",label="ATable")
sink()
#+end_src

#+RESULTS: some-code



* References
\bibliographystyle{apalike}
\bibliography{/home/ivan_hanigan/references/library}

Posted in  research methods


Setting Up A Workflow Script With Code Chunks

This post describes some ideas and techniques I use to set up a “workflow script”. I use this term to refer to the structured combination of code, data and narrative that makes an executable Reproducible Research Report (RRR).

A lot of these ideas are inspired by a great paper by Kieran Healy called “Choosing Your Workflow Applications”, available at https://github.com/kjhealy/workflow-paper to accompany his Emacs Starter Kit. My shortened version of his main points is:

  • 1 Use a good code editor
  • 2 Analyse data with scripts
  • 3 Store your work simply and document it properly
  • 4 Use a version control system
  • 5 Automate backups
  • 6 Avoid distracting gadgets

Here’s my current approach in each of these categories:

  • 1 Use Emacs with Orgmode (and kjhealy’s drop-in set of useful defaults)
  • 2 Write scripts that use the literate programming technique of mixing Code Chunks in with descriptive prose
  • 3 Organise work with John Myles White’s ProjectTemplate R package and Josh Reich’s LCFD paradigm
  • 4 Use git and GitHub for version control

5 Automated Backups and 6 Avoiding Gadgets are still things I find challenging

1 Use a good code editor

I like using Emacs with Orgmode.

2 Analyse data with Scripts (stitch together code chunks)

I use scripts, but prefer to think of them as Code Chunks stitched together with prose into Compendia.

  • Compendia are documents that weave together Code and Prose into an executable report
  • The underlying philosophy is called Reproducible Research Reports
  • A very useful tool is a keyboard shortcut to quickly create a chunk for code
  • so you can be writing parts of the report like this: “Blah Blah Blah as shown in Figure X and Table Y”
  • then just hit the correct keys and WHAMM-O, there is a new chunk ready for the code that creates Figure X and Table Y to be written.
  • Here is how I use Emacs to achieve this (the other editors I mentioned above have the ability to do this too). The IPython Notebook does this stuff as well, but calls chunks “cells” for some reason.

Emacs Code: Put this into the ~/.emacs.d/init.el file

(define-skeleton chunk-skeleton
  "Info for a code chunk."
  "Title: "
  "*** " str "-code\n"
  "#+name:" str "\n"
  "#+begin_src R :session *R* :tangle src/" str ".r :exports reports :eval no\n"
  "#### name:" str " ####\n"
  "\n"
  "#+end_src\n"
)
(global-set-key [?\C-x ?\C-\\] 'chunk-skeleton)

Using the Emacs Shortcut

  • now whenever you type C-x C-\ a new code chunk will appear
  • you’ll be typing “blah blah blah”, think “I need a figure or table here”, and just hit it
  • move into the empty section and add some code
  • you can hit C-c ' to enter an org-babel code editing session from which you can send these lines one by one to an R session
  • or, within the main org buffer, if your eval flag is set to yes you can run the entire chunk (and return tabular output to the doc) using C-c C-c
  • To export the code chunks and create the modular code scripts without the narrative prose, use C-c C-v t
  • this is called “tangling”: the chunks are written out to the file specified in the chunk header’s “:tangle” flag, as in the sketch below
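
For example, tangling a chunk named some-code created with the skeleton above would write roughly the following to src/some-code.r (a sketch with a made-up chunk body; the exact output depends on your org-babel comment settings):

#### name:some-code ####
x <- rnorm(100)
summary(x)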

Compiling the resulting Compendium

  • Emacs uses LaTeX or HTML to produce the Report
  • I find both of these outputs very pleasing
  • to compile to PDF via LaTeX use C-c C-e d
  • for HTML use C-c C-e h (for code highlighting, install htmlize.el)
  • these commands will also evaluate all the chunks where “:eval” = yes to load the data and calculate the results fresh.
  • AWESOME!

3 Store your work simply and document it properly

  • I use the ProjectTemplate R package to organise my code sections into modules
  • These modules are organised into the Reichian LCFD paradigm (load, clean, func, do) described first on StackOverflow here, and encoded in the makeProject R package; see the sketch below
  • documentation is within the main orgmode script
  • data documentation is a whole other universe that I will deal with in a separate post
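
To make the LCFD idea concrete, here is a minimal sketch of what the four scripts might contain (the file names follow Reich’s convention; the data file and column names are made up for illustration):

## load.r - read the raw data
raw <- read.csv("data/raw_data.csv", stringsAsFactors = FALSE)

## clean.r - fix up the raw data
clean <- raw[!is.na(raw$value), ]

## func.r - define analysis functions, with no side effects
summarise_value <- function(d) summary(d$value)

## do.r - run the analysis and write outputs
print(summarise_value(clean))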

4 Use a version control system (git and GitHub)

# first, create the project via R
R
require(ProjectTemplate)
create.project("AwesomeProject", minimal = T)
q()
# use the shell to start a git repo
cd AwesomeProject
git init
# and commit the TODO
git add TODO
git commit -m "first commit"
# tada!
  • Emacs can now be used to manage the git repo using the C-x g command
  • Rstudio has a really nice GUI for doing this inside its Project management interface too.
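
As a side note on day-to-day use: once a ProjectTemplate project exists, each R session typically starts by loading it. load.project() is ProjectTemplate’s documented entry point; the sketch below assumes you are in the project’s root directory.

# from the project root
require(ProjectTemplate)
load.project()  # loads config and lib/ helpers, reads data/, runs munge/ scripts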

Using Github or another Git Server

Git Code:

cd /path/to/local/git/repo
git remote add origin git@github-or-other-server:myname/myproject.git
git push origin master

5 Automate back ups AND 6 Avoid distracting gadgets

  • OMG, backups stress me out
  • ideally I would follow this advice, because “when it comes to losing your data the universe tends toward maximum irony. Don’t push it.”
  • But I don’t fully comply
  • Instead I generally use Dropbox for basic project management admin stuff
  • I use GitHub for code projects I am happy to share, and I also pay for 10 private repos
  • I set up a git server at my workplace for extra projects, but this is on a test server that is not backed up, and I am not really happy about that
  • As for distracting gadgets: given the current tempo of innovation in software tools for this type of work I should keep trying new things, but I have pretty much settled into a comfortable zone with the gadgets I described here.

Conclusions

  • This is how I’ve worked for a couple of years
  • I find it very enjoyable and mostly productive, though prone to distraction by gadgets
  • The main thing I want to point out is the usage of Code Chunks in RRR scripts.
  • These things are awesome.

Posted in  research methods


sync-endnote-and-mendeley-references-using-r-xml

Background

  • I use Mendeley (despite them being bought out by Elsevier, who used to sell guns)
  • My colleagues use Endnote
  • We need to sync, as they find Endnote better for their workflow
  • I tried to export my Mendeley library as XML and import it into Endnote, but found many duplicates that took time to rectify
  • (and there is a risk that the RefNo they used in a document will be the duplicate that I removed)

Aims

  • test if R and the XML package can help find refs in Endnote that aren’t in Mendeley
  • If so, can I write those into a Mendeley import for seamless integration?
  • and what about going from Mendeley to Endnote?

Methods

  • The R XML package seems an obvious place to start
  • before writing a function, just step through the process

Step 1: export XML from Mendeley and Endnote

  • In Mendeley, just select the refs in the list and then export from the File menu
  • In Endnote the export is also under the File menu

Step 2: R Code:

# func
# might need sudo apt-get install r-cran-xml?
require(XML)

# load
dir()
# [1] "EndnoteCollection.xml"   "MendeleyCollection.Data"
# [3] "MendeleyCollection.xml"

d1 <- xmlTreeParse("EndnoteCollection.xml", useInternal=T)

# clean
str(d1)
# ooooh xml schmexemhel voodoo?

# do
top <- xmlRoot(d1)
str(top)
names(top)
# top[[1]] # prints the whole thing
top[[1]][[1]]
top[[1]][[2]]
# prints a record (1 or 2)

# just messing around
length(top[[1]])
top[[1]][[120]]
names(top[[1]][[120]])
names(top[[1]][[120]][["contributors"]])
names(top[[1]][[120]][["contributors"]][["authors"]])
top[[1]][[120]][["contributors"]][["authors"]][[2]]

i <- 110
top[[1]][[i]]
as.matrix(names(top[[1]][[i]]))

OK so XML as a list.

  • I think if I do a merge of two author-date-title dataframes I can easily find the diffs
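
Worth noting before the function below: the XML package also supports XPath queries, which can pull every value out in one pass instead of indexing node by node. A sketch (the //record path assumes the usual Endnote export structure seen above):

# all titles and years in one pass; NB the two vectors can misalign
# if a record is missing a year
titles <- xpathSApply(d1, "//record/titles/title", xmlValue)
years  <- xpathSApply(d1, "//record/dates/year", xmlValue)
head(cbind(titles, years))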

TRY a square wheel

R Code:

endnote_mendeley_df <- function(input_xml,
                                nrow_to_try = 1000){
  d1 <- xmlTreeParse(input_xml, useInternal=T)
  top <- xmlRoot(d1)
 
  output <- matrix(ncol = 4, nrow = 0)
  for(i in 1:nrow_to_try)
  {
  # i = 1000
    if(is.na(xmlValue(top [[1]][[i]]))) break
    if(
      !is.na(xmlValue(top [[1]][[i]][["contributors"]][["authors"]][[2]]))
      )
    {
      author <- paste(xmlValue(top [[1]][[i]][["contributors"]][["authors"]][[1]]), "et al", " ")
    } else {
      author <- xmlValue(top [[1]][[i]][["contributors"]][["authors"]][[1]])
    }
    year <- xmlValue(top [[1]][[i]][["dates"]][["year"]])
    title <- xmlValue(top [[1]][[i]][["titles"]][[1]])
    endnoteref <- xmlValue(top [[1]][[i]][["rec-number"]])
    output <- rbind(output, c(author, year, title, endnoteref))
 
  }
  output <- as.data.frame(output)
  return(output)
}

R Test:

output <- endnote_mendeley_df(
  input_xml = "EndnoteCollection.xml"
  ,
  nrow_to_try = 10
  )

nrow(output)
write.csv(output, "EndnoteCollection.csv", row.names = F)
output  <- read.csv("EndnoteCollection.csv", stringsAsFactors = F)
str(output)
output[,1:2]

R Do-read:

endnote <- endnote_mendeley_df(
  input_xml = "EndnoteCollection.xml"
  )
nrow(endnote)
mendeley <- endnote_mendeley_df(
  input_xml = "MendeleyCollection.xml"
  )
nrow(mendeley)

R Do-concatenate and lowercase:

# TODO this is a really terrible way to do this.
# FIXME find out how to compare the two better
require(stringr)
mendeley2 <- str_c(mendeley$V1, mendeley$V2, mendeley$V3)
mendeley2 <- gsub(" ", "", mendeley2)
mendeley2 <- gsub(",", "", mendeley2)
mendeley2 <- tolower(mendeley2)
mendeley2[1:5]
mendeley$mendeley2 <- mendeley2

# now do this again from endnote
endnote2 <- str_c(endnote$V1, endnote$V2, endnote$V3)
endnote2 <- gsub(" ", "", endnote2)
endnote2 <- gsub(",", "", endnote2)
endnote2 <- tolower(endnote2)
endnote2[1:5]
endnote$endnote2 <- endnote2

R Do-merge:

endnote_not_in_mendeley <- merge(endnote,
                                 mendeley,
                                 by.x = "endnote2",
                                 by.y = "mendeley2",
                                 all.x = T)
str(endnote_not_in_mendeley)
nrow(endnote_not_in_mendeley)
head(endnote_not_in_mendeley)
endnote_not_in_mendeley <- endnote_not_in_mendeley[
                                                   is.na(endnote_not_in_mendeley$V1.y),
                                                   ]
nrow(endnote_not_in_mendeley)
# 66 refs in endnote are not in mendeley
write.csv(endnote_not_in_mendeley,
      "endnote_not_in_mendeley.csv", row.names = F)

Open this as a spreadsheet and cross-check:

  • make a new column for comments
  • check off which ones were in All Documents and not in the Mendeley group
  • this diff arose because when I imported the Endnote XML I had not yet assigned these refs to the Mendeley group
  • once the Mendeley group is cleaned up, export again and then check which refs are in Mendeley but not in Endnote

First, a note:

  • about a way to speed up the checks, excluding false positives using fuzzy matching
  • my method relies on the author, date and title being written the same way in both, i.e. initials then surname or vice versa
  • But this is not always true
  • I previously used Levenshtein string matching to identify strings that are close but not identical
  • Try this link
  • OR this link
  • TODO I will share this code as a GitHub Gist later!

R Code: possibility to speed up Checks

tmp1 <- mendeley[grep("Walker", mendeley$V1),"mendeley2"]
tmp2 <- endnote[grep("Walker", endnote$V1),"endnote2"]

# these differ slightly
# B. Walker et al vs Walker, Brian et al
source("~/Dropbox/tools/levenshtein.r")
levenshtein(
    tmp1
    ,
    tmp2
    )
# gives a 92% match
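
The sourced levenshtein.r is not shown here (see the TODO above about sharing it as a Gist). As a rough sketch of what such a similarity score can look like, base R (>= 2.15) provides adist(), which computes the Levenshtein distance directly:

# Levenshtein similarity as a proportion of the longer string
levenshtein_pct <- function(a, b) {
  1 - adist(a, b) / pmax(nchar(a), nchar(b))
}
levenshtein_pct("b.walkeretal", "walkerbrianetal")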

R Code: Find Mendeley refs that aren’t in Endnote

endnote <- endnote_mendeley_df(
  input_xml = "EndnoteCollection.xml"
  )
nrow(endnote)
mendeley <- endnote_mendeley_df(
  input_xml = "MendeleyCollection2.xml"
  )
nrow(mendeley)

R Code: Do-concatenate and lowercase

require(stringr)
mendeley2 <- str_c(mendeley$V1, mendeley$V2, mendeley$V3)
mendeley2 <- gsub(" ", "", mendeley2)
mendeley2 <- gsub(",", "", mendeley2)
mendeley2 <- tolower(mendeley2)
mendeley2[1:5]
mendeley$mendeley2 <- mendeley2

# now do this again from endnote
endnote2 <- str_c(endnote$V1, endnote$V2, endnote$V3)
endnote2 <- gsub(" ", "", endnote2)
endnote2 <- gsub(",", "", endnote2)
endnote2 <- tolower(endnote2)
endnote2[1:5]
endnote$endnote2 <- endnote2

R Do-merge:

mendeley_not_in_endnote <- merge(mendeley,
                                 endnote,
                                 by.y = "endnote2",
                                 by.x = "mendeley2",
                                 all.x = T)
str(mendeley_not_in_endnote)
nrow(mendeley_not_in_endnote)
head(mendeley_not_in_endnote)
mendeley_not_in_endnote <- mendeley_not_in_endnote[
                                                      is.na(mendeley_not_in_endnote$V1.y),
                                                      ]
nrow(mendeley_not_in_endnote)
# 92 refs in mendeley are not in endnote
write.csv(mendeley_not_in_endnote,
      "mendeley_not_in_endnote.csv", row.names = F)

Not all of these 92 will be true misses, so let’s try the string matching.

R Code:

source("~/Dropbox/tools/levenshtein.r")
pcnt_threshold <- 0.6
out_list <- matrix(ncol = 3, nrow = 0)
# to resume a previous run, reload what has been done so far:
# out_list <- read.csv("mendeley_not_in_endnote_fz_match.csv", stringsAsFactors = F)
for(i in 36:nrow(mendeley_not_in_endnote))  # starts at 36: resuming an earlier run
{
  print(i)
  tmp1 <- mendeley_not_in_endnote[i, 1]
  for(j in 1:nrow(endnote))
  {
    tmp2 <- endnote$endnote2[j]
    pcnt <- levenshtein(tmp1, tmp2)
    if(pcnt >= pcnt_threshold)
    {
      # record the first fuzzy match, then move on to the next mendeley ref
      out_list <- rbind(out_list, c(tmp1, tmp2, pcnt))
      break
    }
  }
}
out_list
write.csv(out_list, "mendeley_not_in_endnote_fz_match.csv", row.names = F)

R Code: merge the fuzzy matches back in and exclude them

require(stringr)
out_list <- read.csv("mendeley_not_in_endnote_fz_match.csv", stringsAsFactors = F)
mendeley2 <- read.csv("mendeley_not_in_endnote.csv", stringsAsFactors=F)
mendeley2[1,]
out_list[1,]
mendeley2 <- merge(mendeley_not_in_endnote, out_list,
                   by.x = "mendeley2",
                   by.y = "V1", all.x = T)
mendeley2[2,]
mendeley2 <- mendeley2[is.na(mendeley2$V3),]
nrow(mendeley2)
# 48 records
write.csv(mendeley2, "mendeley_not_in_endnote_best_estimate.csv", row.names=F)

Results

  • I found that the XML package in R can work with the Endnote and Mendeley export files
  • I think I made a lot of bad decisions about the way I went about doing this!
  • It seemed quite difficult to get the XML stuff to make sense to me
  • I’ve heard that python has better libraries for working with XML
  • the Levenshtein string matching code proved useful again; I should get out of the habit of looping and start using lapply etc. to speed this up (see the sketch below)
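
On that last point, the double loop above could be replaced with one vectorised call: base R’s adist() computes the whole distance matrix at once. A sketch using the same 0.6 similarity threshold (not the sourced levenshtein(), so the scores may differ slightly):

# full similarity matrix: rows = mendeley keys, cols = endnote keys
d <- adist(mendeley_not_in_endnote$mendeley2, endnote$endnote2)
len <- outer(nchar(mendeley_not_in_endnote$mendeley2),
             nchar(endnote$endnote2), pmax)
sim <- 1 - d / len
has_fuzzy_match <- apply(sim, 1, max) >= 0.6
table(has_fuzzy_match)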

Conclusions

  • This was an interesting, if frustrating, experiment
  • The duplicate problems when importing between Mendeley and Endnote were minor enough that deduplicating with their own tools was probably less work than writing all this half-baked R code
  • But I did learn more about working with XML in R (and realised this is probably not one of R’s strengths – or mine, for that matter!)

Posted in  research methods


git-can-be-simple-or-very-complicated

  • Git is a Distributed Version Control System.
  • The centerforopenscience.org has developed the Open Science Framework, which they say is “a simplified front end to the powerful and popular version control system Git”.
  • I use GitHub a lot for extending the local features into an online space
  • So I finally got around to poking the Open Science Framework with the Hutchinson Drought Index project
  • It turns out to be too simplified, and not to have very many of the features I love about Git and GitHub :-(
  • For example, it is not really distributed, in that you don’t get to sync your local repo with the online version
  • you upload a script or dataset, then continue editing locally until you want to commit, and then you have to upload again, one file at a time with the GUI, rather than with “git add .” and “git push”.
  • I recommend having a look, as it might work for you, but if you want more power check out Yihui’s suggestions for using GitHub: http://yihui.name/en/2011/12/how-to-become-an-efficient-and-collaborative-r-programmer/
  • and http://yihui.name/en/2013/06/fix-typo-in-documentation/

In general I don’t think simple front ends should be a barrier to accessing a sophisticated back end!

Posted in  research methods