Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed CC-BY. Find out more here.

[Image: ONS-SCD.png — Open Notebook Science: Selected Content, Delayed badge]

Template: Reproducible Research with R, TeX and Sweave

  • Last year I wrote to a Professor to let them know that the numbers quantifying their effect estimate (computed by exponentiating the beta coefficient) in their results section were inconsistent with the numbers in their table (the raw coefficients).
  • Happily, the recalculation showed that they had underestimated the effect size: their conclusions were not wrong, just erroneously conservative.
  • Previously I would use Sweave documents to keep track of all my calculations, but I was still copying and pasting the key numbers into the text.
  • I’ve become more interested now in using inline R output via \Sexpr{} (S expressions); a minimal example follows below.
  • This is also the first time I’ve created a report using the Palatino font, on the advice of a new colleague of mine.
  • I popped up a GitHub repo with a Sweave template for Reproducible Reports, with a few notes I made. It looks like this:

[Image: /images/sexpr.png — screenshot of the Sweave template with inline \Sexpr output]
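
As a rough illustration, here is a minimal Sweave fragment (the data frame dat and the variables case and exposure are hypothetical) that computes the effect estimate in a code chunk and then inlines it with \Sexpr{}, so the number in the prose can never drift out of sync with the model:

<<model, echo=FALSE>>=
# fit a hypothetical logistic model and exponentiate the coefficient
# of interest, rather than copy-and-pasting the number into the text
fit <- glm(case ~ exposure, family = binomial, data = dat)
or  <- exp(coef(fit)[["exposure"]])
@
The odds ratio for exposure was \Sexpr{round(or, 2)}.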

Posted in  research methods


Study Group Review of Tree-Based Models for Testing Multiple Working Hypotheses

Yesterday I had the opportunity to review the tree-based modelling methods we are implementing in a study I’m working on at the moment. I based the discussion on a paper from 2012, along with some notes I have made on the use of these methods in our context. I had an hour and a half with a couple of senior statisticians and a bunch of sociologists at the study group.

My main question for the group was:

  • What do you think of the proposition that these models are suitable for “studies that test hypotheses generated from multiple theories”?

The statisticians were underwhelmed by the Cutts study, but thought the stats method was “neat”, and one had not heard of tree models before (!).

Key outcomes of that session were:

  • First stage PCA
    • Q: did they need to do the PCA first stage, either a) to reduce the large number of potentially collinear variables or b) to control for measurement error?
    • A: No, they didn’t need the PCA. The large number of variables, the collinearity and the measurement error are dealt with by the tree-based methods (i.e. cross-validation on steroids, grouping primary and surrogate splits, etc.). The trees also have an advantage with a large number of predictors in that they are non-parametric and able to automatically detect interactions, whereas PCA is parametric and typically assumes linearity. (Note that I saw Steve afterward and he is still not convinced. Maybe he will respond to this with some description of the reason why not?)
  • Rescaling the variables
    • Q: Did they need to centre each variable on the grand sample mean so that the relative weight of each variable was even?
    • A: No, the tree models are not affected by this, and it reduced the interpretability of the decision-tree graphic. The PCA is affected, however, and maybe that is why they did it.
  • Multiple working hypotheses
    • Q: What do you think of the Cutts et al proposition that these models are suitable for “studies that test hypotheses generated from multiple theories”?
    • A: It is a promising approach. It is not clear from this study whether it succeeds, but the proposition is plausible. It has more of a chance than the general linear model approach, which cannot do this (a point made categorically by one of the statisticians). It will be important to be very clear about what exactly each of the “theories” predicts, so that the explanatory power attributable to the variables from one theory can be compared with that of the other theories. The ability to uncover complex interactions is very attractive (extrovert/introvert personality type interacting with group demographics, etc.) but it also increases the potential for spurious results (and ‘chasing the noise’), hence the need to base inferences on what the theories predict, not just on what the data reveal.
  • In general we thought the authors had overdone the stats. It showed what ‘could’ be done with trees and forests, but possibly not what ‘should’ be done.

  • Is the A3 variable really missing from the ctree output but dominant in the randomForest in Figure 3?
    • When another group I’m in reviewed the paper last year, it was noted as odd that Figure 3 has A3 as most important in the randomForest yet it does not appear in the ctree. At the time I thought this might be due to cross-validation as in the rpart or tree packages, but it turns out that ctree does not do that.
    • Mislabelling is a possibility, as these graphics appear to be edited from the defaults produced by the ‘party’ package.
    • However, it is also plausible that these results are correct. ctree (and rpart, tree, etc.) use greedy algorithms, meaning that the best local split is used even if it is suboptimal globally.
    • Basically, there are algorithmic reasons this might be a true difference between ctree and randomForest: it might just be the way the cookie crumbled for ctree’s best local split, which may not have survived the randomForest’s thousands of iterations (see the sketch after this list).
    • But on page 569 they do say “total knowledge is lowest among those who do not think that they are empowered and are unaware of information sources”, which implies to me that the top or left-most node might be mislabelled I3 when it is actually A3.
    • That would give the appropriate left-most box plot (i.e. I3 <= -0.22 & A3 <= -0.64 gives lowest total knowledge??).
    • At the group yesterday people also generally suspected mislabelling, given the dominance of A3 in the randomForest.
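
To make the greedy-split point concrete, here is a minimal sketch on simulated data (the names I3 and A3 are just stand-ins here, not the Cutts et al. data) showing how a single conditional inference tree and a random forest can rank the same predictors differently:

# simulate two real predictors (with an interaction) plus pure noise
library(party)
library(randomForest)
set.seed(42)
n  <- 200
I3 <- rnorm(n)
A3 <- rnorm(n)
z  <- rnorm(n)  # noise variable
y  <- 0.5 * I3 + 0.3 * A3 + 0.4 * I3 * A3 + rnorm(n)
d  <- data.frame(y, I3, A3, z)
# single greedy tree: only the best local split at each node survives
fit_tree <- ctree(y ~ ., data = d)
plot(fit_tree)
# the forest averages over hundreds of bootstrapped trees, so a variable
# dropped by one greedy tree can still rank highly overall
fit_rf <- randomForest(y ~ ., data = d, importance = TRUE)
importance(fit_rf)

Re-running with different seeds shows how fragile the single tree’s variable selection can be compared with the forest’s importance ranking.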

Posted in  research methods overview


Morpho and R EML Use Case: the Marsupial Mulgara (Dasycercus cristicauda)

Aim

  • The EML R package implements the Ecological Metadata Language (EML), an approach that allows archiving of very heterogeneous data without having to standardize everything into a narrow, pre-defined syntax.
  • XML files in specified schemas involve strict criteria and are thus best generated by software.
  • Morpho is an application that provides a GUI-based method of generating EML, but it is a rather tedious tool. Without the ability to script inputs or automatically detect existing data structures, we are forced through the arduous process of re-entering all the metadata annotation each time.
  • The aim of this experiment is to use the EML package to create some advanced metadata quickly and then finish it off with Morpho, using “boilerplate code” wherever possible.
  • This builds on my previous post http://ivanhanigan.github.io/2013/10/morpho-and-reml-streamline-the-process-of-metadata-entry

Background

  • I am basically very lazy when it comes to entering metadata, and when I use the Morpho package for metadata entry I get frustrated with having to step through every SINGLE variable and use the drop-down menus etc. to describe them as essentially “number” or “text”.
  • A big reason I like Morpho is that Metacat is a great data portal and Kepler is a promising scientific workflow tool, and all three are produced by the same group, so it would be great to get them working together.
  • Morpho and Metacat are open source software designed to host all kinds of ecological data. More information about them can be found here.
  • More info about the Metacat Data Portal System is here.
  • For technical reasons I’m working with an older version of the Metacat portal software, so I’m constrained to running the older Morpho version too (but will be upgrading soon).
  • You might want to look at the background of the Ecological Metadata Language (EML) standard. I like this page http://carlboettiger.info/2013/06/23/notes-on-leveraging-the-ecological-markup-language.html along with the references he cites at the bottom.

Material

  • To tie this experiment back to something that is actually useful for a scientist, I will use the field-based example data on the effects on mulgara of removing spinifex, from:

    McCarthy, M. A., & Masters, P. (2005). Profiting from prior information in Bayesian analyses of ecological data. Journal of Applied Ecology, 42(6), 1012–1019. doi:10.1111/j.1365-2664.2005.01101.x

  • Brief description is:

    an experimental manipulation of habitat was conducted by Masters, Dickman & Crowther (2003) in which vegetation cover of a site in arid inland Australia was reduced and the response of the mammal fauna monitored.

  • you can find the data in the download file from the “Code for analysing the mulgara experiment” from here
  • I first checked that these data aren’t already on the KNB repository.
  • I searched for Marsupial, Australia, Mulgara, etc., finding no hits.
  • We will assume that these data are not already published there.

Methods

R Code:

# func: install the ropensci EML package and my disentangle package from
# GitHub (using the old devtools install_github(repo, username) signature)
library("devtools")
install_github("EML", "ropensci")
library("EML")
install_github("disentangle", "ivanhanigan")
library("disentangle")

# load                                                                    
datatext <- 'Treat, Before, After1, After2
0,  2.833213344,    1.609437912,    2.48490665
0,  1.791759469,    2.197224577,    2.079441542
0,  3.044522438,    2.708050201,    3.135494216
0,  2.772588722,    1.791759469,    2.197224577
0,  1.098612289,    1.609437912,    2.63905733
1,  2.944438979,    0.693147181,    1.791759469
1,  2.564949357,    0.693147181,    1.791759469
1,  2.564949357,    1.609437912,    1.609437912
1,  0.693147181,    1.098612289,    1.098612289
1,  1.609437912,    0,      1.098612289'
## strip.white drops the padding after each comma so the column
## names and values parse cleanly
analyte <- read.csv(textConnection(datatext), strip.white = TRUE)

# check
analyte

# do
## from a work dir with a subdir for data
write.csv(analyte, "data/mulgara.csv", row.names = FALSE)
reml_boilerplate(
  data_set   = analyte,
  created_by = "Ivan Hanigan <ivanhanigan@gmail.com>",
  data_dir   = "data",
  titl       = "mulgara",
  desc       = "Experimental data: effect of cover reduction on mulgara Dasycercus cristicauda"
)

  • Now open Morpho and, under File > Import, browse to this data directory and import.
  • I got several warnings: about being unable to display the data, and about an older version that could be updated.

Results

  • The result does not display in Morpho.

[Image: mulgara-morpho-import.png — screenshot of the Morpho import result]

Discussion

  • There is some difference between the way the EML R package writes the data and what Morpho 1.8 expects.
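
A first debugging step would be to eyeball the EML that the R package actually wrote and compare it against a file exported from Morpho itself. A minimal sketch (I search the data directory for the XML rather than hard-coding a filename, since the output name depends on the EML package version):

# list whatever XML the boilerplate step produced and print the first lines
xml_files <- list.files("data", pattern = "\\.xml$", full.names = TRUE)
cat(readLines(xml_files[1], n = 30), sep = "\n")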

Conclusions

  • Further research is required.

Posted in  Data Documentation


GitHub, gh-pages and Disqus comments

A while ago I posted about sharing-and-extending-research-protocols.

I’ve started a new experiment for hosting a discussion around issues, suggesting new issues, agreeing on solutions, toward an agreement on methods that could become a protocol: http://ivanhanigan.github.com/datasharing.

I forked the material from the original author Jeff Leek https://github.com/jtleek/datasharing/network.

The goal of my experiment is something along the lines of the Prometheus Wiki http://prometheuswiki.publish.csiro.au which is a site for sharing research protocols. The idea is to give people a place to post research protocols: everyone develops them and mentions them in papers, but they rarely make it online in a usable format.

But I was talking with a user of that site, and he complained that it lacked “dynamic collaboration with a front-end markup system in place that was integrated with a good website-type backend”. This is what the GitHub site might be able to do.

I discussed this with a colleague and he seemed receptive to experimenting with it, so long as it was not more cumbersome than:

  • shooting off an email with a list of points or
  • catching me in the tea room and saying “by the way - missing values should never be -9999”
  • and then these being copied into a master document we all share.

The system I’m using in the proposed experiment combines the hi-tech tools gh-pages and Disqus comments. This lets:

  • casual users chip in their two cents’ worth quickly via the comments,
  • users vote other people’s comments up or down,
  • participants track the discussion via email (if they choose that option),
  • those wanting deeper involvement fork and edit the pages and then submit pull requests to the lead author.
  • GitHub’s wiki and issue-tracking functionality could also be used for serious development.

Posted in  research methods


CWT LTER Data Submission Template Critique

  • A colleague sent me the cwt_data_subm_template_2013.xls today.
  • You can download a copy here: coweeta.uga.edu/resources/forms/cwt_data_subm_template_2013.xls
  • LTER is the U.S. Long-Term Ecological Research network; CWT is the Coweeta LTER site.
  • I made the following notes; this is not intended to be a nasty critique.
  • These are a few frank and fearless comments I’ll be using to compare the pros and cons of a variety of data documentation approaches.

Critique

  • Opened it first on Windows; saw comments on cells with instructions.
  • Opened it next on Linux with LibreOffice and the comments are gone.
  • It opened at the last tab (split in two for no reason?).
  • Noticed the recommended naming: “GCE site” = Site, otherwise “permanent plot” = Plot?
  • GCE = Georgia Coastal Ecosystems LTER program.
  • Flipping to the first tab, point 4 suggests there is some export functionality I cannot see (a VBA script?).
  • Cell 11 has a NOTE: when submitting updated metadata or re-using templates, please highlight fields with modified contents in yellow.
  • And use glitter pen???
  • The personnel tab is OK.
  • On the instrumentation tab, “variable measured” is free text. OK, but that invites e.g. “max temp”, “temperature maxima”, “maximum temperature (c)”, “maximum temperature in 24 hours after 9am local time in degrees”, etc.
  • Too wide: the last column was off my wide screen! Noticed wasted real estate in column A.
  • The tabular data tab says “Paste or enter your data values into the ‘Values’ section (white cells), starting with the indicated cell”.
  • This is an invitation for clerical error! Too many copy-and-paste actions will inevitably introduce errors.
  • I do like the extra metadata fields: Column Name, Description, Units, Data type, Variable type, Number type, Precision, Code values, Calculations, QC: Minimum Valid, QC: Minimum Expected, QC: Maximum Expected, QC: Maximum Valid, QC: Custom. The template also instructs: fill in missing values in the table with NaN (not a number), including text fields, and do not skip columns.
  • But what about missing values imbued with other meanings (NA = not observed, censored, etc.)? A single catch-all code loses that distinction on import (see the sketch after this list).
  • Asking users to format digit rounding in Excel?? Oh no.
  • Users of older Excel versions (the .xls format) are still restricted to 65,536 rows by 256 columns.
  • The non-tabular sheet is OK.
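
To illustrate the missing-values point above, here is a minimal R sketch (the file cwt_example.csv, the column temp and the codes are all hypothetical). read.csv collapses every code listed in na.strings into a single NA, so the distinction between “not observed” and “censored” is lost unless it is carried in a separate flag column:

# import: every listed missing-value code becomes a plain NA
d <- read.csv("cwt_example.csv", na.strings = c("NaN", "-9999", "ND"))
# re-read as character to recover which code each cell actually held,
# and keep the reason for missingness alongside the data
raw <- read.csv("cwt_example.csv", colClasses = "character")
d$temp_flag <- ifelse(raw$temp == "-9999", "not observed",
                      ifelse(raw$temp == "ND", "censored", "observed"))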

Posted in  Data Documentation