Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licenced with CC-BY. Find out more here.

ONS-SCD.png

Starting my Open Notebook Science Blog

Many examples are emerging of scientists who are transitioning to a much more open model of research. This is driven in part externally, by funding bodies (such as the Australian Research Council asking for deposit of funded data and papers) and journals (e.g. Nature journals removing length restrictions on Methods sections). The increasing value placed on transparent, reproducible analysis as a safeguard against error and fraud is also becoming an internal driver within science communities.

Open Notebook Science (ONS) sits at the extreme end of transparent approaches to research. According to the Wikipedia page it is the “practice of making the entire primary record of a research project publicly available online as it is recorded”.

That’s pretty extreme! In my view a lot of stuff in the research project should probably be archived quickly and left to rot.

I like the range of options available. I think I’ll go for SCD or “Selected Content / Delayed” and show their image below. In this model a portion of the open notebook and the associated supporting raw data are made available after some delay. I’ll try to use this blog for weekly updates on progress for each project, and provide links off my ‘Open Notebook’ and ‘Software’ tabs.

ONS-SCD.png

Posted in  research methods


animated-maps

Animated maps to allow exploration of alternate levels of ‘jitter’

In a previous project we published a map of point locations that had been ‘jittered’, i.e. random noise had been added to the latitude and longitude. We chose the amount of noise by testing out a few maps and deciding on one that we thought protected privacy adequately whilst not destroying the spatial pattern we wished to display (evocatively).

Figure 2_FINAL.jpg

I always wondered about a way to do this interactively, and I think the animation package might do the trick, with the ability to step through levels of jittering using the pause, fwd and back buttons.
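As a rough illustration of what "levels of jittering" means, here is a minimal base R sketch. The coordinates are made up, the helper name jitter_points is hypothetical, and uniform noise in decimal degrees is an assumption (the published map may have used a different noise distribution). Each element of the resulting list could serve as one frame of an animation.

```r
# Hypothetical point locations (made-up coordinates, not the study data)
pts <- data.frame(lon = c(115.70, 115.72, 115.75, 115.71),
                  lat = c(-33.30, -33.28, -33.33, -33.31))

# Add uniform random noise of at most `amount` decimal degrees to each coordinate
# (jitter_points is an illustrative helper, not part of any package)
jitter_points <- function(pts, amount) {
  n <- nrow(pts)
  data.frame(lon = pts$lon + runif(n, -amount, amount),
             lat = pts$lat + runif(n, -amount, amount))
}

set.seed(1)                          # make the noise reproducible
amounts <- c(0, 0.005, 0.01, 0.05)   # candidate jitter levels to compare
jittered <- lapply(amounts, function(a) jitter_points(pts, a))
names(jittered) <- paste0("jitter_", amounts)

# Largest displacement in longitude at each level
sapply(jittered, function(j) max(abs(j$lon - pts$lon)))
```

Plotting each element of jittered on a common map extent gives one map per jitter level, so the trade-off between privacy and spatial pattern can be judged side by side.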

Click here for the same data shown in a new animation.

Reference

Vally, H., Peel, M., Dowse, G. K., Cameron, S., Codde, J. P., Hanigan, I., & Lindsay, M. D. A. (2012). Geographic Information Systems used to describe the link between the risk of Ross River virus infection and proximity to the Leschenault estuary, WA. Australian and New Zealand Journal of Public Health, 36(3), 229–235. doi:10.1111/j.1753-6405.2012.00869.x

Posted in  spatial


spatio-temporal-animations

Space time animations are cool

This is a graphic I made a few years ago of temperature in Sydney. It is a GAM with a tensor product smooth of space (a thin plate spline on longitude and latitude) and time (a cubic regression spline on t):

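library(mgcv)
# te(): 2-D thin plate spline on (lon, lat) crossed with a 1-D cubic regression spline on t
fit <- gam(temp ~ te(lon, lat, t, d = c(2, 1), bs = c("tp", "cr")), data = jan06)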

spacetimegamSydney3hrTemp.gif
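Since the original data are not shown, here is a hedged sketch of how frames for this kind of animation could be produced. Everything about the data is an assumption: the data frame sim is simulated, the coordinate ranges are only roughly Sydney-like, and the hourly time index is invented. Only the model formula is taken from the call above. It relies on mgcv, which ships with R as a recommended package.

```r
library(mgcv)

# Simulated stand-in for the temperature data (an assumption, not the real data)
set.seed(1)
n <- 600
sim <- data.frame(lon = runif(n, 150.5, 151.5),
                  lat = runif(n, -34.5, -33.5),
                  t   = sample(0:23, n, replace = TRUE))
# A daily cycle plus a west-east gradient plus noise
sim$temp <- 20 + 5 * sin(2 * pi * sim$t / 24) +
  2 * (sim$lon - 151) + rnorm(n, sd = 0.5)

# Same smooth structure as the model above
fit <- gam(temp ~ te(lon, lat, t, d = c(2, 1), bs = c("tp", "cr")), data = sim)

# Predict on a regular grid for each time step; each element is one frame
grid <- expand.grid(lon = seq(150.5, 151.5, length.out = 20),
                    lat = seq(-34.5, -33.5, length.out = 20))
frames <- lapply(0:23, function(h) {
  newdat <- cbind(grid, t = h)
  cbind(newdat, temp = as.numeric(predict(fit, newdata = newdat)))
})
```

Each element of frames holds the predicted surface for one hour, ready to be drawn as an image or contour plot inside whatever animation loop is used.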

Click here for an example of where I think this kind of animation should go now.

The addition of stop/play/fwd/reverse buttons adds potential for exploratory insights.

The secret to getting the play/pause/next buttons is to put the graphing code in place of the ... in a call to saveHTML from the animation package:

saveHTML(..., outdir = getwd())

Unfortunately the rest of the code that accompanies the graphic above is so specific to my workflow that it is not reproducible. That was back before (or rather, around the time that) I became obsessed with the prospect of creating reproducible data analysis workflows.

Posted in  spatial


Complex Model Selection (or Data Dredging)

Toward Automated Model Selection

The Bluetongue paper we've been discussing at the ANU GIS forum correctly points out that with "the large number of candidate variables … a huge number of models could be considered."

They go on to say that:

"Thus, for practical reasons, we … isolate independently for each of three thematic sets of variables (host-, meteorological- and landscape-related covariates) a combination of variables best fitting the data."

I don't really get this. Why not fit the huge number of models (RAM and disk speed permitting) and let AIC or BIC sift out any combinations that perform well? For example in a simple instance with four explanatory variables and no interactions the rich model would be:

\(Y_{i} = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} X_{4}\)

Lu, Sonya and I are working on a function to generate all the possible combinations (interaction terms can be included too). We also have a link to an old Hadley Wickham quote. That's from 2007, and I still haven't heard about any tool that really does this.

So far we have this code:

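# Build every additive model formula from the candidate explanatory variables.
# yvar: name of the response; xvars: character vector of candidate terms;
# compulsory: optional terms forced into every model (NA means none).
combos <- function(yvar, xvars, compulsory = NA)
  {
    formlas <- list()
    # work down from the richest model to the single-variable models
    for(j in length(xvars):1)
      {
        # each column of combns is one subset of size j
        combns <- combn(xvars, j)
        for(i in seq_len(ncol(combns)))
          {
            terms2include <- combns[,i]
            if(!is.na(compulsory[1]))
              {
                terms2include <- c(terms2include, compulsory)
              }
            formla <- reformulate(terms2include, response = yvar)
            formlas <- c(formlas, formla)
          }
      }
    return(formlas)
  }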

The resulting list of candidate models is:

formlas <- combos(yvar = "deaths",
                  xvars = c("x1", "x2", "x3", "x4")
                  )
paste(formlas)

deaths ~ x1 + x2 + x3 + x4
deaths ~ x1 + x2 + x3
deaths ~ x1 + x2 + x4
deaths ~ x1 + x3 + x4
deaths ~ x2 + x3 + x4
deaths ~ x1 + x2
deaths ~ x1 + x3
deaths ~ x1 + x4
deaths ~ x2 + x3
deaths ~ x2 + x4
deaths ~ x3 + x4
deaths ~ x1
deaths ~ x2
deaths ~ x3
deaths ~ x4
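
As a quick check on the list above: with \(p\) candidate variables and no interactions, each variable is either in or out of a model, so the number of non-empty subsets is

\(\sum_{j=1}^{p} \binom{p}{j} = 2^{p} - 1, \qquad \text{and for } p = 4: \; 4 + 6 + 4 + 1 = 15\)

which matches the 15 formulas listed. The count roughly doubles with each extra variable (\(p = 20\) already gives \(2^{20} - 1 = 1{,}048{,}575\) models), which is presumably the practical concern the paper's authors had in mind.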

2 Compulsory inclusions

In some instances you may want to include a variable in every model; the compulsory argument does this. For example, in spatio-temporal models we could include terms for zone and time in all models while assessing the mix of explanatory variables (ns() comes from the splines package):

formlas <- combos(yvar = "deaths",
                  xvars = c("x1", "x2", "x3", "x4"),
                  compulsory = c("zone", "ns(time, df = 3)")
                  )
paste(formlas)

deaths ~ x1 + x2 + x3 + x4 + zone + ns(time, df = 3)
deaths ~ x1 + x2 + x3 + zone + ns(time, df = 3)
deaths ~ x1 + x2 + x4 + zone + ns(time, df = 3)
deaths ~ x1 + x3 + x4 + zone + ns(time, df = 3)
deaths ~ x2 + x3 + x4 + zone + ns(time, df = 3)
deaths ~ x1 + x2 + zone + ns(time, df = 3)
deaths ~ x1 + x3 + zone + ns(time, df = 3)
deaths ~ x1 + x4 + zone + ns(time, df = 3)
deaths ~ x2 + x3 + zone + ns(time, df = 3)
deaths ~ x2 + x4 + zone + ns(time, df = 3)
deaths ~ x3 + x4 + zone + ns(time, df = 3)
deaths ~ x1 + zone + ns(time, df = 3)
deaths ~ x2 + zone + ns(time, df = 3)
deaths ~ x3 + zone + ns(time, df = 3)
deaths ~ x4 + zone + ns(time, df = 3)

3 AIC vs BIC vs LRTests

Well now we get into a sort of philosophical debate on how to rank all these models. That'll have to wait for another day.
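
Leaving the debate itself aside, the mechanics of ranking are simple enough to sketch. The example below is only an illustration under stated assumptions: the data are simulated (only x1 and x3 truly affect the outcome), a Poisson glm is assumed as the model family, and combos is restated in a compact form (with compulsory defaulting to NULL) so the block runs on its own.

```r
# Compact restatement of combos() so this block is self-contained
combos <- function(yvar, xvars, compulsory = NULL) {
  formlas <- list()
  for (j in length(xvars):1) {
    combns <- combn(xvars, j)
    for (i in seq_len(ncol(combns))) {
      f <- reformulate(c(combns[, i], compulsory), response = yvar)
      formlas[[length(formlas) + 1]] <- f
    }
  }
  formlas
}

# Simulated data (an assumption): deaths depend on x1 and x3 only
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$deaths <- rpois(n, lambda = exp(1 + 0.5 * dat$x1 - 0.4 * dat$x3))

# Fit every candidate model
formlas <- combos("deaths", c("x1", "x2", "x3", "x4"))
fits <- lapply(formlas, function(f) glm(f, family = poisson, data = dat))

# Collect AIC and BIC for each model and sort by AIC (lower is better)
ranking <- data.frame(
  model = vapply(formlas, function(f) paste(deparse(f), collapse = " "),
                 character(1)),
  AIC = vapply(fits, AIC, numeric(1)),
  BIC = vapply(fits, BIC, numeric(1))
)
ranking <- ranking[order(ranking$AIC), ]
head(ranking, 5)
```

Sorting by BIC instead only needs order(ranking$BIC). BIC penalises extra terms more heavily than AIC once the sample is larger than a handful of observations, so the two rankings can disagree about the smaller models.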

Posted in  spatial dependence


Pioz et al 2012 model selection

In the GIS forum SPDEP study group we’ve been discussing the Bluetongue paper http://www.mendeley.com/research/why-did-bluetongue-spread-the-way-it-did

I’d like to know more about the Lagrange Multiplier tests, and Francis raised the seminal Anselin 1988 paper for that.

But in this post I just want to summarise their model selection procedure in a flow diagram:

pioz_modelling.png

Posted in  spatial