Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licenced with CC-BY. Find out more Here.

ONS-SCD.png

Complex Model Selection (or Data Dredging)

Toward Automated Model Selection

The Bluetongue paper we've been discussing at the ANU GIS forum correctly points out that with the "the large number of candidate variables … a huge number of models could be considered."

They go on to say that:

"Thus, for practical reasons, we … isolate independently for each of three thematic sets of variables (host-, meteorological- and landscape-related covariates) a combination of variables best fitting the data."

I don't really get this. Why not fit the huge number of models (RAM and disk speed permitting) and let AIC or BIC sift out any combinations that perform well? For example in a simple instance with four explanatory variables and no interactions the rich model would be:

\(Y_{i} = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} X_{4}\)

Lu, Sonya and I are working on a function to do all the possible combos (interaction terms are possible to include too). So far we have this code and a link to an old Hadley Wickham quote. That's from 2007, and still haven't herd about any tool developed that really does this.

So far we have this code:

combos  <- function(yvar, xvars, compulsory = NA)
  {
    formlas <- NULL
    for(j in length(xvars):1)
      {
        combns <- combn(xvars, j)
        for(i in 1:ncol(combns))
          {
            terms2include <- combns[,i]
            if(!is.na(compulsory[1]))
              {
                terms2include  <- c(terms2include, compulsory)
              }
            formla <- reformulate(terms2include,                                  
                                  response = yvar
                                  )
            formlas <- c(formlas,formla)     
          }
      }
    return(formlas)
  }

The resulting list of candidate models are:

formlas <- combos(yvar = "deaths",
                  xvars = c("x1", "x2", "x3", "x4")
                  )
paste(formlas)

deaths ~ x1 + x2 + x3 + x4
deaths ~ x1 + x2 + x3
deaths ~ x1 + x2 + x4
deaths ~ x1 + x3 + x4
deaths ~ x2 + x3 + x4
deaths ~ x1 + x2
deaths ~ x1 + x3
deaths ~ x1 + x4
deaths ~ x2 + x3
deaths ~ x2 + x4
deaths ~ x3 + x4
deaths ~ x1
deaths ~ x2
deaths ~ x3
deaths ~ x4

2 Compulsory inclusions

In some instances you may want to include a variable in all models so the compulsory option can be used. For example in spatio-temporal models we could include a term for zone and time while assessing the mix of explanatory variables:

formlas <- combos(yvar = "deaths",
                  xvars = c("x1", "x2", "x3", "x4"),
                  compulsory = c("zone", "ns(time, df = 3)")
                  )
paste(formlas)

deaths ~ x1 + x2 + x3 + x4 + zone + ns(time, df = 3)
deaths ~ x1 + x2 + x3 + zone + ns(time, df = 3)
deaths ~ x1 + x2 + x4 + zone + ns(time, df = 3)
deaths ~ x1 + x3 + x4 + zone + ns(time, df = 3)
deaths ~ x2 + x3 + x4 + zone + ns(time, df = 3)
deaths ~ x1 + x2 + zone + ns(time, df = 3)
deaths ~ x1 + x3 + zone + ns(time, df = 3)
deaths ~ x1 + x4 + zone + ns(time, df = 3)
deaths ~ x2 + x3 + zone + ns(time, df = 3)
deaths ~ x2 + x4 + zone + ns(time, df = 3)
deaths ~ x3 + x4 + zone + ns(time, df = 3)
deaths ~ x1 + zone + ns(time, df = 3)
deaths ~ x2 + zone + ns(time, df = 3)
deaths ~ x3 + zone + ns(time, df = 3)
deaths ~ x4 + zone + ns(time, df = 3)

3 AIC vs BIC vs LRTests

Well now we get into a sort of philosophical debate on how to rank all these models. That'll have to wait for another day.

Posted in  spatial dependence


blog comments powered by Disqus