Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licensed under CC-BY. Find out more here.

ONS-SCD.png

reflections-bob-haining-update

1 Update on reflections from Bob Haining's Lecture

Earlier this year Prof Bob Haining from the Geography Department at Cambridge visited and gave us a great lecture on spatial regression.

This Tuesday at the GIS Forum we were lucky to be joined by statistician Phil Kokic from CSIRO who had heard we'd be discussing spatial autocorrelation (Phil is my PhD supervisor). Here are some quick notes I made:

1.1 CART tree analysis that addresses the (potential) spatial autocorrelation problem

We started off the discussion with an assessment of the approach described in this post Classification Trees and Spatial Autocorrelation.

I've been thinking more and more about decision tree/CART/random forest methods for selecting a subset of relevant variables (and interactions) for use in GLM or GAM model construction. In a perfect world I'd have data on the main predictor I wanted to model and enough data about all the other relevant predictors (especially confounding or modifying variables) to ensure I get a 'well behaved' model. But with so much data around, and so many potentially plausible relationships one might choose to include, we need a way to narrow these down to just the most important covariates, confounders and interactions. CART, or some variation on it, seems a good way to do this, but it is prone to the potential problem of spatially correlated errors too.
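As a minimal sketch of what I mean (the data frame dat, the outcome y and the predictors x1 and x2 are hypothetical placeholders, and this assumes a continuous outcome), one could use a random forest's variable importance to shortlist covariates before building the GLM:

library(randomForest)  # variable importance for screening candidate predictors

# Hypothetical data frame 'dat' with outcome 'y' and many candidate predictors
rf <- randomForest(y ~ ., data = dat, importance = TRUE)
imp <- importance(rf)
head(imp[order(-imp[, "%IncMSE"]), ])  # top candidates by permutation importance

# Then fit the GLM using only the shortlisted covariates and interactions
glm1 <- glm(y ~ x1 + x2 + x1:x2, data = dat, family = gaussian)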

The idea from that blog post is:

"compute the classification tree, calculate residuals and use it for a Mantel-test and Mantel correlograms. The Mantel correlograms test differrences in dissimilarities of the residuals across several spatial distances and thus enable you to detect lag-distances where possible spatial autocorrelation vanishes. …If encounter autocorrelation… try to use subsamples of the data avoiding resampling within the lag-distance.."

I think the workflow would be:

  • fit the classification tree (Question: is it best to use all the data, or a sample, e.g. via cross-validation?)
  • get the residuals and visually assess the lagged-distance plot provided by the Mantel correlogram, then decide on a threshold (Question: is there an objective way to do this?)
  • sample from the data, keeping only one observation from each pair closer than the threshold (we have to keep one of each close pair, or else we'd only get data from the sparsely sampled parts of our study region); a rough sketch in R follows this list
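Here is that sketch, using rpart for the tree and vegan for the Mantel correlogram; the data frame dat, the response y, the predictors x1 and x2, the coordinate columns lon and lat, and the threshold value are all hypothetical placeholders.

library(rpart)  # classification and regression trees
library(vegan)  # mantel.correlog()

# 1. Fit the tree and extract residuals
fit <- rpart(y ~ x1 + x2, data = dat)
res <- dat$y - predict(fit)

# 2. Mantel correlogram: residual dissimilarity versus geographic distance
d.res <- dist(res)
d.geo <- dist(dat[, c("lon", "lat")])
mc <- mantel.correlog(d.res, d.geo)
plot(mc)  # read off a lag-distance threshold from this plot

# 3. Thin the data so no retained pair is closer than the threshold
threshold <- 10  # in the units of d.geo, chosen from the correlogram
dmat <- as.matrix(d.geo)
keep <- rep(TRUE, nrow(dat))
for (i in seq_len(nrow(dat))) {
  if (keep[i]) {
    keep[dmat[i, ] < threshold & seq_len(nrow(dat)) > i] <- FALSE
  }
}
dat.thin <- dat[keep, ]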

We all agreed this sounded OK, but it only avoids the problem of spatial autocorrelation (and loses data).

1.2 Modeling with control for spatial autocorrelation

So we all agreed we'd prefer a model that can control for spatial autocorrelation. I confessed that I'd always found the GeoBUGS tutorial, and other tutorials about Bayesian methods for this, very difficult, and that I would really like a "simple" way to make the problem go away. So first we briefly reviewed Prof Haining's three equations again:

NOTE: THE FOLLOWING IDEAS WORK BEST FOR AREAL DATA.

1.3 The Spatial Error Model

\(Y_{i} = \beta_{0} + \beta_{1} X_{1i} + \eta_{i}\)

Where:

\(\eta_{i}\) is a spatially autocorrelated error term.

1.4 The Spatial Lag Model

\(Y_{i} = \beta_{0} + \beta_{1} X_{1i} + \rho \sum_{j} w_{ij} Y_{j} + e_{i}\)

Where:

\(\rho \sum_{j} w_{ij} Y_{j}\) is an additional explanatory variable: the value of the dependent variable in neighbouring areas (with the weights \(w_{ij}\) defining the neighbourhood structure).

1.5 Spatially Lagged Independent Variable(s)

\(Y_{i} = \beta_{0} + \beta_{1} X_{1i} + \beta_{2L} \sum_{j} w_{ij} X_{2j} + e_{i}\)

Where:

\(\beta_{2L} \sum_{j} w_{ij} X_{2j}\) is the spatially lagged independent variable: the value of \(X_{2}\) in neighbouring areas.
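For a concrete (non-Bayesian) illustration, here is a minimal sketch of fitting these three models to areal data with the spdep and spatialreg packages; the polygons object shp and the variables y and x1 are hypothetical, and this is just one possible route rather than the spBayes/spTimer approach discussed below.

library(spdep)       # neighbour lists, spatial weights, Moran tests
library(spatialreg)  # errorsarlm(), lagsarlm()

nb <- poly2nb(shp)               # first-order contiguity neighbours
lw <- nb2listw(nb, style = "W")  # row-standardised weights w_ij

# Spatial error model
m.err <- errorsarlm(y ~ x1, data = shp, listw = lw)

# Spatial lag model
m.lag <- lagsarlm(y ~ x1, data = shp, listw = lw)

# Spatially lagged independent variable
shp$x1.lag <- lag.listw(lw, shp$x1)
m.slx <- lm(y ~ x1 + x1.lag, data = shp)

# Check for residual spatial autocorrelation, as in the CART workflow above
moran.test(residuals(m.lag), lw)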

1.6 Discussion

  • Phil agreed with Bob that the spatial error model is the best, the spatial lag model is OK, and spatially lagged covariates are not so great.
  • For spatial error model fitting Phil suggested looking at R packages spBayes and spTimer.
  • I pointed out that I am mostly interested in "spatially structured time-series models" rather than spatial models at a single point in time. By this I mean that we have several neighbouring areal units observed over a period of time. In this framework the general methods of time-series modelling are used to control for temporal autocorrelation. However, this makes the spatial error and spatial lag approaches tricky, because the spatial autocorrelation needs to be assessed at many points in time.
  • I asked: if spatial lag is OK (and it seems easier to fit into the time-series model framework), how can I check whether it has done the trick? If this were purely a spatial model we could check for spatial autocorrelation in the residuals, just as described in the CART blog above, but here we have many maps we could make (one for every time point), and our spatial autocorrelation measure would surely vary a lot over time. So would a simple way just be to assess the effect on the standard error of \(\beta_{1}\) (our primary interest): if it is bigger but still significant, can we be reassured that our result isn't affected? Or perhaps we should assess the beta on the lagged variable; for instance, is a significant p-value on the lagged beta an indication that it is capturing the unmeasured spatial associations represented by the neighbourhood variable?
  • If it hadn't done the trick, Nerida pointed out, this might be because the neighbourhoods are not appropriately represented by the first-order neighbours, and therefore more neighbours could be included, like moving out through several concentric circles to wider and wider neighbourhoods.
  • Nasser and Phil pointed out that the lagged variable (the outcome in the neighbours) includes an element of the exposure variables, and said that it would be difficult to 'unpack' what that part of the model meant.
  • So it looks like there is no simple answer, and the spatial error model is still preferred.


Posted in  spatial dependence


workflow-flowcharts-update

Workflow flowcharts - Update

A while back I posted about my work with the Rgraphviz toolbox toward a wrapper function that will allow me to track the connections between chunks of my code as I write it. This update includes notes from a discussion I had with Keith about this.

The basic use case is described by this code:

require(disentangle)
# Decide on the first step and create the starting node of the flowchart
nodes <- newnode(name = "NAME", inputs = "INPUT", newgraph = TRUE)
# First step.
# Comment: do a bit of work
NAME <- myWorkFunction(INPUT)
# Decide on the next step
nodes <- newnode(name = "OUTPUT", inputs = c("NAME", "ANOTHER_THING"))
# Do the second step
OUTPUT <- mySecondFunction(NAME, ANOTHER_THING)

workflow-flowchart1.png

As the scripted workflow develops, so does the flowchart. The alternative I described previously is to make a list of all the steps in a Balzerian Filelist Table.

Keith’s comments:

Keith gave me some great comments on that post. I’ll just jot them down here for now but will have to come back and re-write parts of my function and the descriptive blog post to address these at another time.

First section:

  • I said that multiple datasets used in an analysis could be broken down into discrete data packages. Keith asked if I thought they “should be”.
  • Keith questioned whether the data warehouse at my work really is "Big Data".
  • I need to cite the graphviz and Rgraphviz software better.
  • The function currently requires users to specify newgraph = TRUE to start, and fails to do anything if the "nodes" object doesn't exist. This is unnecessary; it should create "nodes" if it doesn't exist, so that the newgraph argument is only needed to delete an old graph and start again (see the sketch after this list).
  • Is it required that any of these actually be new? If not, "newnode" seems a poor name for the function.
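To make the newgraph point concrete, here is a toy sketch of the guard logic only; it is not the real newnode() internals from disentangle, just an illustration of creating the "nodes" object automatically when it is missing.

# Toy illustration only, not the disentangle implementation:
# build "nodes" if it is missing, so newgraph is only needed
# to wipe an existing graph and start again.
newnode_guard <- function(name, inputs = NULL, newgraph = FALSE) {
  if (newgraph || !exists("nodes", envir = .GlobalEnv)) {
    nodes <- data.frame(name = character(), inputs = character(),
                        stringsAsFactors = FALSE)
  } else {
    nodes <- get("nodes", envir = .GlobalEnv)
  }
  rbind(nodes, data.frame(name = name,
                          inputs = paste(inputs, collapse = ", "),
                          stringsAsFactors = FALSE))
}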

From section “Code:adding nodes”:

  • The line where output is the name is confusing. Keith felt that output was not the name of the new node (but it really is).
  • The name of the nodes object is hardcoded to be "nodes", so it doesn't need to be given as an argument.
  • Keith proposed better syntax: newnode(name = "ANOTHER THING", output = "OUTPUT"), which I think will also work?
  • Wouldn't it be great if links could be labelled? Maybe better to focus on links rather than nodes? This seems more logical because links are transitions and where the action is, not at the nodes.
  • It might be helpful to review common actions to see what graph structures each implies.
  • Re lab-books from chemistry: most people would not be familiar with how lab-books are written.
  • Can newnode build circuits/cycles? Can it add links between existing nodes?

Section after "shows this result":

  • tablexyz -> FileA -> Input1 should be one step because FileA is not used again; ditto FileB.
  • analysisResults is a different kind of thing; this is confusing.
  • Why do I have to type "nodes <-" every time? If it has to have this name it could be hidden.

Under example two, "Step three":

  • Huh? Does Input1 have "id"?
  • The "result" graph is a bit dense and hard to follow. Perhaps simpler labels? i.e. instead of FileA just "A"?

Posted in  research methods


introduction-to-ons-theory-and-practice

Introduction to ONS theory and practice

The reasons why ONS is so appealing to me can be broken into two parts. The first is about the problems it solves; the second is about the benefits it might bring.

The Problems

  • Identifying errors (either from miscalculations or from methodological mistakes)
  • Uncovering fraud

The Benefits

  • Sharing interests and skills
  • Quickly finding out about new discoveries and failures

These benefits come from the enhanced potential for theoretical discussions and sharing ideas. This is especially valuable around difficult theories, unknown issues or esoteric theories that are known only to a specialist in a field.

Posted in  research methods


using-orgmode-and-jekyll-for-open-notebook

Using Orgmode and Jekyll for Open Notebook

Orgmode is a great notebook tool because it allows coding, evaluation and documentation all in one place. I also want to use it to send the documentation to my blog as an Open Notebook.

If starting again I’d look into this:

But as it is, I already put a lot of work into configuring a Jekyll blog I cloned from Scott Chamberlain over at ROpenSci, and I will just use Orgmode to publish the posts related to each project, tagged as 'categories'.

But here is a problem I just found out how to solve. For a long time I thought that, because GitHub disabled Ruby plugins, the automatic generation of category index pages was broken. Luckily Charlie Park has written up the following solution, and it seems to have worked for me today:

Cheers!

Posted in  research methods


toward-a-unified-ecology-dataset

Ecology datasets

These are my notes from a meeting that Kathryn and I had with a group of Ecologists at the ANU (primarily Luciana Porforio and Nasreen Khan). We asked them to discuss how they search for and use ecology datasets, especially how best to package up the parts of an ecological field data collection (i.e. weather, vegetation, biodiversity, soils, topography, etc.).

Lu started off the discussion by stating that the most important thing to acknowledge is that every ecologist will start off with a main research question and then search for data that will address their specific research question. It is difficult to work from a ‘top-down’ perspective that hopes to pre-empt the range of possible questions. Lu felt that it may therefore be best to just keep all the data together in the biggest bundle that is possible and the end user can pick it apart once downloaded.

We explained that LTERN datasets can be quite expansive with many dimensions and it seemed preferable to at least untangle the main ‘themes’ for packaging up.

Nasreen pointed out that there is always a protocol for how data are collected and this should give the data collection its structure. However, I felt that ecology collections are so diverse that they have been made (by necessity) very flexible and specific to the needs of the individual plot network. Therefore generalisations across data collections are very hard to make (apart from easy things like "weather" or "aboveground dead biomass").

Toward a Unified Ecology

I always fall back on the textbook "Toward a Unified Ecology" (Timothy F. H. Allen and Thomas W. Hoekstra, 1992). I wondered if it could guide us. On pages 42-53 they describe the following framework and use the image below (the letters in the middle disc correspond to the criteria, i.e. O = Organism). In this framework it is possible to summarise ANY ecological study, as they ALWAYS incorporate these scale-independent criteria:

  • Organism: genetic integrity, discrete body, autonomy from other organisms
  • Population: relative similarity within the group
  • Community: inter-species competition, interference, mutualism
  • Ecosystem: biotic and abiotic interactions
  • Landscape: spatial structure/contiguity
  • Biome: characteristic physiognomy, disturbance and climate

I thought that if these dimensions were identified in a data collection first, then they might become the discrete packages by which each plot network publishes its collection.

datadoco-layercake.png

AEKOS

Over at AEKOS they have a similar conceptual framework:

"Observations can range from that of individual organisms and interactions, through to populations, communities, ecosystems and across broad global landscapes."

Conclusions

This is an open issue. More internal discussions with the Data Custodians are needed.

Lu also pointed out that the end users are the key stakeholders, and that perhaps more input from them (via surveys and workshops?) is needed.

Posted in  Data Documentation