Spatiotemporal multilevel modelling


Table of Contents

1 Introduction

1.1 Overview

In this notebook we will introduce what multi-level data and models are and why they are useful in epidemiological contexts. We begin with the generalized linear model (GLM) framework and then move to mixed-effects models. It is important to realise that these models are called different things in different disciplines: (generalized) linear mixed (effects) models (GLMM), hierarchical (generalized) linear models, etc. The terminology for the model parameters is equally diverse, usually including the terms random effects and fixed effects. So we focus on the key concepts.

We will cover what a random effect is and how it differs from a fixed effect. Example syntax in R (on a secure web portal we have configured) and in Stata will show how to get a handle on models that have random intercepts and, additionally, random slopes.

We will spend a little time talking about how to partition variation and obtain estimates of the random and fixed effects. An important element will be our discussion of the similarities between mixed-effects models and basic regression. There will also be a brief discussion of extending the regression to non-Gaussian responses (GLMERs).
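
As a first example, here is a minimal sketch (assuming the lme4 package and its built-in sleepstudy data; our workshop data differ) of fitting such a model and extracting the estimates:

# func
require(lme4)

# do
## intercepts that vary by group (Subject)
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

fixef(fit)    # intercept/coefficient common across groups ('fixed effects')
ranef(fit)    # group-specific departures from these ('random effects')
VarCorr(fit)  # variation partitioned between groups and residual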

We will discuss how to interpret these results in a population health context.

An online analysis server (RStudio in a web browser) and a data repository (GeoServer) are also provided.

2 Compare basic regression with mixed-effects modelling

2.1 Terminology

When discussing multilevel modelling in this workshop we will avoid the names 'fixed effects' and 'random effects'. Instead we use the more verbose phrases: intercepts/coefficients that are common across groups, and intercepts/coefficients that vary by group. This is because, as noted in the introduction to Gelman and Hill (2007) (recommended reading), these terms are used in imprecise and confusing ways that vary across disciplines. The same concern is raised by the rstanarm package developers, who state in the vignette 'Estimating Generalized Linear Models with Group-Specific Terms with rstanarm':

'Models with this structure are referred to by many names: multilevel models, (generalized) linear mixed (effects) models (GLMM), hierarchical (generalized) linear models, etc. The terminology for the model parameters is equally diverse… [and commonly used terms] are not only misleading but also defined differently across the various fields in which these models are applied.'

2.2 Introduce the linear model and lmer
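
A minimal sketch (again assuming lme4 and its built-in sleepstudy data) contrasting the basic linear model with its lmer counterparts:

# func
require(lme4)

# do
## basic regression: intercept and coefficient common across all groups
fit_lm <- lm(Reaction ~ Days, data = sleepstudy)

## intercepts that vary by group
fit_ri <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

## intercepts and coefficients that vary by group
fit_rs <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

summary(fit_rs)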

3 Intercepts and coefficients that vary by group

3.1 Ross River virus and urban vs rural mosquito habitat

See domodelcheckingglmer
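
The worked example is in the domodelcheckingglmer code referred to above. As a placeholder, here is a hypothetical sketch (the data and variable names are invented for illustration) of the kind of model this section concerns, a Poisson GLMER with intercepts that vary by trap site:

# func
require(lme4)

# do (hypothetical rrv_data: case counts by trap site with an
# urban/rural habitat indicator)
## fit <- glmer(cases ~ habitat + (1 | site),
##              data = rrv_data, family = poisson)
## summary(fit)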


4 Spatially structured timeseries

I will use the NMMAPSlite datasets for a simple example of what I describe as "Spatially Structured Timeseries", as opposed to "Spatio-Temporal", which I think more explicitly includes spatial structure in the model.

4.1 Outcome Data

4.1.1 Original Data

"
http://cran.r-project.org/web/packages/NMMAPSlite/index.html
Package ‘NMMAPSlite’ was removed from the CRAN repository.
Formerly available versions can be obtained from the archive.
Archived on 2013-05-11 at the request of the maintainer.

The Archived versions do not seem to work either.

That is such a shame, lucky I saved some of the data using the following code:
"
   
#### Code: get nmmaps data
# func
if(!require(NMMAPSlite)) install.packages('NMMAPSlite');require(NMMAPSlite)

######################################################
# load  
setwd('data')
initDB('data/NMMAPS') # this requires a connection to the web,
                      # so let's save local copies
setwd('..')
cities <- getMetaData('cities')
head(cities)
citieslist <- cities$cityname
# write out a few cities for access later
for(city_i in citieslist[sample(1:nrow(cities), 9)])
{
 city <- subset(cities, cityname == city_i)$city
 data <- readCity(city)
 write.table(data, file.path('data', paste(city_i, '.csv',sep='')),
 row.names = F, sep = ',')
}
# these are all tiny; get some big ones too
for(city_i in c('New York', 'Los Angeles', 'Madison', 'Boston'))
{
 city <- subset(cities, cityname == city_i)$city
 data <- readCity(city)
 write.table(data, file.path('data', paste(city_i, '.csv',sep='')),
 row.names = F, sep = ',')
}

4.1.2 Pooled Dataset

################################################################
# name:Pooled Dataset
setwd("~/projects/spatiotemporal-regression-models/NMMAPS-example")

flist <- dir("data", full.names=T)

flist <-    flist[which(basename(flist) %in%
                        c("Baton Rouge.csv",
                          "Los Angeles.csv",
                          "Tucson.csv",
                          "Denver.csv")
                        )
                  ]
flist
for(f_i in 1:length(flist))
    {
        #f_i <- 2
        fi <- flist[f_i]
        df <- read.csv(fi)
        df <- df[,c("city","date", "agecat",
                    "cvd", "resp", "tmax",
                    "tmin", "dptp")]
        # str(df)
        write.table(df,
                    "outcome.csv", sep = ",",
                    row.names = F, append = f_i > 1,
                    col.names = f_i ==1
                    )
    }

4.2 Exposure Data

4.3 Zones

4.3.1 Map Shapefile Code

################################################################
# name:zones
# func
setwd("~/Dropbox/projects/spatiotemporal-regression-models/NMMAPS-example")
#require(devtools)
#install_github("gisviz", "ivanhanigan")
require(gisviz)

# load
flist <- dir("data")
flist
# do    
## geocode
flist <- gsub(".csv", "", flist)    
flist_geo <- gGeoCode2(flist)
flist_geo

## plot
## png("images/nmmaps-eg-cities.png")
## plotMyMap(
##     flist_geo[,c("long","lat")],
##     xl = c(-130,-60), yl = c(25,50)
##     )
## text(flist_geo$long, flist_geo$lat, flist_geo$address, pos=3)
## dev.off()
   
# load city names
flist2 <- dir("data")
flist2
city_codes <- matrix(NA, nrow = 0, ncol = 2)
for(fi in 1:length(flist2))
    {
        # fi <- 1
        fname <- flist2[fi]
        print(fi); print(fname);
        df <- read.csv(
            file.path("data", fname),
            stringsAsFactors = F, nrows = 1)
        city_codes <- rbind(city_codes,
            c(gsub(".csv","",fname), df$city)
            )
    }
city_codes <- as.data.frame(city_codes)
names(city_codes) <- c("address","city")
city_codes 
flist_geo2 <- merge(flist_geo, city_codes, by = "address")
flist_geo2

## make shapefile
epsg <- make_EPSG()
prj_code <- epsg[grep("WGS 84$", epsg$note),]
prj_code
shp <- SpatialPointsDataFrame(cbind(flist_geo2$long,flist_geo2$lat),flist_geo2,
                              proj4string = CRS(
                                  epsg$prj4[which(epsg$code == prj_code$code)]
                                  )
                              )
writeOGR(shp, 'cities.shp', 'cities', driver='ESRI Shapefile')

4.3.2 Map Shapefile 2

4.3.3 Map Shapefile Output

NMMAPS-example/images/nmmaps-eg-cities.png

4.4 Population Data

4.4.1 Population data

  • A Code Skeleton
    The process for getting the population data from the websites is a bit of a pain, involving repeated copy-and-paste operations. To make sure it is recorded I set up a skeleton to paste into.
  • TODO better to use gsub() to remove the "," (see the sketch at the end of the skeleton below)
    ################################################################
    # name:population-data
    
    # func
    setwd("~/projects/spatiotemporal-regression-models")
    require(gisviz)
    
    # load
    ## citycodes
    shp <- readOGR("cities.shp", "cities")
    shp@data
    cityname <- "SELECTED CITY"
    
    # do
    # save url
    # xxx
    population_input <- read.table(textConnection(
    "age: population
    xxx
    "), sep = ":", header = TRUE)
    
    ## agecats: 10 five-year groups under 65, one 65to74 group,
    ## two groups (75 to 84, 85 and over) collapsed to 75p
    population_input$agecat <- c(rep("under65", 10),
                                 "65to74",
                                 rep("75p", 2)
                                 )
    
    citycode <- subset(shp@data, address == cityname,
                       select = city)
    
    population_input$city <- rep(
        as.character(citycode[1,1])        
        , 13
        )
    
    population_input
    if(exists("population"))
        {
            population <- rbind(population,
                                population_input
                                )        
        } else {
            population <- population_input
        }
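    
    ## Sketch for the TODO above: if the pasted population figures
    ## contain thousands separators, strip them before converting, e.g.
    ## population_input$population <-
    ##     as.numeric(gsub(",", "", population_input$population))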
    
    
  • Baton Rouge
    ################################################################
    # name:population-data
    # first city, remove population data.frame
    rm(population)
    
    # func
    setwd("~/projects/spatiotemporal-regression-models")
    require(gisviz)
    
    # load
    ## citycodes
    shp <- readOGR("cities.shp", "cities")
    shp@data
    cityname <- "Baton Rouge"
    
    # do
    # save url
    # http://www.city-data.com/us-cities/The-South/Baton-Rouge-Population-Profile.html
    population_input <- read.table(textConnection(
    "age: population
    Population Under 5 years: 15502
    Population 5 to 9 years: 15609
    Population 10 to 14 years: 15248
    Population 15 to 19 years: 21954
    Population 20 to 24 years: 27230
    Population 25 to 34 years: 31719
    Population 35 to 44 years: 30343
    Population 45 to 54 years: 27166
    Population 55 to 59 years: 9495
    Population 60 to 64 years: 7490
    Population 65 to 74 years: 13312
    Population 75 to 84 years: 9611
    Population 85 years and over: 3139
    "), sep = ":", header = TRUE)
    
    ## agecats
    ## 65to74     75p under65  
    population_input$agecat <- c(rep("under65", 10),
                                 rep("65to74"),
                                 rep("75p",2)
                                 )
    
    citycode <- subset(shp@data, address == cityname,
                       select = city)
    
    population_input$city <- rep(
        as.character(citycode[1,1])        
        , 13
        )
    
    population_input
    if(exists("population"))
        {
            population <- rbind(population,
                                population_input
                                )        
        } else {
            population <- population_input
        }
    
    
  • Los Angeles
    ################################################################
    # name:population-data
    
    # func
    #setwd("~/projects/spatiotemporal-regression-models")
    #require(gisviz)
    
    # load
    ## citycodes
    shp <- readOGR("cities.shp", "cities")
    shp@data
    cityname <- "Los Angeles"
    
    # do
    # save url
    # http://www.city-data.com/us-cities/The-West/Los-Angeles-Population-Profile.html
    population_input <- read.table(textConnection(
    "age: population
    Population under 5 years old: 285976
    Population 5 to 9 years old: 297837
    Population 10 to 14 years old: 255604
    Population 15 to 19 years old: 251632
    Population 20 to 24 years old: 299906
    Population 25 to 34 years old: 674098
    Population 35 to 44 years old: 584036
    Population 45 to 54 years old: 428974
    Population 55 to 59 years old: 143965
    Population 60 to 64 years old: 115663
    Population 65 to 74 years old: 18711
    Population 75 to 84 years old: 125829
    Population 85 years and over: 44189
    "), sep = ":", header = TRUE)
    
    ## agecats
    ## 65to74     75p under65  
    population_input$agecat <- c(rep("under65", 10),
                                 rep("65to74"),
                                 rep("75p",2)
                                 )
    
    citycode <- subset(shp@data, address == cityname,
                       select = city)
    
    population_input$city <- rep(
        as.character(citycode[1,1])        
        , 13
        )
    
    population_input
    if(exists("population"))
        {
            population <- rbind(population,
                                population_input
                                )        
        } else {
            population <- population_input
        }
    
    
  • Tucson
    ################################################################
    # name:population-data
    
    # func
    #setwd("~/projects/spatiotemporal-regression-models")
    #require(gisviz)
    
    # load
    ## citycodes
    shp <- readOGR("cities.shp", "cities")
    shp@data
    cityname <- "Tucson"
    
    # do
    # save url
    # http://www.city-data.com/us-cities/The-West/Tucson-Population-Profile.html
    population_input <- read.table(textConnection(
    "age: population
    Population under 5 years old: 35201
    Population 5 to 9 years old: 34189
    Population 10 to 14 years old: 31939
    Population 15 to 19 years old: 38170
    Population 20 to 24 years old: 47428
    Population 25 to 34 years old: 76394
    Population 35 to 44 years old: 72289
    Population 45 to 54 years old: 57608
    Population 55 to 59 years old: 19597
    Population 60 to 64 years old: 16056
    Population 65 to 74 years old: 29117
    Population 75 to 84 years old: 21394
    Population 85 years and older: 7317
    "), sep = ":", header = TRUE)
    
    ## agecats
    ## 65to74     75p under65  
    population_input$agecat <- c(rep("under65", 10),
                                 rep("65to74"),
                                 rep("75p",2)
                                 )
    
    citycode <- subset(shp@data, address == cityname,
                       select = city)
    
    population_input$city <- rep(
        as.character(citycode[1,1])        
        , 13
        )
    
    population_input
    if(exists("population"))
        {
            population <- rbind(population,
                                population_input
                                )        
        } else {
            population <- population_input
        }
    
    
  • Denver
    ################################################################
    # name:population-data
    
    # func
    #setwd("~/projects/spatiotemporal-regression-models")
    #require(gisviz)
    
    # load
    ## citycodes
    shp <- readOGR("cities.shp", "cities")
    shp@data
    cityname <- "Denver"
    
    # do
    # save url
    # http://www.city-data.com/us-cities/The-West/Denver-Population-Profile.html
    
    population_input <- read.table(textConnection(
    "age: population
    Population under 5 years old: 37769
    Population 5 to 9 years old: 34473
    Population 10 to 14 years old: 31315
    Population 15 to 19 years old: 32259
    Population 20 to 24 years old: 45534
    Population 25 to 34 years old: 113676
    Population 35 to 44 years old: 86420
    Population 45 to 54 years old: 71000
    Population 55 to 59 years old: 22573
    Population 60 to 64 years old: 17191
    Population 65 to 74 years old: 30643
    Population 75 to 84 years old: 23369
    Population 85 years and over: 8414
    "), sep = ":", header = TRUE)
    
    ## agecats
    ## 65to74     75p under65  
    population_input$agecat <- c(rep("under65", 10),
                                 rep("65to74"),
                                 rep("75p",2)
                                 )
    
    citycode <- subset(shp@data, address == cityname,
                       select = city)
    
    population_input$city <- rep(
        as.character(citycode[1,1])        
        , 13
        )
    
    population_input
    if(exists("population"))
        {
            population <- rbind(population,
                                population_input
                                )        
        } else {
            population <- population_input
        }
    
    
  • Louisville
    ################################################################
    # name:population-data
    
    # func
    setwd("~/projects/spatiotemporal-regression-models")
    require(gisviz)
    
    # load
    ## citycodes
    shp <- readOGR("cities.shp", "cities")
    shp@data
    
    cityname <- "Louisville"
    
    # do
    # save url
    #http://www.city-data.com/us-cities/The-South/Louisville-Population-Profile.html
    
    ## NB: error in the source table, 75 to 84 was missing; used the same value as Baton Rouge
    population_input <- read.table(textConnection(
      "age: population
      Population under 5 years old: 16926
      Population 5 to 9 years old: 17359
      Population 10 to 14 years old: 16627
      Population 15 to 19 years old: 17362
      Population 20 to 24 years old: 18923
      Population 25 to 34 years old: 37541
      Population 35 to 44 years old: 40354
      Population 45 to 54 years old: 33755
      Population 55 to 59 years old: 10716
      Population 60 to 64 years old: 9211
      Population 65 to 74 years old: 18577
      Population 75 to 84 years: 9611
      Population 85 years and older: 5075
      "), sep = ":", header = TRUE)
    
    ## agecats
    ## 65to74     75p under65  
    population_input$agecat <- c(rep("under65", 10),
                                 rep("65to74"),
                                 rep("75p",2)
                                 )
    
    citycode <- subset(shp@data, address == cityname,
                       select = city)
    
    population_input$city <- rep(
        as.character(citycode[1,1])        
        , 13
        )
    
    population_input
    if(exists("population"))
        {
            population <- rbind(population,
                                population_input
                                )        
        } else {
            population <- population_input
        }
    
    
  • Save Population Dataset
    ################################################################
    # name:save-population
    write.csv(population, "population.csv",
              row.names = FALSE
              )
    

4.4.2 Population Summary

################################################################
# name:Population Summary
# func
require(plyr)

# load
population <- read.csv("population.csv")
str(population)
population <- population[,c("city", "agecat", "population")]
# do
population_summary <- ddply(population,
                           .variables = c("city",
                               "agecat"),
                           .fun = summarise,
                           sum(population)
                           )

names(population_summary) <- c("city","agecat","pop")

write.csv(population_summary,
          "population_summary.csv",
          row.names = FALSE
          )

4.5 Merge

################################################################
# name:merge
# func
setwd("~/projects/spatiotemporal-regression-models/NMMAPS-example")
require(plyr) # provides arrange(), used below

# load
outcome <- read.csv("outcome.csv")
str(outcome)
outcome$date <- as.Date(outcome$date)

population <- read.csv("population_summary.csv")
str(population)
population

# do
analyte <- merge(outcome, population, by = c("city", "agecat"))
analyte <- arrange(analyte, city, date, agecat)
# check
subset(analyte, date == as.Date("1990-01-01"))

# save
write.csv(analyte, "analyte.csv", row.names = FALSE)

4.6 Data Checking

4.6.1 Check Assumption of Proportional Hazards

# Checking the proportional hazards assumption for Indirect Age Standardisation

# Background
## Indirect SMRs from different index/study populations are not strictly comparable
## because they are calculated using different weighting schemes that
## depend upon the age structures of the index/study populations
## (see http://www.statsdirect.com/webhelp/#rates/smr.htm).

## Indirect SMRs can be compared if you make the assumption that the
## ratio of rates between index and reference populations is constant;
## this is similar to the assumption of proportional hazards in Cox
## regression (Armitage and Berry, 1994).

## So we need to check whether the rate ratio of the study population(s)
## compared to the standard population varies substantially with age.
## If not, the proportional hazards assumption holds for the standard
## rates compared with the observed rates and the indirect SMRs are
## comparable.  To do this check we calculate the annualised age-specific
## rates for our study areas and for our standard for several years at
## periodic timepoints across the study period; if the ratio of these is
## stable at each timepoint we can reassure ourselves that this
## assumption holds.

## An additional issue arises when there are non-negligible differences
## in the age distributions of the study population(s) and the standard
## population. In this situation, indirect standardisation produces
## biased results due to residual confounding by age.

## Also see Australian Institute of Health and Welfare. (2011). Principles on
## the use of direct age-standardisation in administrative data
## collections For measuring the gap between Indigenous and
## non-Indigenous Australians. Data linkage series. Cat. no. CSI 12.
## Canberra: AIHW. Retrieved from
## http://www.aihw.gov.au/publication-detail/?id=10737420133
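
## For reference, under these assumptions an indirect SMR per city could
## be computed roughly as follows (a sketch, using the stdyByStnd table
## built below):
## ddply(stdyByStnd, "city", summarise,
##       smr = sum(observed.x) / sum(crude.rate.y * pop.x))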

################################################################
# func
setwd("~/projects/spatiotemporal-regression-models/NMMAPS-example")
require(plyr)

# load
analyte <- read.csv("analyte.csv")

# clean
head(analyte)
analyte$yy  <- substr(analyte$date, 1, 4)
  
# do
## first we want to see if the age specific rates vary across the
## study sites
disease_colname <- "cvd"
pop_colname <- "pop"
by_cols  <- c("city", "agecat", "yy")

stdysites  <- ddply(analyte, by_cols,
                    function(df) return(
                      c(observed = sum(df[,disease_colname]),
                        pop = mean(df[,pop_colname]),
                        crude.rate = sum(df[,disease_colname])/mean(df[,pop_colname])
                        )
                      )
                    )
## check
head(stdysites)
subset(stdysites, yy == 1987 & city == "batr")

## define the subset: we will use just the deaths and pop in 2000
stdy <- subset(stdysites, yy == 2000)
stdy

## now define the standard population as the entire country
standard  <- ddply(stdy, c("agecat"),
                     function(df) return(
                       c(observed = sum(df[,"observed"]),
                         pop = sum(df[,"pop"]),
                         crude.rate = sum(df[,"observed"])/
                           sum(df[,"pop"])
                         )
                       )
                    )
standard

## Merge the studysites and the standard
stdyByStnd  <- merge(stdy, standard, by = "agecat")
stdyByStnd

## plot the rate ratios
png("images/ratio-stdy-by-stnd.png")
mp <- barplot(stdyByStnd$crude.rate.x/stdyByStnd$crude.rate.y)
text(mp, par("usr")[3],
     labels = paste(stdyByStnd$agecat,stdyByStnd$city),
     srt = 45, adj = c(1.1,1.1), xpd = TRUE
     )
abline(1,0)
dev.off()

## we can see that LA has an issue with the 65to74 agecat

## Second we will check if ratio of the proportions in each population
## agecat vary between study sites and the standard
totals <- ddply(stdyByStnd, c("city"), summarise,
                sum(pop.x)
                )
totals <- merge(stdyByStnd[,c("city","agecat","pop.x")], totals)
totals$pop.wt  <- totals[,3] / totals[,4]
totals <- arrange(totals, city, agecat)
totals

totalsStnd <- ddply(stdyByStnd, c("agecat"), summarise,
                sum(pop.x)
                )
totalsStnd$totalPop <- sum(totalsStnd[,2])
totalsStnd$pop.wt.total <- totalsStnd[,2]/totalsStnd[,3]
totalsStnd

## merge these so we can look at the ratios
totals2 <- merge(totals, totalsStnd, by = "agecat")
totals2 <- arrange(totals2, agecat, city)

## now plot the ratios
png("images/ratio-stdy-by-stnd-pops.png")
mp <- barplot(totals2$pop.wt/totals2$pop.wt.total)
text(mp, par("usr")[3],
     labels = paste(totals2$agecat,totals2$city),
     srt = 45, adj = c(1.1,1.1), xpd = TRUE
     )
abline(1,0)
dev.off()

## and we can see that there are far fewer 65to74 aged persons in LA
## than expected.


4.6.2 Incidence Rate Ratios Plot

NMMAPS-example/images/ratio-stdy-by-stnd.png

4.6.3 Population Rate Ratios Plot

NMMAPS-example/images/ratio-stdy-by-stnd-pops.png

4.7 Exploratory Data Analyses

4.7.1 Time-series Plots Code

################################################################
# name:eda-tsplots
# func
setwd("~/projects/spatiotemporal-regression-models/NMMAPS-example")
  
# load
flist <- dir("data")
fname <- flist[7]; print(fname)
df <- read.csv(file.path("data", fname))

# clean
str(df)
summary(df$tmax); summary(df$dptp)

# do
## we will only consider cities with long periods of observed temperature and humidity
png("images/nmmaps-eg-dateranges.png", width = 1000, height=500, res = 150)
par(mfrow=c(6,5), mar=c(0,3,3,0), cex=.25)
for(fi in 1:length(flist))
    {
        #fi <- 4
        fname <- flist[fi]
        df <- read.csv(file.path("data", fname))
        print(fi); print(fname);
        with(df, plot(as.Date(date), tmax, type = "l"))
        title(paste(fname, "tmax"))
        with(df, plot(as.Date(date), dptp, type = "l"))
        title(paste(fname, "dptp"))
    }
dev.off()

4.7.2 Time-series Plots Output

NMMAPS-example/images/nmmaps-eg-dateranges.png

4.8 Main Analyses

4.8.1 Core Model

################################################################
# name:core
# func
setwd("~/projects/spatiotemporal-regression-models/NMMAPS-example")
require(mgcv)
require(splines)

# load
analyte <- read.csv("analyte.csv")

# clean
analyte$yy <- substr(analyte$date,1,4)
numYears<-length(names(table(analyte$yy)))
analyte$date <- as.Date(analyte$date)
analyte$time <- as.numeric(analyte$date)
analyte$agecat <- factor(analyte$agecat,
                          levels = c("under65",
                              "65to74", "75p"),
                          ordered = TRUE
                          )

# do
fit <- gam(cvd ~ s(tmax) + s(dptp) +
           city + agecat +
           s(time, k= 7*numYears, fx=T) +
           offset(log(pop)),
           data = analyte, family = poisson
           )

# plot of response functions
png("images/nmmaps-eg-core.png", width = 1000, height = 750, res = 150)
par(mfrow=c(2,3))
plot(fit, all.terms = TRUE)
dev.off()


4.8.2 Core Model Plots

NMMAPS-example/images/nmmaps-eg-core.png

4.8.3 Model Selection

   
# The following code is just a dump of material I have found useful in the past.
# It is all time-series methods.
# TODO replace the city-specific model with a pooled analysis with a city-specific trend (a sketch follows).
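
# A sketch for that TODO (assuming the pooled analyte data from the Merge
# step, with city coerced to a factor): mgcv factor-by smooths give each
# city its own trend within a single pooled model, e.g.
# analyte$city <- factor(analyte$city)
# fit_pooled <- gam(cvd ~ s(tmax) + s(dptp) + city + agecat +
#                   s(time, by = city, k = 7*numYears, fx = TRUE) +
#                   offset(log(pop)),
#                   data = analyte, family = poisson)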

# func
require(mgcv)
require(splines)
  
######################################################
# load  
dir("data")
city <- "Chicago"
data <- read.csv(sprintf("data/%s.csv", city), header=T)
str(data)
data$yy <- substr(data$date,1,4)
data$date <- as.Date(data$date)
######################################################
# check
par(mfrow=c(2,1), mar=c(4,4,3,1))
with(subset(data[,c(1,15:25)], agecat == '75p'),
  plot(date, tmax)
 )
with(subset(data[,c(1,4,15:25)], agecat == '75p'),
        plot(date, cvd, type ='l', col = 'grey')
        )
with(subset(data[,c(1,4,15:25)], agecat == '75p'),
        lines(lowess(date, cvd, f = 0.015))
        )
# I am worried about that outlier
data$date[which(data$cvd > 100)]
# [1] "1995-07-15" "1995-07-16"
 
######################################################
# do standard NMMAPS timeseries poisson GAM model
numYears<-length(names(table(data$yy)))
df <- subset(data, agecat == '75p')
df$time <- as.numeric(df$date)
fit <- gam(cvd ~ s(pm10tmean) + s(tmax) + s(dptp) + s(time, k= 7*numYears, fx=T), data = df, family = poisson)
# plot of response functions
par(mfrow=c(2,2))
plot(fit)
dev.off()
 
######################################################
# some diagnostics
summary(fit)
# note the R-sq.(adj) =   0.21
gam.check(fit)
# note the lack of a leverage plot.  for that we need glm
 
######################################################
# do same model as glm
fit2 <- glm(cvd ~ pm10tmean + ns(tmax, df = 8) + ns(dptp, df = 4) + ns(time, df = 7*numYears), data = df, family = poisson)
# plot responses
par(mfrow=c(2,2))
termplot(fit2, se =T)
dev.off()
 
# plot prediction
df$predictedCvd <- predict(fit2, df, 'response')
# baseline is given by the intercept
fit3 <- glm(cvd ~ 1, data = df, family = poisson)
df$baseline <-  predict(fit3, df, 'response')
with(subset(df, date>=as.Date('1995-01-01') & date <= as.Date('1995-07-31')),
 plot(date, cvd, type ='l', col = 'grey')
        )
with(subset(df, date>=as.Date('1995-01-01') & date <= as.Date('1995-07-31')),
        lines(date,predictedCvd)
        )
with(subset(df, date>=as.Date('1995-01-01') & date <= as.Date('1995-07-31')),
 lines(date,baseline)
        )
######################################################
# some diagnostics
# need to load a function to calculate poisson adjusted R squared
# original S code from
# The formula for pseudo-R^2 is taken from G. S. Maddala,
# Limited-Dependent and Qualitative Variables in Econometrics, Cambridge: Cambridge Univ. Press, 1983, page 40, equation 2.50.
RsquaredGlm <- function(o) {
 n <- length(o$residuals)
 ll <- logLik(o)[1]
 ll_0 <- logLik(update(o,~1))[1]
 R2 <- (1 - exp(((-2*ll) - (-2*ll_0))/n))/(1 - exp( - (-2*ll_0)/n))
 names(R2) <- 'pseudo.Rsquared'
 R2
 }
RsquaredGlm(fit2)
# 0.51
# the difference is presumably due to the arguments about how to account for unexplainable variance in the poisson distribution?
 
# significance of spline terms
drop1(fit2, test='Chisq')
# also note the AIC: the best model includes all of these terms
# BIC can be computed instead (but still labelled AIC) using
drop1(fit2, test='Chisq', k = log(nrow(df))) # n should match the data used in the fit
 
# diagnostic plots
par(mfrow=c(2,2))
plot(fit2)
dev.off()
# note high leverage plus residuals points are labelled
# leverage doesn't seem to be too high though which is good
# NB the numbers refer to the row.names attribute which still refer to the original dataset, not this subset
df[row.names(df) %in% c(9354,9356),]$date
# as suspected [1] "1995-07-15" "1995-07-16"
 
######################################################
# so lets re run without these obs
df2 <- df[!row.names(df) %in% c(9354,9356),]
# to avoid duplicating code just re-run fit2, replacing data=df with df2
# tmax still significant but not so extreme
# check diagnostic plots again
par(mfrow=c(2,2))
plot(fit2)
dev.off()
# looks like a well behaved model now.
 
# if we were still worried about any high leverage values we could identify these with
df3 <- na.omit(df2[,c('cvd','pm10tmean','tmax','dptp','time')])
df3$hatvalue <- hatvalues(fit2)
df3$res <- residuals(fit2, 'pearson')
with(df3, plot(hatvalue, res))
# this is the same as the fourth default glm diagnostic plot, whose x-axis is labelled leverage
summary(df3$hatvalue)
# gives us an idea of the distribution of hat values
# decide on a threshold and look at it
hatThreshold <- 0.1
with(subset(df3, hatvalue > hatThreshold), points(hatvalue, res, col = 'red', pch = 16))
abline(0,0)
segments(hatThreshold,-2,hatThreshold,15)
dev.off()
 
fit3 <- glm(cvd ~ pm10tmean + ns(tmax, df = 8) + ns(dptp, df = 4) + ns(time, df = 7*numYears), data = subset(df3, hatvalue < hatThreshold), family = poisson)
par(mfrow=c(2,2))
termplot(fit3, se = T)
# same same
plot(fit3)
# no better
 
# or we could go nuts with a whole number of ways of estimating influence
# check all influential observations
infl <- influence.measures(fit2)
# which observations 'are' influential
inflk <- which(apply(infl$is.inf, 1, any))
length(inflk)
 
 
######################################################
# now what about serial autocorrelation in the residuals?
 
par(mfrow = c(2,1))
with(df3, acf(res))
with(df3, pacf(res))
dev.off()
 
 
 
######################################################
# just check for overdispersion
fit <- gam(cvd ~ s(pm10tmean) + s(tmax) + s(dptp) + s(time, k= 7*numYears, fx=T), data = df, family = quasipoisson)
summary(fit)
# note the Scale est. = 1.1627
# alternatively check the glm
fit2 <- glm(cvd ~ pm10tmean + ns(tmax, df = 8) + ns(dptp, df = 4) + ns(time, df = 7*numYears), data = df, family = quasipoisson)
summary(fit2)
# (Dispersion parameter for quasipoisson family taken to be 1.222640)
# this is probably near enough to support a standard poisson model...
 
# if we have overdispersion we can use QAIC (a quasi- model does not have a likelihood and so does not have an AIC, by definition)
# we can use the poisson model and calculate the overdispersion
fit2 <- glm(cvd ~ pm10tmean + ns(tmax, df = 8) + ns(dptp, df = 4) + ns(time, df = 7*numYears), data = df, family = poisson)
1- pchisq(deviance(fit2), df.residual(fit2))
 
# QAIC: c_hat is the variance inflation factor, the ratio of the residual deviance
# of the global (most complicated) model to the residual degrees of freedom
c_hat <- deviance(fit2)/df.residual(fit2)
QAIC.1 <- -2*logLik(fit2)/c_hat + 2*(length(coef(fit2)) + 1)
QAIC.1
 
# Actually let's use QAICc, which is more conservative about parameters
k <- length(coef(fit2)) + 1  # number of coefficients plus the dispersion
n <- nrow(na.omit(df[,c('cvd','pm10tmean','tmax','dptp','time')]))
QAICc.1 <- -2*logLik(fit2)/c_hat + 2*k + 2*k*(k + 1)/(n - k - 1)
QAICc.1
 
 
######################################################
# the following is old work, some may be interesting
# such as the use of sinusoidal wave instead of smooth function of time
 
 
# # sine wave
# timevar <- as.data.frame(names(table(df$date)))
# index <- 1:length(names(table(df$date)))
# timevar$time2 <- index / (length(index) / (length(index)/365.25))
# names(timevar) <- c('date','timevar')
# timevar$date <- as.Date(timevar$date)
# df <- merge(df,timevar)
 
# fit <- gam(cvd ~ s(tmax) + s(dptp) + sin(timevar * 2 * pi) + cos(timevar * 2 * pi) + ns(time, df = numYears), data = df, family = poisson)
# summary(fit)
# par(mfrow=c(3,2))
# plot(fit, all.terms = T)
# dev.off()
 
# # now just explore the season fit
# fit <- gam(cvd ~ sin(timevar * 2 * pi) + cos(timevar * 2 * pi) + ns(time, df = numYears), data = df, family = poisson)
# yhat <- predict(fit)
# head(yhat)
 
# with(df, plot(date,cvd,type = 'l',col='grey', ylim = c(15,55)))
# lines(df[,'date'],exp(yhat),col='red')
 
 
# # drop1(fit, test= 'Chisq')
 
# # drop1 only works in glm?
# # fit with weather variables, use degrees of freedom estimated by gam
# fit <- glm(cvd ~ ns(tmax,8) + ns(dptp,2) + sin(timevar * 2 * pi) + cos(timevar * 2 * pi) + ns(time, df = numYears), data = df, family = poisson)
# drop1(fit, test= 'Chisq')
# # use plot.glm for diagnostics
# par(mfrow=c(2,2))
# plot(fit)
# par(mfrow=c(3,2))
# termplot(fit, se=T)
# dev.off()
 
# # cyclic spline, overlay on prior sinusoidal
# with(df, plot(date,cvd,type = 'l',col='grey', ylim = c(0,55)))
# lines(df[,'date'],exp(yhat),col='red')
 
# df$daynum <- as.numeric(format(df$date, "%j"))
# df[360:370,c('date','daynum')]
# fit <- gam(cvd ~ s(daynum, k=3, fx=T, bs = 'cp') +  s(time, k = numYears, fx = T), data = df, family = poisson)
# yhat2 <- predict(fit)
# head(yhat2)
 
# lines(df[,'date'],exp(yhat2),col='blue')
 
 
# par(mfrow=c(1,2))
# plot(fit)
 
 
# # fit weather with season
# fit <- gam(cvd ~ s(tmax) + s(dptp) +
  # s(daynum, k=3, fx=T, bs = 'cp') +  s(time, k = numYears, fx = T), data = df, family = poisson)
# par(mfrow=c(2,2))
# plot(fit)
 
# summary(fit)
  

5 Spatial lag and timeseries

I will use the same NMMAPSlite data to show how I'd approach a simple "Spatio-Temporal" model.

  • TODO The following is a stub of an idea, for further development.

5.1 Data

5.1.1 Zones

  • Spatial Neighbours Code
    ################################################################
    # name:spatwat
    # func
    setwd("~/projects/spatiotemporal-regression-models/NMMAPS-example")
    require(gisviz)
    
    # load
    dir()
    shp <- readOGR("cities.shp", "cities")
    
    # clean
    head(shp@data)
    
    # do
    ## I will use nearest neighbours within a distance threshold, but
    ## usually polygon datasets would use poly2nb
    nb <- dnearneigh(shp, d1 = 1, d2 = 1000)
    head(nb)
    shp[[1]][1]
    shp[[1]][nb[[1]]]
    # map
    nudge <- 10
    png("images/nmmaps-eg-neighbourhood.png")
    plotMyMap(
        shp@coords, xl = c(min(shp@coords[,1])-nudge, max(shp@coords[,1])+nudge),
        yl = c(min(shp@coords[,2])-nudge, max(shp@coords[,2])+nudge)
        )
    plot(nb, shp@coords, add=TRUE)
    text(shp@data$long, shp@data$lat, shp@data$address, pos=3)
    points(shp@coords[nb[[1]],], col = 'green', pch = 16)
    points(shp@data[1,c("long","lat")],col = 'blue', pch = 16)
    dev.off()
    
    
    
  • Spatial Neighbours Output
    NMMAPS-example/images/nmmaps-eg-neighbourhood.png

5.1.2 Outcome

#### Now do the Neighbourhood autoregressive lagged variable  ####
# func
setwd("~/projects/spatiotemporal-regression-models/NMMAPS-example")
require(gisviz)
require(plyr)

# load
analyte <- read.csv("analyte.csv")
shp <- readOGR("cities.shp", "cities")

# clean
analyte$yy <- substr(analyte$date,1,4)
numYears<-length(names(table(analyte$yy)))
analyte$date <- as.Date(analyte$date)
analyte$time <- as.numeric(analyte$date)
analyte$agecat <- factor(analyte$agecat,
                          levels = c("under65",
                              "65to74", "75p"),
                          ordered = TRUE
                          )
 
names(analyte)
analyte[1,]
table(analyte$city)
# do
study <- analyte[, c("city","date", "agecat", "cvd", "pop")]
subset(study, city == "tucs" & date == as.Date("1991-01-01"))

nb <- dnearneigh(shp, d1 = 1, d2 = 1000)                      
head(nb)
head(shp@data)
adj <- adjacency_df(NB = nb, shp = shp, zone_id = 'city')
subset(adj, V1 == "tucs")

neighbours <- merge(study, adj, by.x = "city", by.y = "V1")
subset(neighbours, city == "tucs" & date == as.Date("1991-01-01"))

xvars <- c("V2", "date","agecat")
yvars <- c("city", "date", "agecat")
neighbours <- merge(neighbours[,c(xvars, "city")],
                    analyte[,c(yvars, "cvd", "pop")],
                    by.x = xvars,
                    by.y = yvars)
names(neighbours)
subset(neighbours, city == "tucs" & date == as.Date("1991-01-01"))

neighbours$asr  <- (neighbours$cvd / neighbours$pop) * 1000


neighbours2  <- ddply(neighbours, c("city", "date", "agecat"), summarise,
                     NeighboursYij  = mean(asr)
                     )
subset(neighbours2, city == "tucs" & date == as.Date("1991-01-01"))
table(neighbours2$city)
head(analyte)

analyte  <- merge(analyte, neighbours2, by = c("city", "agecat", "date"))
analyte <- arrange(analyte, city, date, agecat)
head(analyte)


5.2 Analysis

5.2.1 Model with Spatial Lag

# func
require(splines)

# load
# assumes the prior code chunks have been run

# do
fit <- glm(cvd ~ ns(tmax, df = 3) + ns(dptp, df = 3) +
           NeighboursYij + 
           city + agecat +
           ns(time, df = 7*numYears) +
           offset(log(pop)),
           data = analyte, family = poisson
           )

# plot of response functions
png("images/nmmaps-eg-sp-lag.png", width = 1000, height = 750, res = 150)
par(mfrow=c(2,3))
termplot(fit, terms = attr(terms(fit),'term.labels') [c(1:2,6)], se = TRUE, ylim =c(-.2,.2), col.term = 'black', col.se = 'black')
termplot(fit, terms = attr(terms(fit),'term.labels') [c(4,5)], se = TRUE, ylim =c(-3,3), col.term = 'black', col.se = 'black')
termplot(fit, terms = attr(terms(fit),'term.labels') [3], se = TRUE, ylim =c(-1,1), col.term = 'black', col.se = 'black')
dev.off()


5.2.2 Spatial Lag Model Plots

NMMAPS-example/images/nmmaps-eg-sp-lag.png