Disentangle Things by Ivan Hanigan

Welcome to my Open Notebook

This is an Open Notebook with Selected Content - Delayed. All content is licenced with CC-BY. Find out more Here.

Decisions to make when modelling interactions

This post focuses on decision making during statistical modelling. The case of investigating effect modification is used as an example. The analyst has several options available to them when constructing the model parametrisation for adjustment to explain modification effectively. Unlike confounding (in which the criterion of substantial magnitude change of the effect estimate when controlling for a third variable can be easily assessed), models including effect modification can be hard to interpret. Taking account of effect modification becomes increasingly important when modelling complex interactions.

If an effect moderator is present then the relationship between exposure and response varies between different levels of the moderator. An example is provided by a paper we published, in which the relationship between proximity to wetlands and Ross River virus infection was found to be different for towns and rural areas, where the ‘urban’ variable is the ‘effect moderator’.

There are three common ways for data analysts to address this question in statistical models:

multiple models for each group,
interaction terms or
re-parametrisation to explicitly depict interactions.

The first way is to split apart the dataset and conduct separate analyses of multiple groups. For example, one could run the regression in the urban zone and the rural zone seperately and see if there were different exposure response functions in the two models. This was the approach of our paper. This approach has the strength that it is simple to do and yeilds results that are easy to interpret. A limitation of this method is that by splitting the dataset one loses degrees of freedom and therefore statistical power.

The second approach (which removes that limitation of the first option) is to fit interaction terms. The example shown in Figure 1 is a multiplicative interaction model (Brambor 2006). This approach was not taken in the models reported in our paper, but can easily be implemented and shows that the function of distance was estimated to have different ‘slopes’ in each of the dichotomous urban groups.

The statistical method can be easily implemented in software by including a multiplicative term between two variables, however in practice the resulting post estimation regression outputs can be difficult to interpret. For example, say one wants to calculate the effect and standard error for exposure X on health outcome Y with the interaction of the effect modifier Z. The form of this model can be written as:

\[Y ~ \beta_{1}X + \beta_{2}Z + \beta_{3}XZ\]

where B1, B2 and B3 are the regression coefficients estimated and the term XZ is the interaction between exposure and effect modifier.

The difficulty for interpretation comes when using this method for calculating the marginal effect of X on Y and the conditional standard error. The specific method described in Brambor et al. is:

Calculate the coefficients and the variance-covariance matrix from the regression model
The marginal effect is $\beta_{1} + \beta_{3}XZ$ where Z is the level of the modifying factor (0 or 1 in the dichotomous effect modifier case)
The conditional standard error is:

\[\sqrt{var(\beta_{1}) + Z^2 var(\beta_{3}) + 2Zcov(\beta_{1}\beta_{3})}\]

Therefore the strengths of this approach is that it does not reduce degrees of freedom and is straightforward to specify the model in standard statistical software packages. The limitations are related to interpretation of the resulting coefficients for both the main effects and the marginal effects, and standard errors for these.

The third approach available to analysts makes it easier to access the resulting regression output. This method was employed in Paper 1 in the final modelling phase in which estimates were calculated for the different drought exposure-response funcitons in each of the sub-groups. In the terms of Figure \ref{fig:effectmod.png} it is simple to:

Calculate X1 = X * Z (i.e = exposure for condition is met, zero otherwise)
Calculate X0 = X * (1-Z) (i.e. = exposure for NON-condition, zero for condition is TRUE)
Instead of X, Z and XZ, fit X1, X0 and Z. This model also contains three parameters and captures the same interactions as it is the same model with a different parametrisation. The standard errors for the X1 and X0 are calculated directly from the regression.

This method is much easier to implement and interpret. This is also considerably more flexible than the other two approaches. A limitation remains for this method in that the pre-processing steps required are more complicated, and there are inherently more possibilities for the data analyst to make errors in writing their code as they make these changes to the analytical data.

# Demonstrating this with the data from the paper (available in table 2)
## the following code shows the different parametrisations for the effect modification by urban
## we show that the coeffs and se are equivalent but that the psuedo-R
## squared will be better when including all our data in stratified analysis

# model 0 effect in eastern
d_eastern
fit <- glm(cases~ buffer + offset(log(1+pops)),family='poisson', data=d_eastern )
summa <- summary(fit)
summa
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.82425    0.14451 -33.382  < 2e-16 ***
## buffer      -0.24702    0.07921  -3.119  0.00182 **

# model 1 effect in urban
fit1 <- glm(cases~ buffer + offset(log(1+pops)),family='poisson', data=d_urban )
summa <- summary(fit1)
summa
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.52363    0.33354 -16.561   <2e-16 ***
## buffer       0.03853    0.06352   0.607    0.544

# step 1, combine the urban and rural data
d_eastern$urban <- 0
d_urban$urban <- 1
dat2 <- rbind(d_eastern, d_urban)
str(dat2)
dat2

# model 2 is a multiplicative term
fit2 <- glm(cases ~ buffer * urban + offset(log(1+pops)), family = 'poisson', data = dat2)
summa <- summary(fit2)
summa
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -4.82425    0.14451 -33.382  < 2e-16 ***
## buffer       -0.24702    0.07921  -3.119  0.00182 **
## urban        -0.69938    0.36350  -1.924  0.05435 .
## buffer:urban  0.28555    0.10153   2.812  0.00492 **

# the coeff on buffer is for urban = 0 is main effect
# the coeff on buffer:urban is for urban = 1 is the marginal effect
b1 <- summa$coeff[2,1]
b3 <- summa$coeff[4,1]
b1 + b3
# 0.0385268
# but what about that p-value?  and the se?
str(fit2)
fit2_vcov <- vcov(fit2)
fit2_vcov
# now calculate the conditional standard error for the marginal effect of buffer for the value of the modifying variable (Z, urban =1)
varb1<-fit2_vcov[2,2]
varb3<-fit2_vcov[4,4]
covarb1b3<-fit2_vcov[2,4]
Z<-1
conditional_se <- sqrt(varb1+varb3*(Z^2)+2*Z*covarb1b3)
conditional_se



# model 3 is the re-parametrisation
dat2$buffer_urban <- dat2$buffer * dat2$urban
dat2$buffer_eastern <- dat2$buffer * (1-dat2$urban)

fit3 <- glm(cases ~ buffer_urban + buffer_eastern + urban + offset(log(1+pops)), family = 'poisson', data = dat2)
summa <- summary(fit3)
summa

## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -4.82425    0.14451 -33.382  < 2e-16 ***
## buffer_urban    0.03853    0.06352   0.607  0.54416
## buffer_eastern -0.24702    0.07921  -3.119  0.00182 **
## urban          -0.69938    0.36350  -1.924  0.05435 .

Posted in disentangle statistical modelling