########################
# GETTING STARTED WITH R
########################
# If necessary, download and install R from http://cran.r-project.org.
# Start R. If you have never installed the lme4 package, or need to update it,
# run the following command:
install.packages("lme4", dependencies = T)
# To run commands from a script like this, you can either cut-and-paste into the
# R Console window, or with your cursor on the desired line (or having highlighted
# multiple lines), press Ctrl-R (Windows) or Command-Enter (Mac).
# Each time you start R, you have to load (not install) the packages you want to use.
# This is done with the library() function. So to load lme4 (mixed-effects models):
library(lme4)
#############################################
# PART ONE: DATA VISUALIZATION AND REGRESSION
#############################################
# Fisher's iris data set (http://en.wikipedia.org/wiki/Iris_flower_data_set) is one
# of the "famous" data sets that are included with R. To load this data, you use
# the data() function.
data(iris)
# This command creates an object called "iris". To check, list the workspace:
ls()
class(iris) # data.frame
# The "iris" object is a data frame, which is the format that all data should be in
# to carry out regression analysis. Data frames have rows and columns. Each row is
# one observation, and each column represents a different variable. The difference
# between response and predictor variables is not indicated. The head() command is
# a good way to see the structure of the data. It lists the first six rows.
head(iris)
# We can make a few plots to see the relationship between variables. It is possible
# to customize almost everything about an R plot, but this is not always easy.
# We will stick to very simple plots. Many people use the ggplot2 package for plots.
plot(Sepal.Width ~ Sepal.Length, data = iris)
# This plot would be more informative if we colored the symbols by species...
plot(Sepal.Width ~ Sepal.Length, col = Species, data = iris)
# ...and added a primitive legend.
text(6.75, c(4.0, 4.1, 4.2), paste("Iris", levels(iris$Species)), col = 1:3)
# Two species overlap, while one is distinct. Let's try the same thing for petals.
plot(Petal.Width ~ Petal.Length, col = Species, data = iris)
text(2, c(2.1, 2.2, 2.3), paste("Iris", levels(iris$Species)), col = 1:3)
# We see that the species are well separated by the petal data. It looks like a
# straight line could be fit through all these points. This calls for regression.
# Instead of just calling the lm() function for linear regression and printing it
# to the console, we will assign it to an object, giving it the name "model1".
model1 <- lm(Petal.Width ~ Petal.Length, data = iris)
# If we run print(model1), or just model1, we can see the intercept and slope.
model1
# Call:
# lm(formula = iris$Petal.Width ~ iris$Petal.Length)
# Coefficients:
# (Intercept) iris$Petal.Length
# -0.3631 0.4158
# The intercept means that hypothetically if the petal length was zero centimeters,
# the petal width is predicted to be -0.3631 centimeters. The more meaningful number
# is the slope, which says that for every centimeter that the petal length increases,
# the petal width (the response variable here) increases by 0.4158 centimeters.
# We can plot this regression line in blue, using the abline() function.
abline(model1, col = "blue")
# The natural question is, would we get a significantly better fit with a more
# complex model? A more complex model always gives a better fit, but the question
# is whether it is significantly better, better than would be expected by chance.
model2 <- lm(Petal.Width ~ Petal.Length + Species, data = iris)
model2
# Call:
# lm(formula = Petal.Width ~ Petal.Length + Species, data = iris)
# Coefficients:
# (Intercept) Petal.Length Speciesversicolor
# -0.09083 0.23039 0.43537
# Speciesvirginica
# 0.83771
# model2 still has one slope associated with petal length, but it has a different
# intercept for each species. The overall intercept is for the "setosa" species,
# which is first alphabetically. The coefficients for "versicolor" and "virginica"
# are the differences between them and "setosa". The slope in this model is constant.
# We can plot the three regression lines, with some difficulty.
abline(coef(model2)[1], coef(model2)[2], col = "black")
abline(sum(coef(model2)[c(1, 3)]), coef(model2)[2], col = "red")
abline(sum(coef(model2)[c(1, 4)]), coef(model2)[2], col = "green")
# The fit looks surprisingly different. In part this is because many points are
# sitting on top of each other in the scatterplot (each species has 50 observations).
# To test whether model2 is significantly better than model1, we have two choices:
# We can compare the models directly with the anova() function:
anova(model1, model2, test = "Chisq")
# This returns a p-value of 5.482e-10, or 5.482 x 10^-10, or 0.0000000005482.
# In other words, model2 is a highly significant improvement over model1.
# We can also use the summary() function to examine model2.
summary(model2)
# This method shows a separate p-value for each species coefficient. It means
# that "versicolor" is significantly different from "setosa" (p = 4.04e-05), and
# that "virginica" is significantly different from "setosa" (p = 4.71e-08).
# To test "virginica" against "versicolor" we would have to relevel the variable.
# The above examples have been linear regression models, because the response
# variable is a real number, at least in theory. When the response is binary,
# we use logistic regression, which is almost exactly the same except instead of
# predicting a value of the response, given a certain combination of predictors,
# it predicts the log-odds of the response, which is the natural logarithm of
# the odds, p / (1 - p), the probability of one outcome divided by the probability
# of the other. A log-odds of zero means a probability of 0.5. A positive log-odds
# means a greater probability, and a negative log-odds means a lesser one. A rough
# rule of thumb: an effect of +/- 1 log-odds is medium-sized, +/- 2 is large.
# We can load the department store data gathered by Labov in 1962:
ds <- read.csv("http://www.danielezrajohnson.com/ds.csv")
# if no Internet, try locally: ds <- read.csv(file.choose())
head(ds)
# This data set has 729 observations of a response variable "r" (1 means r was
# pronounced, 0 means it was not pronounced), with three predictor variables.
# store: which of the three department stores the observation was made at
# emphasis: whether it was the first repetition ("normal") or the second ("emphatic")
# word: whether it was the word "fourth" or the word "floor"
# Labov did not carry out regression on this data. If we do, we can see that
# only two of his three variables are statistically significant. Which ones?
# To fit a logistic model, use the glm() function and the binomial family, e.g.:
model3 <- glm(r ~ 1, ds, family = binomial)
model4 <- glm(r ~ store, ds, family = binomial)
model5 <- glm(r ~ emphasis + word, ds, family = binomial)
model6 <- glm(r ~ store + emphasis + word, ds, family = binomial)
# Note that the other predictors in a model can affect the significance as well as
# the coefficient estimates of another predictor. The "step-down" comparison (model5
# vs. model6) is usually more reliable than the "step-up" (model3 vs. model4).
# Another point about logistic regression is that it requires more observations
# to obtain the level of results of an equivalent linear regression.
################################################
# PART TWO: WHY DO WE NEED MIXED-EFFECTS MODELS?
################################################
# We saw that fixed-effects regression models overestimate the significance
# (underestimate the p-value) of higher-level predictors when the grouping
# structure of the data is ignored. This is probably their biggest drawback.
# If the data is not only grouped, but the amount of data from each group is
# not equal (unbalanced data), then fixed-effects models will produce estimates
# that are weighted accordingly, which is usually not the desired result.
# And specifically in logistic regression, predictor effects will be shrunken
# if the groups in the data (e.g. speakers) are regularly different.
# The following function compares fixed-effects and mixed-effects results for a
# simulated logistic regression data set, allowing you to vary the true underlying
# parameters. It assumes two groups of speakers, with a binary "between-speaker"
# effect between the groups (e.g. gender). There is also a "within-speaker" effect
# (e.g. speech style) that is constant for all speakers.
# Load the function by cutting-and-pasting (or Ctrl-R or Command-Enter), then
# run the function, experimenting with different parameter settings
compare <- function(speakers, tokens, between, speaker.sd, within){
w <- rep(c(-within / 2, within / 2), each = speakers * tokens)
b <- rep(rep(c(-between / 2, between / 2), each = speakers * tokens / 2), 2)
s <- rep(rep(rnorm(2 * speakers, 0, speaker.sd), each = tokens / 2), 2)
dat <- data.frame(w = w, b = b, s = s)
dat$y <- rbinom(nrow(dat), 1, plogis(dat$w + dat$b + dat$s))
if (within == 0) dat$w <- rep(c(0, 1), each = speakers * tokens)
if (between == 0) dat$b <- rep(rep(c(0, 1), each = speakers * tokens / 2), 2)
if (speaker.sd == 0) dat$s <- rep(rep(1:(2 * speakers), each = tokens / 2), 2)
m <- glm(y ~ as.factor(b) + as.factor(w), dat, family = binomial)
fixed.between.e <- summary(m)$coefficients[2, 1]
fixed.between.p <- summary(m)$coefficients[2, 4]
fixed.within.e <- summary(m)$coefficients[3, 1]
fixed.within.p <- summary(m)$coefficients[3, 4]
mm <- glmer(y ~ as.factor(b) + as.factor(w) + (1 | s), dat, family = binomial)
mixed.between.e <- summary(mm)$coefficients[2, 1]
mixed.between.p <- summary(mm)$coefficients[2, 4]
mixed.within.e <- summary(mm)$coefficients[3, 1]
mixed.within.p <- summary(mm)$coefficients[3, 4]
res <- data.frame(
between.est = c(fixed.between.e, mixed.between.e),
between.p = c(fixed.between.p, mixed.between.p),
within.est = c(fixed.within.e, mixed.within.e),
within.p = c(fixed.within.p, mixed.within.p))
rownames(res) <- c("fixed", "mixed")
signif(res, 3)}
# The following command says there are 20 speakers per group, 100 tokens per
# speaker, with a between-speaker effect size of 0.5 log-odds, a speaker intercept
# standard deviation of 1.0 log-odds, and a within-speaker effect size of 0.5 log-odds.
# The output gives the effects sizes and p-values for both effects for both models.
compare(20, 100, 0.5, 1.0, 0.5)
# between.est between.p within.est within.p
# fixed 0.145 0.0222 0.374 4.15e-09
# mixed 0.210 0.4460 0.431 2.45e-10
#
# (Since random numbers are being generated, your results will differ.)
# You should see that the between-speaker significance varies greatly between
# the fixed-effects and mixed-effects models, but the within-speaker effect is
# similar across models. Practice with between = 0, within = 0, and other settings.
# Practice repeating the function with the same settings to see the effect of chance.
# Remember than in a data set with a crossed grouping structure - for example,
# speaker and word - what is a "between" predictor with respect to one grouping
# factor can be a "within" predictor with respect to the other - and vice versa.
# Using crossed random effects (unlike this function), we can capture this variability.
# If speaker variability (speaker.sd) is very low, or zero, there will be no
# difference between the fixed-effects and mixed-effects models. This shows that
# the glmer() method is "safe" to use even if there might not be group variation.
compare(20, 100, 0.5, 0.05, 0.5)
# between.est between.p within.est within.p
# fixed 0.398 4.78e-10 0.446 3.02e-12
# mixed 0.398 4.78e-10 0.446 3.02e-12
# You may also notice that if speaker variability is high, it can obscure a real
# between-speaker effect. In this case, the fixed-effects model is more accurate.
# But the fixed-effects model also identifies spurious between-speaker effects,
# so the more conservative behavior of the mixed-effects model is still preferable.
##########################################
# PART THREE: FITTING MIXED-EFFECTS MODELS
##########################################
# To practice fitting models with random intercepts and random slopes, we will use
# a fairly typical sociolinguistic data set. It consists of 3000 observations of a
# binary variable, which is the presence or absence of post-vocalic /r/ on the
# Lower East Side of New York City. This data was collected by Kara Becker.
les.r <- read.csv("http://www.danielezrajohnson.com/les_r.csv")
# if no Internet, try locally: les.r <- read.csv(file.choose())
# This is a binary response variable, so we use logistic regression.
# We are estimating the log-odds of the response, log(p / (1 - p)), where
# p is the probability of a realized post-vocalic /r/.
# This data file has five possible predictor variables. Two of them, speaker and
# word, can be used as grouping factors. Since there are only 7 speakers, this
# could easily be used as a fixed effect. Note that 5 is the minimum recommended
# number of groups for a grouping factor.
# Age is a between-speaker factor. If we can, we would like to know if younger
# speakers are using more post-vocalic /r/. With only two younger speakers and
# five older speakers, this may be impossible to confirm.
# Stress is a between-word factor. Each word either has a stressed /r/, like "car",
# or an unstressed /r/, like "father". This is a major predictor of /r/ realization.
# Topic - whether the speaker is talking about the neighborhood or not - is neither
# a between-speaker factor nor a between-word factor. Each speaker is sometimes
# talking about the neighborhood, sometimes not. Similarly, any particular word
# can be used to talk about the neighborhood or to talk about something else.
# This matters if we are selecting random slopes. The different speakers could have
# different stress effects, so a random slope (stress | speaker) is possible. The
# different words could conceivably show different age effects, so (age | word) is
# a possible random slope. The effect of topic could vary by speaker and/or by
# word, meaning that both (topic | speaker) and (topic | word) are possible.
# First, however, we will fit a basic fixed-effects model and see what happens.
model7 <- glm(r ~ 1 + age + stress + topic, les.r, family = binomial)
# Looking at summary(model7), we see that as usual, the fixed-effects model is
# reporting incredibly significant p-values for all three predictors.
# What happens to those p-values if we use random effects? First, we will introduce
# random intercepts for speaker and word. We know that /r/ can vary along these lines.
model8 <- glmer(r ~ 1 + age + stress + topic + (1 | speaker) + (1 | word),
les.r, family = binomial)
# What happens if we use a "maximal" random effect structure, with random slopes
# for all three predictors? Such a maximal structure is the most conservative,
# and is recommended as long as the model fits within an acceptable length of time.
model9 <- glmer(r ~ 1 + age + stress + topic + (1 + stress + topic | speaker) +
(1 + age + topic | word), les.r, family = binomial)
# Luckily, model9 took less than a minute to fit. What has happened to the p-values?
# Do we still want to conclude that all three predictors affect the response?
# The summary() function is one way to get p-values, and especially when the
# predictors are binary, this is very convenient. We can also get p-values by
# model comparison.
# Another useful function in this regard is drop1(). It conducts a series of tests
# of the significance of all the predictors in the model. It can be slow.
drop1(model9, test = "Chisq")
# Single term deletions
# Model:
# r ~ 1 + age + stress + topic + (1 + stress + topic | speaker) +
# (1 + age + topic | word)
# Df AIC LRT Pr(Chi)
# 3503.7
# age 1 3509.9 8.1649 0.0042708 **
# stress 1 3515.6 13.9107 0.0001917 ***
# topic 1 3506.6 4.8949 0.0269358 *
# In this section, we have been focusing on p-values. At least for significant
# predictors, we also want to discuss the actual effect size estimates. In a mixed
# model, we also have information about the random effect variances as well as the
# correlations between them.
# It is also possible to look at the individual random effect estimates, which
# are not parameters of the model, but are still useful. This will be discussed
# further in the next section.
##############################################
# PART FOUR: WORKING WITH MIXED-EFFECTS MODELS
##############################################
# The output of a lmer() or glmer() model contains a lot of information. Some of it
# is easy to understand, but other parts require some explanation.
# Let's look at the output of model9. Either
print(model9)
# or simply entering
model9
# will produce the following output:
# Generalized linear mixed model fit by maximum likelihood ['glmerMod']
# Family: binomial ( logit )
# Formula: r ~ 1 + age + stress + topic + (1 + stress + topic | speaker) +
(1 + age + topic | word)
# Data: les.r
# AIC BIC logLik deviance
# 3503.716 3599.818 -1735.858 3471.716
# Random effects:
# Groups Name Std.Dev. Corr
# word (Intercept) 0.6808
# ageyounger 0.8777 0.03
# topicother 0.3554 0.71 -0.44
# speaker (Intercept) 0.2832
# stressyes 0.1162 0.22
# topicother 0.2320 0.99 0.34
# Number of obs: 3000, groups: word, 544; speaker, 7
# Fixed Effects:
# (Intercept) ageyounger stressyes topicother
# -1.5999 1.2534 0.7886 0.3622
# Let's go through this line by line. First, the top four lines:
# # Generalized linear mixed model fit by maximum likelihood ['glmerMod']
# # Family: binomial ( logit )
# # Formula: r ~ 1 + age + stress + topic + (1 + stress + topic | speaker) +
# (1 + age + topic | word)
# # Data: les.r
# This reminds you what sort of model has been fit, where a "generalized" LMM
# means anything but an ordinary linear mixed model with a numeric response.
# This is a bit imprecise since LMM is one case of GLMM, but it's used this way.
# The next two lines are basically four versions of the same thing. All these models
# are fit by maximum likelihood, which means, more or less, that the software finds
# the parameter values (directly related to the predictor estimates) that make the
# observed data most probable. Of course, the chance of obtaining the exact data
# observed is still very small, even when it is maximized. In this case, the
# logarithm of the likelihood, at its maximum, is -1735.858.
# # AIC BIC logLik deviance
# # 3503.716 3599.818 -1735.858 3471.716
# The deviance is defined as -2 times the log-likelihood. This statistic is what
# is used to compare models, because under the null hypothesis the deviance follows
# a chi-squared distribution with degrees of freedom equal to the change in
# parameters. What does this mean?
# Say you have a model with one parameter - a simple intercept. You are testing
# whether a model with two parameters - an intercept and a slope, for example -
# offers a significant improvement (it will always offer some improvement, but so
# would adding any parameter). The change in deviance, assuming the true value of
# the new parameter is zero (and a few other things), is chi-squared distributed.
# This allows us to calculate the p-value. This is called a likelihood ratio test.
# Both anova(m1, m2) and drop1(m1, test = "Chisq") perform likelihood ratio tests.
# The AIC and BIC are based on the deviance, penalized for the number of parameters
# in the model. Instead of performing a formal model comparison test, some people
# prefer to choose the model with a lower value of AIC or BIC. I am not fully
# informed on why they do this, or how it is really that different from a LRT.
# You can extract these values from any model as follows:
logLik(model9)
deviance(model9)
AIC(model9)
BIC(model9)
# None of these four statistics are useful to report, because their values depend on
# the size of the data set. Only models fit to the same data can be compared this way.
# Let's move to the fixed effects part of the output, which fixef() can give us too:
fixef(model9)
# (Intercept) ageyounger stressyes topicother
# -1.5999 1.2534 0.7886 0.3622
# The model's prediction is always obtained by adding up these coefficients. The
# problem is sometimes knowing which to add. Here, the intercept value is the
# predicted log-odds for an older speaker, in an unstressed syllable, talking about
# the Lower East Side Neighborhood. If the speaker was younger, we would add 1.2534.
# If the syllable was stressed, we would add 0.7886. If the topic was something else,
# we would add 0.3622.
# We could relevel any of these factors and re-fit the model. For example:
# les.r$topic <- relevel(les.r$topic, "neighborhood")
# All these numbers are in log-odds units, but once they are combined, they can
# be converted back to probabilities using the plogis() function.
plogis(fixef(model9)[1]) # 0.168 (older, unstressed, neighborhood topic)
plogis(sum(fixef(model9)[1:4])) # 0.691 (younger, stressed, other topic)
# How do these predictions compare to the actual observed proportions? Let's see.
mean(les.r$r[les.r$age == "older" &
les.r$stress == "no" &
les.r$topic == "neighborhood"]) # 0.224
mean(les.r$r[les.r$age == "younger" &
les.r$stress == "yes" &
les.r$topic == "other"]) # 0.686
# The intercept cell is quite far from its prediction. There are two likely reasons for
# this. First, if we have unmodeled interactions in the fixed effects, some of the
# cells will not fit well. Second, this model has random effects for speaker and word.
# The fixed effects reported are for a sort of average speaker and word. However,
# word, especially, tends to be a very skewed variable. There will always be a few
# very common words, that may favor or disfavor the response. The mixed model
# largely counteracts this weighting.
# Calculating the fixed effects becomes more complicated when there are continuous
# predictors, and especially when there are interactions.
# Finally, let us examine what is unique to the mixed model, the random effects.
# We can also use the VarCorr() extractor to obtain this part of the output:
VarCorr(model9)
# Groups Name Std.Dev. Corr
# word (Intercept) 0.68076
# ageyounger 0.87774 0.031
# topicother 0.35537 0.706 -0.440
# speaker (Intercept) 0.28318
# stressyes 0.11620 0.224
# topicother 0.23204 0.993 0.337
# Above, we see 12 numbers. These are all parameters of the model, on a par with
# the fixed-effects estimates. This model has 4 fixed-effects parameters and 12
# in the random effects. Of these 12, six are estimates of standard deviations.
# The other six are correlation parameters, which we won't discuss further.
# The model is using the sample of words in the data and making an estimate that
# the population of all words varies - some words favoring the response, some words
# disfavoring it - following a normal distribution with a standard deviation of 0.681.
# That the random effects are normally distributed is one of the assumptions of
# the mixed-effects model, although I have never seen it addressed how bad it is
# if this assumption is not quite met...
# The intercept variation by speaker is 0.283. This is a very low number. It may
# be that with only 7 speakers, the software is not estimating it properly.
# The other standard deviations can be interpreted in the same way, except they are
# for the slopes, not the intercepts. Speakers - all speakers, insofar as they can be
# estimated from a sample of 7 - vary in their stress effect according to a normal
# distribution with a standard deviation of 0.116. They vary in their topic effect
# with a standard deviation of 0.232.
# If these random effects had been larger, we would have seen larger changes to the
# model, as we added them. A large intercept standard deviation for speaker would
# usually take away from the significance of a between-speaker predictor like age.
# A large standard deviation for the stress random slope would lower the overall
# significance of stress. However, all these variance estimates are quite small.
# The final element of the lmer/glmer model is not visible in the print() output.
# These are the so-called BLUPs (best linear unbiased predictors) or "conditional
# modes" of the random effects. We obtain these with the ranef() extractor function.
# ranef(model9) # too much output!
ranef(model9)$speaker
# (Intercept) stressyes topicother
# Ann -0.12699043 -0.070881008 -0.11756414
# Lindsey 0.13570318 -0.016811909 0.10338279
# Lucille -0.05119173 0.026939650 -0.03408799
# Mae -0.25268322 -0.096255293 -0.22313884
# Maggie 0.57757161 0.076152021 0.47551887
# Michael -0.04597223 0.009819603 -0.03402146
# Tobias -0.08541389 0.044407880 -0.05694121
# These effects are not independent parameters of the model. They are a kind of
# best guess, given the variance parameters that are estimated. In any case,
# they can be treated as deviations from the corresponding fixed effects.
# Note: As a rule, always include fixed effects corresponding to each random effect!
# So if we take the fixed effects again:
fixef(model9)
# (Intercept) ageyounger stressyes topicother
# -1.5999001 1.2534203 0.7885791 0.3621711
# We know that the age effect does not vary by speaker - this makes no sense - but
# the other fixed effects do: the intercept, the stress effect, and the topic effect.
# We can add the fixed and random effects together to see the total prediction.
# There is another extractor function that does this for you: coef().
coef(model9)$speaker
# (Intercept) ageyounger stressyes topicother
# Ann -1.726891 1.25342 0.7176981 0.2446070
# Lindsey -1.464197 1.25342 0.7717672 0.4655539
# Lucille -1.651092 1.25342 0.8155188 0.3280831
# Mae -1.852583 1.25342 0.6923238 0.1390323
# Maggie -1.022328 1.25342 0.8647312 0.8376900
# Michael -1.645872 1.25342 0.7983987 0.3281497
# Tobias -1.685314 1.25342 0.8329870 0.3052299
# Even though there is a column for age, this means nothing and shouldn't be there.
# The age effect of +1.253 has not been applied here, and it should be (to Lindsey
# and Tobias, the younger speakers) in order to make this a more sensible display.
# Of course, there is a corresponding section of ranef(model9) and coef(model9)
# for the word grouping factor, it was just easier to deal with the subjects.
#################
SETTING CONTRASTS
#################
# Sometimes sum contrasts (or even another type) work better than baseline-treatment
# contrasts for a particular model. You can set contrasts globally or variable-by-
# variable. To set contrasts globally, use one of these commands:
options(contrasts = c("contr.sum", "contr.poly")) # sets sum contrasts
options(contrasts = c("contr.treatment", "contr.poly")) # sets treatment contrasts
# To set the contrasts for an individual variable:
contrasts(les.r$age) <- contr.sum(2) # 2 is the number of levels
contrasts(les.r$age) <- contr.treatment(2)
# In treatment contrasts, the first level - use levels() to check - is the baseline,
# and does not appear in the output. The other levels are relative to the baseline.
# In sum contrasts, the first level is labeled 1, the second level 2, etc. The
# coefficients represent deviations from the mean. The last level does not appear
# in the output. Its coefficient would be zero minus the sum of the other levels.