# INTRODUCTION TO REGRESSION IN R AND IN RBRUL
# Daniel Ezra Johnson
# SSS 5
# 30 July 2014
# THIS FILE CAN BE DOWNLOADED FROM:
# http://www.danielezrajohnson.com/sss5_workshop.R
# INTRODUCTION
# You should save this file on your computer.
# Then make sure you re-open it in R.
# This file is an example of a (heavily annotated) R script.
# Instead of manually typing in R commands, you can also highlight lines in a script,
# and press Command-Enter (Mac) or Ctrl-R (Windows) to execute those commands.
# Many people arrange the desktop so that the script window is on one side of the
# screen, and the R console on the other.
# This makes it easy to work in both windows and go back and forth.
# All the lines beginning with # are comments.
# This means that they have no effect when a section of text is run in R.
# In fact, R will ignore whatever follows a # even within a single line.
# I will use such in-line comments for less important notes.
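# For example, if you run the following line, R evaluates "1 + 1" and ignores
# everything after the #:
1 + 1 # this in-line comment has no effect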
# Try typing "2 + 2" into the R Console, followed by Return/Enter.
# Now highlight the "2 + 2" below, in the script, and press Command-Enter/Ctrl-R.
# The same thing should happen in both cases.
2 + 2
# Did you get this output?
# [1] 4 # the [1] can be ignored here
# What about the following:
2 + 2 * 3
# Did you get 4 * 3 = 12?
# No - the result is 8.
# As you can see from this result, R follows the standard "order of operations".
# Unless you put parentheses, multiplication will be carried out before addition.
(2 + 2) * 3
# Now you should get the answer 12.
# We often have to use parentheses in R.
# The R GUI uses colored parentheses to help you not lose track of them.
# WHAT IS R AND WHAT IS RBRUL?
# R is a programming language or "environment" for statistics and data analysis.
# With R, we can perform practically any mathematical or statistical calculation.
# We can use hundreds of built-in functions and also define our own functions.
# We can obtain statistics, including regression coefficients and p-values, easily.
# In this workshop, we will explore regression using R proper, and also with Rbrul.
# Rbrul is a front-end interface to some of the regression functions in R.
# Rbrul may be easier and more intuitive - but R is much more powerful and flexible.
# INSTALLING PACKAGES AND RBRUL
# Execute the following commands to install necessary packages.
# If you already did this, you don't have to do it again (but it doesn't hurt!).
# Be careful with capitalization: R is case-sensitive.
install.packages("ggplot2")
install.packages("lme4")
install.packages("lmerTest")
# Once the packages are installed, you must load them with the library() command.
# Unlike install.packages(), you don't need quotation marks for library().
library(ggplot2)
library(lme4)
library(lmerTest)
# We install Rbrul with the command source(), which loads an external program.
# Assuming you have an Internet connection, run the following command:
source("http://www.danielezrajohnson.com/Rbrul.R")
# If the source() command returns an error message, try it again.
# If you don't have access to the Internet right now, it won't work.
# However, if you downloaded the file Rbrul.R previously, you can substitute
# the local path to the saved file, for example: source("~/Desktop/Rbrul.R").
# In general, you should source Rbrul from the Web, to get the latest update.
# A window will open saying "This window has opened to help Google Analytics
# track the spread of Rbrul around the globe. You may close it at any time."
# You may close it at any time.
# THREE GOALS OF REGRESSION
# Traditionally, there are three reasons why we build regression models:
# One is to quantitatively ESTIMATE the effects of predictors on a response.
# Predictor means independent variable; response means dependent variable.
# A second reason is to TEST HYPOTHESES (for example, significance testing).
# A third reason is to use observed data to PREDICT new or unseen data.
# This workshop will focus on estimation and simple hypothesis testing.
# We will ask how big an effect a predictor has on a response (estimation).
# And we'll ask if the effect could be the result of chance (hypothesis testing).
# WHAT ARE MIXED MODELS?
# Mixed models or mixed-effects models are a fairly new type of regression model.
# They're useful if data has a grouped structure (e.g. tokens grouped by speaker).
# I believe that many sociolinguistic situations require mixed-effects modeling.
# This makes it hard to properly illustrate "ordinary" regression with real data!
# ORDINARY (FIXED-EFFECTS) REGRESSION IN R
# We can create a pretend data set with a single command:
toy <- data.frame(
x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
y = c(1, 3, 2, 5, 4, 7, 4, 6, 5, 7))
# Note that we could have also expressed x as "0:9" instead of "c(0, 1, ..., 9)".
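# A quick check that the two notations describe the same sequence:
all(0:9 == c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)) # TRUE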
# We can plot these points with qplot(), a function in the ggplot2 package.
qplot(x = x, y = y, data = toy)
# When you open a graphing window, it may need to be resized to display correctly.
# Here, our response (y) is numeric, and we will perform a linear regression.
# Linear regression finds the straight line that is the best fit for these points.
# Linear regression minimizes the "sum of squared errors".
# It minimizes the squared vertical distances from the points to the line.
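# To make this concrete, here is a small helper (sse() is our own illustrative
# function, not built into R) that scores any candidate line by its sum of
# squared errors on the toy data:
sse <- function(intercept, slope) sum((toy$y - (intercept + slope * toy$x))^2)
sse(2, 0.5) # 13.25
sse(2, 0.6) # 14.2 - a worse fit; lm() below finds the line minimizing this sum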
# We can add this best-fit line to the plot with the geom_line() function.
qplot(x = x, y = y, data = toy) + geom_line(stat = "smooth", method = "lm")
# Any straight line can be described by two numbers: its slope and its intercept.
# The slope of the line is the amount it goes up as you move one unit to the right.
# (If the line goes down as you move to the right, the slope will be negative.)
# The intercept is the value of y at x = 0 (where the line "intercepts" the y-axis).
# Just by eyeballing the line on the plot, we would say that its intercept is 2.
# And we would eyeball that its slope is a little bit more than 0.5.
# Together, these two parameters make up a simple rough model of the data.
# R fits linear regression models - finds the best line - with the lm() command.
# We feed a formula of the form "y ~ x" to lm(), and tell it where the data is.
lm(y ~ x, data = toy)
# The relevant part of this output is the "Coefficients", which are as follows:
# (Intercept) x
# 2.0000 0.5333
# We see that we were correct in our eyeballing. The intercept is exactly 2.
# The slope is 0.5333, meaning that y goes up 0.5333 for every increase of 1 in x.
# This is estimation of the parameters of a model obtained by linear regression.
# A hypothesis we might want to test is the question of statistical significance.
# We might want to know if the slope, 0.5333, is significantly different from zero.
# More precisely, we want to know how often a slope at least that large would be
# seen, on average, if we just randomly threw 10 points against a wall.
# Above, we just printed the lm() model directly to the screen.
# We can also store the model and give it a name.
m1 <- lm(y ~ x, toy)
# Then we can print it out, or work with it as needed. Test the following commands:
m1 # prints the parameters of the model
coef(m1) # prints just the coefficients of the model
coef(m1)[1] # prints the intercept (the first coefficient)
coef(m1)[2] # prints the slope (the second coefficient)
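# Two other useful extractor functions, if you want to explore:
fitted(m1) # the model's predicted y value at each x
residuals(m1) # the vertical distances from each point to the fitted line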
# To test the null hypothesis that the true slope is zero, we can fit a null model.
# This is a model with an intercept (the 1), but no slope. The slope is fixed at 0.
m0 <- lm(y ~ 1, toy)
m0
# _________________________________________________________________________________
# EXERCISE 1
# In the model m0, what is the value of the intercept?
# _________________________________________________________________________________
# Note that m1, with an intercept and a slope for x, could also have been fit
# with the formula y ~ 1 + x. The formula y ~ x is short for y ~ 1 + x. You could
# also tell the program to omit the intercept, forcing the line to pass through
# the origin (0, 0). You would do this with the formula y ~ 0 + x.
# To test the significance of the slope in m1, we need to compare two models:
# m1 and m0. m1 has an intercept and a slope, while m0 only has an intercept.
# Because it has an extra parameter, m1 will ALWAYS fit the data better than m0.
# But that extra complexity has a cost: in general, we prefer simpler models.
# We do a test to decide if the better fit is worth the added complexity. This
# can be called a "significance test" - testing the "statistical significance"
# of the slope. It is also called a "hypothesis test" - testing the hypothesis
# that the underlying slope is zero. If we reject this, then we have an effect!
# In the case of ordinary linear regression, this is done with an F-test.
# More generally, two models are compared with a likelihood-ratio chi-squared test.
# Note: this is also how GoldVarb compares models in its stepwise routines.
# If you haven't heard of GoldVarb, consider yourself lucky.
# We use anova() to compare the models, perform an F-test and obtain a p-value.
# Despite the name, this command does not necessarily carry out "an ANOVA".
# ANOVA is specifically for comparing discrete groups, which we're not doing here.
anova(m0, m1) # note that the order of terms does not matter
# In the anova(m0, m1) output, focus on where it says "Pr(>F)":
# Pr(>F)
# 0.005163
# This is our p-value. It is well below the usual .05 significance threshold.
# Therefore, we tend to believe that the observed trend is real and not just noise.
# This toy example had a single predictor, x.
# Naturally, regression becomes more useful when there is more than one predictor.
# Then, we estimate several "slopes" simultaneously (this is harder to visualize).
# We might have a formula y ~ x1 + x2, or y ~ x1 * x2 (indicating an interaction).
# Note: this is called "multiple regression", not "multivariate regression".
# Multivariate regression is when there is more than one dependent variable.
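# As a quick sketch of multiple regression (x1, x2, and y.sim are simulated
# here, not part of our toy data):
set.seed(1) # for reproducible random numbers
x1 <- rnorm(100)
x2 <- rnorm(100)
y.sim <- 1 + 2 * x1 - 1 * x2 + rnorm(100)
lm(y.sim ~ x1 + x2) # estimates an intercept and two slopes simultaneously
lm(y.sim ~ x1 * x2) # additionally estimates an x1:x2 interaction term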
# _________________________________________________________________________________
# EXERCISE 2
# If the y values were chosen at random, any trend would have to be due to chance.
# Repeat the following commands by highlighting and using Command-Enter/Ctrl-R.
# Keep an eye on the p-values.
x <- 1:10 # x = 1, 2, ..., 10
y <- sample(1:10) # y = 1, 2, ..., 10 in random order
m1 <- lm(y ~ x)
m0 <- lm(y ~ 1)
anova(m0, m1)
# How often is the p-value below .05?
# With a true slope of zero, the p-value should fall below .05 about 5% of the time.
# Try it 100 times! How many times did you get a "false positive" (Type I error)?
# _________________________________________________________________________________
# ORDINARY (FIXED-EFFECTS) REGRESSION IN RBRUL
# R works by entering commands (or sets of commands) at the > prompt.
# Rbrul is based on text menus and it interacts with the user using questions.
# You can cut and paste multiple lines into Rbrul at once, but it's not a great idea.
# You also can't go backwards if you make a mistake.
# However, you can quit Rbrul (either from a menu or with the Stop button) and
# go back in without starting over. Or you can start over with the "reset" option.
# We will attempt to carry out the same regression that we did above, in Rbrul.
# First, note that Rbrul always uses the object named "datafile" to hold its data.
# We can load any file with Rbrul, but since we already have our "toy" data frame,
# we'll use the assignment operator (a left-pointing arrow) to create the "datafile".
datafile <- toy
# Note that you can use = instead of <-, but in R circles, this is a grave faux pas.
# We start Rbrul with the R command rbrul().
rbrul()
# After a moment and some messages, you should see the data structure and main menu.
# If you don't, it usually means that required packages have not loaded.
# On office computers, you may need to run R "as administrator" to install packages.
## Current data structure:
## x (numeric with 10 values): 0 1 2 3 4 ...
## y (numeric with 7 values): 1 3 2 5 4 ...
## Total tokens: 10
## MAIN MENU
## 1-load/save data 2-adjust data
## 4-crosstabs 5-modeling 6-plotting
## 8-restore data 9-reset 0-exit
# To fit a regression model, we type "5" for "modeling", and press Return/Enter.
5
# We then see the modeling menu:
## MODELING MENU
## 1-choose variables 2-one-level (recommended)
## 3-step-up 4-step-down 5-step-up/step-down
## 6-plotting 8-settings 9-main menu 0-exit
## 10-chi-square test
# The first thing we have to do is choose our variables.
# For example, we have to tell Rbrul which variable is the response.
# We enter "1" to choose variables.
1
## Choose response (dependent variable) by number (1-x 2-y).
# To choose y as the response variable, we type "2" followed by Return/Enter.
2
## Type of response? (1-continuous Enter-binary)
# Here we type "1" plus Return/Enter, because y is a numeric/continuous variable.
# When the response is a real number, it is called linear regression.
# (If the response was binary, we would be doing logistic regression. See below.)
1
## Choose predictors (independent variables) by number (1-x).
# Here we only have one choice: we type "1" (plus Return/Enter) to select x.
1
## Are any predictors continuous? (1-x Enter-none)
# Indeed, our predictor x is a numeric/continuous predictor, so we enter "1".
1
# Since there are no other predictors, Rbrul returns us to the modeling menu,
# after displaying the "Current variables": a continuous response y, and a fixed,
# continuous predictor x. Here, "fixed" means a fixed effect, an ordinary predictor.
# The alternative to fixed effects is random effects, which we will use later on.
# There is only one predictor being considered for our model: x.
# Therefore, it doesn't matter if we choose "step-up", "step-down", or "one-level".
# Stepwise modeling can be dangerous, so we will get in the "one-level" habit,
# entering "2" for a one-level analysis.
2
## ONE-LEVEL ANALYSIS WITH x (0.00516)
##
## $x
## continuous coef
## +1 0.533
##
## $misc
## deviance AIC df intercept grand mean R2.fixed R2.random R2.total
## 12.933 36.951 2 2 4.4 coming soon!
# The above output is Rbrul's summary of the model with response y and predictor x.
# We see the p-value for x at the top of the summary: 0.00516.
# It is the same as what we obtained before, just rounded to 3 significant digits.
# We see the slope, 0.533, under the heading "coef" in the "x" section.
# We see the intercept, 2, under the heading "intercept" in the "misc" section.
# We also see other information, like the "grand" (overall) mean.
# We also should see the R-squared value, a measure of the fit of the model.
# Right now, there is a problem with the calculations; it says "coming soon!"
# MIXED-EFFECTS REGRESSION IN RBRUL
# We will now run a regression on a real linguistic data set.
# The response variable is post-vocalic or coda /r/; there are 1597 tokens of it.
# 40 speakers in Gretna, Scotland, read a word list containing 40 or so coda /r/s.
# The variable was coded narrowly, but we focus on zero vs. any other realization.
# This is a binary dependent variable, so we will be running a logistic regression.
# Predictors are modeled as affecting the LOG-ODDS of the response in a linear way.
# Log-odds are a way of taking the 0-to-1 probability scale, and stretching it out,
# so that it covers all numbers from negative infinity to positive infinity.
# Log-odds equals the natural logarithm of the odds, or log(p / (1 - p)).
# A log-odds of 0 corresponds to 0.5 probability.
# A log-odds of +1 corresponds to a probability of 0.73; -1 corresponds to 0.27.
# A log-odds of +2 corresponds to a probability of 0.88; -2 corresponds to 0.12.
# With coda /r/, a log-odds of 0 corresponds to 50% /r/ pronunciation.
# But regression mainly deals with effects, that is, changes in log-odds.
# On the log-odds scale, the difference between 50% and 70% /r/ is 0.85 log-odds.
# A similar change of 0.85 would get us from 70% to 85% /r/, from 85% to 93% /r/,
# or from 93% to 97%.
# We might say that a fixed change in log-odds "counts for less" the closer we are
# to 100% (or to 0%). Alternatively, we could say that a change from 5% to 10% is
# simply larger than a change from 40% to 50%. Both interpretations can make sense.
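# R's built-in functions qlogis() and plogis() convert between the two scales,
# so we can verify the figures above:
qlogis(0.5) # 0: a probability of 0.5 is 0 log-odds
plogis(1) # 0.731
plogis(-1) # 0.269
qlogis(0.7) - qlogis(0.5) # 0.847, the 50%-to-70% step mentioned above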
# To load the Gretna datafile, first enter "9" to go to the "Main Menu".
# If you have already saved the Gretna file on your computer:
1 # load/save data
[Enter] # do not save current data
c # commas separate the columns
[choose file] # navigate to the file and choose it with the wizard
# If you have not saved the Gretna file on your computer, and have Internet access:
0 # to exit Rbrul
datafile <- read.csv("http://www.danielezrajohnson.com/gretna.csv")
rbrul() # to re-enter Rbrul
# We should now be at the main menu in Rbrul.
# The "Current data structure" shows:
## word (factor with 58 values): air barred bars beard car ...
## set (factor with 12 values): SQUARE START NEAR NORTH FORCE ...
## position (factor with 2 values): coda-final coda-internal
## r4 (factor with 4 values): approximant zero tap trill
## r2 (integer with 2 values): 1 0
## speaker (factor with 40 values): G15F G16F G17F G17F2 G18F ...
## age (integer with 24 values): 15 16 17 18 19 ...
## gender (factor with 2 values): female male
## agegroup (factor with 2 values): young old
## class (factor with 2 values): working middle
## Total tokens: 1597
# Our response variable will be r2, which is coded 0 for zero /r/ and 1 otherwise.
# We will do a regression to estimate the size of three "fixed effects":
# set - how does the preceding vowel affect coda /r/?
# age - how does speaker age affect coda /r/?
# gender - how does speaker gender affect coda /r/?
# We may not be specifically interested in individual words and speakers, today,
# but we must take them into account when the data is grouped by word and speaker.
# That is, the data has multiple observations of each word and of each speaker.
# Word is nested within set (each word is always in the same lexical set).
# Speaker is nested within gender and age (the data was collected all at one time).
# This matters because if there is a lot of by-speaker or by-word variation,
# what look like significant predictor effects can plausibly arise just by chance.
# We control for this by using two "random effects" for speaker and word.
# Random effects usually have many levels, and nest within predictors of interest.
# From the Rbrul main menu, enter the following commands (manually):
5 # modeling menu
1 # choose variables
5 # choose r2 as the response
[Enter] # r2 is a binary variable
2 # we will model what favors coda /r/, not what disfavors it
7 # choose predictor age
8 # choose predictor gender
2 # choose predictor set
1 # choose predictor word
6 # choose predictor speaker
[Enter] # finish choosing predictors
# Rbrul should now be asking "Are any predictors continuous?" Age is continuous.
# As a rule, it's better to leave naturally-numeric predictors like age as
# continuous, rather than (for example) making several discrete age groups.
7 # mark age as continuous
[Enter] # finish (the other predictors are categorical)
# Rbrul now asks if we want to consider "a pairwise interaction of fixed effects".
# There is sometimes confusion about what an interaction means, in statistics.
# Suppose our older speakers are mostly male and our younger speakers mostly female.
# This is not an interaction! It should be called an association or a lack of
# independence. The term "non-orthogonality" has also been used in sociolinguistics.
# An interaction is when two or more predictors affect the response in a way
# that is not independent. For example, if the effect of gender on /r/ was greater
# for older speakers, that would be an interaction between gender and age. We will
# not be examining possible interactions right now; feel free to experiment later!
[Enter] # finish (don't consider any interactions)
# Rbrul now asks if there are any random intercepts. We select speaker and word.
# This model will allow individual speakers and words to vary in their overall
# rates of /r/ realization, which is sort of like giving each word and speaker
# its own intercept. A random intercept is one type of random effect. A more
# advanced type of random effect, which we will skip over today, but is often
# necessary to consider, is called a random slope. An example would be if some
# words showed more /r/ with males and other words showed more /r/ with females.
# Such things have been known to happen, but we will not consider this here.
1 # mark word as random intercept
6 # mark speaker as random intercept
[Enter] # finish (the other predictors are fixed effects)
# Rbrul now asks about random slopes corresponding to each random intercept.
[Enter] # no by-word random slopes
[Enter] # no by-speaker random slopes
# Rbrul now takes us back to the modeling menu.
# We will choose "2" for a one-level model, considering all predictors.
# A one-level model is like a step-down model except it only does the first step.
2 # choose one-level model
[Enter]
# After a minute... a long output results, including estimates for word and speaker.
# Note: mixed-effects models can take a long time to fit - sometimes very long!
# They can also have estimation problems that do not affect fixed-effects models.
# Currently, Rbrul suppresses some of the warnings that the actual model-fitting
# functions can produce. Therefore, you may want to try fitting the same models
# in R (see the next section) to be sure that you are not being led astray.
# (In the "Settings" menu, you can switch off the long random-effect display.)
# Scroll back to the top, where we see coefficients for age, gender, and set.
## ONE-LEVEL ANALYSIS WITH word [random, not tested] and speaker [random, not
## tested] and set (7.1e-05) + age (0.000392) + gender (0.734)
##
## $gender
## factor logodds tokens 1/1+0 centered factor weight
## male 0.131 795 0.630 0.533
## female -0.131 802 0.587 0.467
##
## $set
## factor logodds tokens 1/1+0 centered factor weight
## FIR 1.109 119 0.748 0.752
## OUR 0.859 7 0.857 0.702
## NURSE 0.597 225 0.689 0.645
## SQUARE 0.413 73 0.671 0.602
## NORTH 0.238 125 0.672 0.559
## HEARD 0.000 152 0.625 0.5
## FORCE -0.235 180 0.617 0.441
## NEAR -0.245 86 0.616 0.439
## START -0.439 155 0.606 0.392
## CURE -0.482 112 0.562 0.382
## FIRE -0.598 12 0.667 0.355
## LETTER -1.218 351 0.470 0.228
##
## $age
## continuous logodds
## +1 0.059
# First look at the top line. It says we have word and speaker as random effects.
# These random effect terms are not tested for significance.
# Then it gives (lexical) set, with a p-value of 7.1e-05, which means 0.000071.
# Lexical set, then, is a highly significant predictor of coda /r/.
# Age is given a p-value of 0.000392, which is also highly significant.
# Gender is not significant: p = 0.734. The observed difference could be chance.
# Turning to the estimates, we see that the first column of numbers is in log-odds.
# Log-odds is the standard unit for reporting the results of logistic regressions.
# We just remember that a given change in log-odds has its biggest effect near 50%,
# and that 0% and 100% are infinitely far away in log-odds terms.
# The predictor age is given a coefficient of 0.059.
# This means that for every year older a speaker is, the model says their log-odds
# of producing coda /r/ increases by 0.059 - so the odds of /r/ go up by 6.1%.
exp(0.059) # 1.061
# The predictor "set" has 12 coefficients, one for each lexical set.
# However, it is important to understand that these coefficients add up to zero.
# Each one represents the difference from the mean for that lexical set.
# We see that the FIR set most favors coda /r/, with a coefficient of 1.109.
# We see that the LETTER set least favors coda /r/, with a coefficient of -1.218.
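# We can confirm the zero-sum property by adding up the printed coefficients:
sum(c(1.109, 0.859, 0.597, 0.413, 0.238, 0.000,
      -0.235, -0.245, -0.439, -0.482, -0.598, -1.218)) # -0.001, zero up to rounding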
# For each level of a factor, Rbrul also gives the number of tokens, the raw
# proportion of /r/ ("1/1+0"), and the VARBRUL (GoldVarb) factor weight.
# Note how positive log-odds correspond to factor weights over 0.5,
# negative log-odds correspond to factor weights under 0.5,
# and zero log-odds (for the HEARD lexical set) corresponds to 0.5 exactly.
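# In fact, each centered factor weight is simply the inverse logit of the
# corresponding log-odds coefficient, which we can check with plogis():
plogis(1.109) # 0.752, the factor weight shown for FIR
plogis(-1.218) # 0.228, the factor weight shown for LETTER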
# At the bottom of the output, we see the results for the random effects.
# First we will discuss word. Here are the top five and bottom five words.
## $word (random)
## random intercept tokens 1/1+0 centered factor weight
## for 0.381 39 0.769 0.594
## Kerr 0.375 39 0.744 0.593
## fur 0.373 39 0.795 0.592
## moored 0.324 32 0.656 0.581
## car 0.322 40 0.675 0.58
## ... ... ... ... ...
## Kirsty -0.346 38 0.658 0.414
## daughter -0.384 33 0.333 0.405
## turtle -0.472 40 0.575 0.384
## Kirriemuir -0.484 40 0.425 0.381
## person -0.64 40 0.450 0.345
# Remember that we already have a predictor for lexical set. After taking set
# into account, the word effects show which individual words favored or disfavored
# coda /r/. They show the deviations of each word from the prediction for their set.
# The word that behaved the most differently from its set's average was "person".
# Under set, we see that overall, the HEARD lexical set contained 62.5% coda /r/.
# But under word, we note that "person" contained /r/ only 45% of the time.
# So "person" receives a negative random effect coefficient: -0.64.
# In the other direction, "for" had 77% /r/ while the NORTH set as a whole had 67%.
# So "for" receives a positive random effect coefficient: 0.381.
# Immediately above the individual words, there is a kind of summary row.
## $word (random)
## intercept tokens 1/1+0 centered factor weight
## std dev 0.392 1597 0.609 ...
# Based on this sample data, the population of all words is estimated to vary
# with a standard deviation of 0.392. (A normal distribution is assumed.)
# Under speaker, we see a much higher number in the "std dev" position: 2.336.
## $speaker (random)
## intercept tokens 1/1+0 centered factor weight
## std dev 2.336 1597 0.609 ...
# This tells us that speakers are much more variable, or idiosyncratic, than words.
# Indeed, this is quite a large speaker standard deviation.
# This suggests that we may need to find other between-speaker predictors,
# besides age and gender, that help condition the use of coda /r/.
# In fact, we have left social class out of the above model. Including it would
# likely reduce the "left-over" speaker standard deviation.
# _________________________________________________________________________________
# EXERCISE 3
# Using the same Gretna data set, fit a model with:
# r2 as the binary response (1 as the application value)
# age, set, and position as fixed effects
# no fixed-effect interactions
# speaker and word as random intercepts
# no random slopes
# Does the "position" variable (e.g. "car" vs. "card") come out as significant?
# If so, which favors coda /r/ more, words like "car" or words like "card"?
# Which speaker favors coda /r/ the most, taking their age into account?
# Who favors it least for their age?
# _________________________________________________________________________________
# MIXED-EFFECTS REGRESSION IN R
# Let's enter "0" to exit Rbrul and return to R.
# We'll then perform the same regressions outside Rbrul that we did inside Rbrul.
# First, let's make a copy of Rbrul's datafile, and call it "gretna".
gretna <- datafile
# If we had never used Rbrul, we might load the gretna data with a command like this:
gretna <- read.csv("http://www.danielezrajohnson.com/gretna.csv")
# As above, we want to perform a regression with r2 as the response.
# We want to estimate the fixed effects of age, gender, and lexical set.
# We want to include random intercepts for speaker and word.
# We'll use the glmer() function, from the lme4 package, to do this regression.
# Observe how we build the following regression formula:
# r2 ~ age + gender + set + (1|speaker) + (1|word)
# The response goes at the left, followed by "~".
# The predictors come after "~". We list the variable names, connected by "+".
# A random intercept for z is specified by "(1|z)".
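# (For the record, a by-word random slope for gender - something we are not
# using today - would be written "(1 + gender|word)".)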
# Rbrul, by default, uses sum contrasts, meaning that factors are expressed as
# deviations from the mean.
# Now we'll actually fit the model as "m2". After the formula, we could enter
# "data = gretna", but just putting the object name "gretna" is sufficient.
# The argument "family = binomial" sets glmer() up for a logistic, not linear,
# regression. (For linear regression, we can put "family = gaussian" or omit it.)
# The "control" argument tells the glmer() function to ignore all the warnings
# it might be producing, as well as to perform up to a million optimization steps.
# If you left off the "control = ..." argument, you might have to troubleshoot
# convergence warnings yourself.
m2 <- glmer(r2 ~ age + gender + set + (1|speaker) + (1|word),
gretna,
family = binomial,
control = glmerControl(optimizer = "bobyqa",
check.conv.grad = "ignore",
check.conv.singular = "ignore",
check.conv.hess = "ignore",
check.scaleX = "ignore",
optCtrl = list(maxfun = 1000000)))
# This fits the model and stores it in an object called m2. To summarize m2,
# with a Wald-test p-value for each coefficient, we use summary().
summary(m2)
# Notice that R does not automatically print the details of the random effects
# (words and speakers). You can access these with a function called ranef().
# But we'll focus just on the fixed effects section of the output, at the top.
# Fixed effects:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.42513 0.92666 -2.617 0.008869 **
# age 0.05851 0.01545 3.787 0.000152 ***
# gendermale 0.26210 0.76801 0.341 0.732899
# setFIR 1.59092 0.52211 3.047 0.002311 **
# setFIRE -0.11665 0.83314 -0.140 0.888650
# setFORCE 0.24630 0.43302 0.569 0.569488
# setHEARD 0.48198 0.45939 1.049 0.294094
# setLETTER -0.73635 0.39725 -1.854 0.063798 .
# setNEAR 0.23679 0.51614 0.459 0.646397
# setNORTH 0.71978 0.48012 1.499 0.133825
# setNURSE 1.07863 0.43774 2.464 0.013737 *
# setOUR 1.34082 1.26878 1.057 0.290615
# setSQUARE 0.89413 0.57327 1.560 0.118828
# setSTART 0.04209 0.44513 0.095 0.924661
# ---
# Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
# You should see some similarities between this and the Rbrul output.
# The estimate for age is the same, and has the same interpretation.
# (Of course, this output includes way too many decimal places.)
# The estimate for gender is twice as large as it was in the Rbrul model.
# This is because R uses "treatment contrasts" (by default).
# Rbrul uses "sum contrasts" (by default).
# So while Rbrul tells us males' and females' difference from the mean,
# R tells us males' difference from females. The female group, set to 0, is absent.
# ("female" is the default baseline group because it comes first alphabetically.)
# We see the same thing at work when we look at the lexical sets.
# FIR still has the highest coefficient (1.59) and LETTER the lowest (-0.74).
# However, in the R model, these coefficients all represent the difference from
# the (absent, baseline) set, CURE, which comes first in alphabetical order.
# In situations like this, where there is no logical baseline level, we may prefer
# sum contrasts. To switch, try entering the following and fitting the model again:
options(contrasts = c("contr.sum", "contr.poly"))
m3 <- glmer(r2 ~ age + gender + set + (1|speaker) + (1|word),
gretna,
family = binomial,
control = glmerControl(optimizer = "bobyqa",
check.conv.grad = "ignore",
check.conv.singular = "ignore",
check.conv.hess = "ignore",
check.scaleX = "ignore",
optCtrl = list(maxfun = 1000000)))
summary(m3)
# Unfortunately, in this case you have to figure out which numbered level is which.
# And you have to fill in the coefficient for the missing level of each factor,
# using the "zero-sum principle". In this sense, Rbrul's output is more friendly.
# To return to treatment contrasts, we would type:
options(contrasts = c("contr.treatment", "contr.poly"))
# In the glmer() outputs, we see some p-values. These are called Wald tests.
# They are useful because there is one for each parameter (level of a factor),
# but they are less reliable than the term-wise p-values we get from anova().
# Also, hypothesis testing with anova() is not affected by the choice of contrasts.
m.full <- glmer(r2 ~ age + gender + set + (1|speaker) + (1|word),
gretna,
family = binomial,
control = glmerControl(optimizer = "bobyqa",
check.conv.grad = "ignore",
check.conv.singular = "ignore",
check.conv.hess = "ignore",
check.scaleX = "ignore",
optCtrl = list(maxfun = 1000000)))
m.noage <- glmer(r2 ~ gender + set + (1|speaker) + (1|word),
gretna,
family = binomial,
control = glmerControl(optimizer = "bobyqa",
check.conv.grad = "ignore",
check.conv.singular = "ignore",
check.conv.hess = "ignore",
check.scaleX = "ignore",
optCtrl = list(maxfun = 1000000)))
m.nogender <- glmer(r2 ~ age + set + (1|speaker) + (1|word),
gretna,
family = binomial,
control = glmerControl(optimizer = "bobyqa",
check.conv.grad = "ignore",
check.conv.singular = "ignore",
check.conv.hess = "ignore",
check.scaleX = "ignore",
optCtrl = list(maxfun = 1000000)))
m.noset <- glmer(r2 ~ age + gender + (1|speaker) + (1|word),
gretna,
family = binomial,
control = glmerControl(optimizer = "bobyqa",
check.conv.grad = "ignore",
check.conv.singular = "ignore",
check.conv.hess = "ignore",
check.scaleX = "ignore",
optCtrl = list(maxfun = 1000000)))
# Just now, we fit four models. One has all three predictors (the full model).
# Each of the other three is lacking one of the terms that we wish to test.
anova(m.full, m.noage) # this tests the significance of age: p = 0.0003922
anova(m.full, m.nogender) # this tests the significance of gender: p = 0.7342
anova(m.full, m.noset) # this tests the significance of set: p = 7.099e-05
# A useful alternative to doing these tests separately is the drop1() function.
drop1(m.full, test = "Chisq")
# Df AIC LRT Pr(Chi)
# <none> 1353.6
# age 1 1364.2 12.569 0.0003922 ***
# gender 1 1351.7 0.115 0.7341985
# set 11 1369.9 38.250 7.099e-05 ***
# ---
# Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
# Fortunately, these p-values are the same as the ones we obtained in Rbrul.
# This is no coincidence! Behind the scenes, Rbrul runs commands just like these.
# We saw that summary() applied to a glmer() model does not give the details of the
# random effects. It gives the overall standard deviations for speakers and words:
summary(m.full)
# ...
# Random effects:
# Groups Name Variance Std.Dev.
# word (Intercept) 0.1538 0.3922
# speaker (Intercept) 5.4582 2.3363
# ...
# Rounded, these same standard deviations formed the rightmost column of the
# Rbrul output. If we want to see the individual random effects, we use ranef().
ranef(m.full)
# Now we see a log-odds pseudo-coefficient for each individual word and speaker.
# (The true random effect parameters ARE the std. dev.'s for word and speaker.)
# $word
# (Intercept)
# air 0.153680473
# barred -0.345179443
# bars 0.038372690
# ...
# $speaker
# (Intercept)
# G15F -1.39409020
# G16F 1.98312077
# G17F -3.30176041
# ...
# You will notice that the speakers and words are in alphabetical order here,
# while Rbrul places them in order of their intercept. Depending on the purpose,
# the Rbrul interface may be easier to work with. Still, it is good to know how to
# accomplish the same tasks in R proper.
# _________________________________________________________________________________
# EXERCISE 4
# Load the Orkney data file (http://www.danielezrajohnson.com/orkney.csv).
# If you have a local copy and there is no Internet, load the local copy.
# Call this object "orkney". Fill in the blank to complete the read.csv() command.
orkney <- read.csv(_____________)
# We want to see if the GOOSE vowels in Orkney are fronting over time.
# That is, we want to see if there is a positive slope of F2.norm ~ year of birth.
# Make a subset of the data as follows.
goose <- c("boot", "food", "fuel", "mule", "mute", "pool", "refute", "stew",
"stewed", "sure", "through", "too", "tune")
orkney.goose <- subset(orkney, word %in% goose)
# Do a mixed-effects regression in R, using orkney.goose as the data frame.
# The response is called F2.norm. This is a continuous, NORMALIZED variable.
# The predictor is called year. This is also a continuous variable.
# We want to include random intercepts for speaker and word.
# This is a linear regression, not logistic, so use lmer(), not glmer().
# You can omit the argument "family = binomial", or use "family = gaussian".
# You shouldn't need the "control" arguments this time.
m.orkney <- lmer(_______________)
summary(m.orkney)
# Is the age (year of birth) effect statistically significant?
# To answer this, fit another model without the fixed effect of "year".
# Then compare the two models using anova().
# What is the size of the age (year of birth) effect?
# To answer this, inspect the "year" coefficient in the model with "year".
# Remember that the units will be Z-scores per year.
# _________________________________________________________________________________
# This is the end of the regression workshop. Thank you for participating!
# If you have any questions, feel free to email me at danielezrajohnson@gmail.com.
# I do make it a priority to respond to (and eventually solve!) all Rbrul problems.
# SOLUTIONS
# _________________________________________________________________________________
# SOLUTION 1
# In the model m0, what is the value of the intercept?
coef(m0)[1] # the intercept is 4.4
# _________________________________________________________________________________
# _________________________________________________________________________________
# SOLUTION 2
# This solution contains some more advanced R code, but it's not necessary.
# You could also do the "100-times test" manually, repeating Command-Enter/Ctrl-R.
p <- vector() # set up a vector to hold the p-values
for (i in 1:1000) # set up to loop 1000 times
{
x <- 1:10 # x = 1, 2, ..., 10
y <- sample(1:10) # y = 1, 2, ..., 10 in random order
m1 <- lm(y ~ x)
m0 <- lm(y ~ 1)
p[i] <- anova(m0, m1)[2, "Pr(>F)"] # captures the p-value from the anova() table
}
# How often is the p-value below .05?
# It should be below .05, on average, 5% of the time.
mean(p < .05) # 0.049
# _________________________________________________________________________________
# _________________________________________________________________________________
# SOLUTION 3
# Using the same Gretna data set, fit a model with:
# r2 as the binary response (1 as the application value)
# age, set, and position as fixed effects
# speaker and word as random effects
1 # choose variables
[Enter] # keep r2 as the response
7 # choose age as a predictor
2 # choose set as a predictor
3 # choose position as a predictor
1 # choose word as a predictor
6 # choose speaker as a predictor
[Enter] # finish choosing predictors
7 # mark age as continuous
[Enter] # finish
[Enter] # no fixed-effects interactions
1 # mark word as random intercept
6 # mark speaker as random intercept
[Enter] # finish
[Enter] # no by-word random slopes
[Enter] # no by-speaker random slopes
2 # one-level model
# Does the position variable (e.g. "car" vs. "card") come out as significant?
# The predictor for position in the coda gets a p-value of 0.734.
# Not significant!
# If so, which favors coda /r/ more, words like "car" or words like "card"?
# Since the predictor is not significant, we shouldn't really answer this!
# Which speaker favors coda /r/ the most for their age?
# Speaker G19F2. She produces 90% coda /r/.
# Her age is 19, so her expected log-odds of /r/ is -1.812 + 19 * 0.059 = -0.691.
# Using the logistic formula, this is exp(-0.691) / (1 + exp(-0.691)) = 33.4% /r/.
# Who favors it least for their age?
# Speaker G69F. She produces 0% coda /r/.
# Her age is 69, so her expected log-odds of /r/ is -1.812 + 69 * 0.059 = 2.259
# This corresponds to exp(2.259) / (1 + exp(2.259)) = 90.5% /r/.
# _________________________________________________________________________________
# _________________________________________________________________________________
# SOLUTION 4
# Load the Orkney data file (http://www.danielezrajohnson.com/orkney.csv).
# If you have a local copy and there is no Internet, load the local copy.
# Call this object orkney.
orkney <- read.csv("http://www.danielezrajohnson.com/orkney.csv")
# We want to see if the GOOSE vowels in Orkney are fronting over time.
# That is, we want to see if there is a positive slope of F2.norm ~ year of birth.
# Make a subset of the data as follows.
goose <- c("boot", "food", "fuel", "mule", "mute", "pool", "refute", "stew",
"stewed", "sure", "through", "too", "tune")
orkney.goose <- subset(orkney, word %in% goose)
# Do a mixed-effects regression in R, using orkney.goose as the data frame.
# The response is called F2.norm. This is a continuous, NORMALIZED variable.
# The predictor is called year. This is also a continuous variable.
# We want to include random intercepts for speaker and word.
# This is a linear regression (not logistic); use "family = gaussian" or omit.
model.full <- lmer(F2.norm ~ year + (1|speaker) + (1|word), orkney.goose)
# Is the age (year of birth) effect statistically significant?
# To answer this, we fit another model without the fixed effect of "year".
model.noyear <- lmer(F2.norm ~ (1|speaker) + (1|word), orkney.goose)
# Then compare the two models using anova().
anova(model.full, model.noyear) # p = 0.06801
# What is the size of the age (year of birth) effect?
# To answer this, inspect the "year" coefficient in model.full.
summary(model.full) # returns the entire model summary
# We see a coefficient of 0.0064 for "year" in the "Fixed effects".
# Remember that the units will be Z-scores (normalized units) per year.
fixef(model.full) # returns the fixed effect coefficients from the model
# We have an increase of 0.0064 Z-scores for each year increase in date of birth.
# _________________________________________________________________________________