# INTRODUCTION TO REGRESSION IN R AND IN RBRUL
# Daniel Ezra Johnson
# SSS 3
# 6 July 2011
# THIS FILE CAN BE DOWNLOADED FROM: http://www.danielezrajohnson.com/regression.R
# INTRODUCTION (SAME AS GRAPHING WORKSHOP)
# You should save this file on your computer.
# Then make sure you re-open it in R.
# This file is an example of a (heavily annotated) R script.
# Instead of manually typing in R commands, you can highlight lines in a script,
# and press Command-Enter (Mac) or Ctrl-R (Windows) to execute those commands.
# Many people put scripts on one side of the screen, the R console on the other.
# This makes it easy to work in both windows.
# All the lines beginning with # are comments.
# This means that they have no effect when a section of text is run in R.
# In fact, R will ignore whatever follows a # even within a line.
# I will use in-line comments for less important notes.
# Try typing "2 + 2" into the R Console, followed by Return/Enter.
# Then highlight the "2 + 2" in this script and press Command-Enter or Ctrl-R.
# The same thing should happen in both cases.
2 + 2
# Did you get this output?
# [1] 4 # the [1] can be ignored here
# What about the following:
2 + 2 * 3
# Did you get 4 * 3 = 12?
# No - the result is 8.
# As you can see from this result, R follows the standard "order of operations".
# Unless you put parentheses, multiplication will be carried out before addition.
(2 + 2) * 3
# Now you should get the answer 12.
# We often have to use parentheses in R.
# WHAT IS R AND WHAT IS RBRUL?
# R is a programming language or "environment" for statistics and data analysis.
# With R, we can perform practically any mathematical or statistical calculations.
# We can use hundreds of built-in functions and also define our own functions.
# We can obtain statistics and regression coefficients and p-values easily.
# In this workshop, we will explore regression using R itself, and also with Rbrul.
# Rbrul is a front-end interface to some of the regression functions in R.
# Rbrul is easier and more intuitive - but R is more powerful and flexible.
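# As a tiny illustration of defining our own function (this function is just an
# example and is not used elsewhere in the workshop):
add.three <- function(n) n + 3 # a function that adds 3 to its argument
add.three(2) # returns 5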
# INSTALLING PACKAGES AND RBRUL
# Execute the following commands to install necessary packages.
# If you have already installed some of these packages, it does no harm to install them again.
install.packages("boot")
install.packages("ggplot2")
install.packages("Hmisc")
install.packages("lattice")
install.packages("lme4")
# Once the packages are installed, you load them with the library() command.
library(boot)
library(ggplot2)
library(Hmisc)
library(lattice)
library(lme4)
# We install Rbrul with the command source(), which loads an external program.
# If you have an Internet connection, run the following command:
source("http://www.danielezrajohnson.com/Rbrul.R")
# If this returns an error message, try it again.
# If you don't have access to the Internet right now,
# and you downloaded the file Rbrul.R earlier, substitute the local path:
# source("~/Desktop/Rbrul.R")
# THREE GOALS OF REGRESSION
# Traditionally, there are three reasons why we build regression models:
# One is to quantitatively ESTIMATE the effects of predictors on a response.*
# *Predictor means independent variable; response means dependent variable.
# A second reason is to use observed data to PREDICT new or unseen data.
# A third reason is to TEST HYPOTHESES (for example, significance testing).
# This workshop will focus on estimation and simple hypothesis testing.
# We will ask how big an effect a predictor has on a response (estimation).
# And we'll ask if the effect could be the result of chance (hypothesis testing).
# WHAT ARE MIXED MODELS?
# Mixed models or mixed-effects models are an advanced type of regression model.
# They're useful if data has a grouped structure (e.g. tokens grouped by speaker).
# I believe that many sociolinguistic situations require mixed-effects modeling.
# This makes it difficult to illustrate "ordinary" regression with real data!
# ORDINARY (FIXED-EFFECTS) REGRESSION IN R
# Recall from the graphing workshop that we defined a "toy" data frame.
# Here we create a similar data set with a single command:
toy2 <- data.frame(
x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
y = c(1, 3, 2, 5, 4, 7, 4, 6, 5, 7))
# Note that we could have defined x as "0:9" instead of "c(0, 1, ..., 9)".
# We can plot these points with qplot().
qplot(x = x, y = y, data = toy2)
# Here, our response (y) is numeric, and we will perform a linear regression.
# Linear regression finds the straight line that is the best fit for these points.
# We minimize the "sum of squares" - the squared distances from points to line.
# We can add this best-fit line to the plot using the geom_line argument.
qplot(x = x, y = y, data = toy2) + geom_line(stat = "smooth", method = "lm")
# Any straight line can be described by two numbers: its slope and its intercept.
# The slope of the line is the amount it goes up as you move one unit to the right.
# (If the line goes down as you move to the right, the slope will be negative.)
# The intercept of the line is its y-coordinate where x = 0 (e.g. at the y-axis).
# Just by eyeballing the line on the plot, we would say that its intercept is 2.
# And we would eyeball that its slope is a little bit more than 0.5.
# Together, these two parameters make up a simple rough model of the data.
# R fits linear regression models - finds the best line - with the lm() command.
# We feed a formula of the form "y ~ x" to lm(), and tell it where the data is.
lm(y ~ x, data = toy2)
# The relevant part of this output is the "Coefficients", which are as follows:
# (Intercept) x
# 2.0000 0.5333
# We see that we were correct in our eyeballing. The intercept is exactly 2.
# The slope is 0.53. The trend is: y goes up 0.53 for every increase of 1 in x.
# This is estimation of the parameters of the data, obtained by linear regression.
# A hypothesis we might want to test is the question of statistical significance.
# We want to know if the slope, 0.53, is significantly different from zero.
# More precisely, we want to know the chance that a slope that large would occur,
# if we just randomly threw 10 points against the wall.
# Above, we just printed the lm() model directly to the screen.
# We can also save the model and give it a name.
m1 <- lm(y ~ x, toy2)
# Then we can print it out and work with it as needed. Test the following commands:
m1 # prints the parameters of the model
coef(m1) # prints just the coefficients of the model
coef(m1)[1] # prints the intercept
coef(m1)[2] # prints the first slope (here, the only slope)
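# We can also inspect the fitted values and residuals of the model with built-in
# functions (an extra illustration, not required for what follows):
fitted(m1) # the y-value the line predicts for each x
resid(m1) # the vertical distance from each point to the line
sum(resid(m1)^2) # the "sum of squares" that lm() minimized - about 12.93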
# To test the null hypothesis that the true slope is zero, we fit a null model.
# This is a model with an intercept (the 1), but no slope. The slope is fixed at 0.
m0 <- lm(y ~ 1, toy2)
m0
# _________________________________________________________________________________
# EXERCISE R1
# In the model m0, what is the value of the intercept?
# _________________________________________________________________________________
# We need to compare two models: m1 and m0.
# m1 has an intercept and a slope, while m0 just has an intercept.
# As the more complex model, m1 will always fit the data better.
# However, it is more complicated, and in general, we prefer simpler models.
# We do a statistical test to decide if the better fit is worth the complexity.
# In the case of ordinary linear regression, this is done via an F-test.
# More generally, two models are compared with a likelihood-ratio chi-squared test.
# (Note: this is also how Goldvarb compares models in its stepwise routines.)
# We compare the models, perform an F-test and obtain a p-value with anova().
# (Despite the name, this command does not necessarily carry out "ANOVA".)
anova(m0, m1) # note that the order of terms does not matter
# In the anova(m0, m1) output, focus on where it says "Pr(>F)":
# Pr(>F)
# 0.005163
# This is our p-value. It is well below the usual 0.05 threshold.
# Therefore, we tend to believe that the observed trend is real and not just noise.
# This toy example had a single predictor, x.
# Naturally, regression becomes more useful when there is more than one predictor.
# Then, we estimate several "slopes" simultaneously.
# This is called "multiple regression" (not "multivariate"!).
# We might have a formula "y ~ x1 + x2", or "y ~ x1 * x2", where the "*" also
# includes the interaction between x1 and x2.
# _________________________________________________________________________________
# EXERCISE R2
# If the y values were chosen at random, any trend would have to be due to chance.
# Repeat the following commands several times, and keep an eye on the p-values.
x <- 1:10 # x = 1, 2, ..., 10
y <- sample(1:10) # y = 1, 2, ..., 10 in random order
m1 <- lm(y ~ x)
m0 <- lm(y ~ 1)
anova(m0, m1)
# How often is the p-value below 0.05?
# _________________________________________________________________________________
# ORDINARY (FIXED-EFFECTS) REGRESSION IN RBRUL
# R works by entering commands (or sets of commands) at the > prompt.
# Rbrul is based on text menus and interacts with the user with questions.
# You can cut and paste multiple lines into Rbrul at once, but it's not a great idea.
# We will attempt to carry out the same regression as we did above, in Rbrul.
# First, note that Rbrul always uses the object "datafile" to hold its data.
# We can load any file with Rbrul, but since we already have our "toy" data frame:
datafile <- toy2
# We start Rbrul with the command rbrul(). We see the data structure and main menu.
rbrul()
## Current data structure:
## x (numeric with 10 values): 0 1 2 3 4 ...
## y (numeric with 7 values): 1 3 2 5 4 ...
## Total tokens: 10
## MAIN MENU
## 1-load/save data 2-adjust data
## 4-crosstabs 5-modeling 6-plotting
## 8-restore data 9-reset 0-exit
# To fit a regression model, we type "5" for "modeling", and press Return/Enter.
5
# We then see the modeling menu:
## MODELING MENU
## 1-choose variables 2-one-level 3-step-up 4-step-down 5-step-up/step-down
## 6-plotting 8-settings 9-main menu 0-exit
## 10-chi-square test
# The first thing we have to do is choose our variables.
# For example, we have to tell Rbrul which variable is the response.
# We enter "1" to choose variables.
1
## Choose response (dependent variable) by number (1-x 2-y).
# To choose y as the response variable, we type "2" followed by Return/Enter.
2
## Type of response? (1-continuous Enter-binary)
# Here we type "1" plus Return/Enter, because y is a numeric/continuous variable.
1
## Choose predictors (independent variables) by number (1-x).
# Here we only have one choice: we type "1" (plus Return/Enter) to select x.
1
## Are any predictors continuous? (1-x Enter-none)
# Indeed, our predictor x is a numeric/continuous predictor, so we enter "1".
1
# Since there are no other predictors, Rbrul returns us to the modeling menu.
# There is only one predictor being considered for our model: x.
# Therefore, it doesn't matter if we choose "step-up", "step-down", or "one-level".
# Stepwise modeling can be dangerous, so we will get in the "one-level" habit,
# entering "2" for a one-level analysis.
2
## ONE-LEVEL ANALYSIS WITH x (0.00516)
##
## $x
## continuous units
## +1 0.533
##
## $misc
## deviance df intercept grand mean R2
## 12.933 2 2 4.4 0.645
# The above output is Rbrul's summary of the model with response y and predictor x.
# We see the p-value for x at the top of the summary: 0.00516.
# We see the slope, 0.533, under the heading "units" in the "x" section.
# We see the intercept, 2, under the heading "intercept" in the "misc" section.
# We also see other information, like the overall mean and the R-squared value.
# MIXED-EFFECTS REGRESSION IN RBRUL
# We will now run a regression on a real linguistic data set.
# The variable is post-vocalic or coda /r/, and there are 1597 tokens.
# 40 speakers in Gretna read a word list that contained 40 or so coda /r/s.
# The variable was coded narrowly, but we focus on zero vs. any other realization.
# This is a binary dependent variable, so we will be running a logistic regression.
# Predictors are modeled as affecting the LOG-ODDS of the response in a linear way.
# Log-odds are a way of taking the 0-to-1 probability scale, and stretching it out,
# so that it covers all numbers from negative infinity to positive infinity.
# A log-odds of 0 corresponds to 0.5 probability (or to no change in probability).
# With coda /r/ as our variable, a log-odds of 0 corresponds to 50% /r/.
# On the log-odds scale, the difference between 50% and 70% /r/ is 0.85 log-odds.
# This is similar to the difference between 70% and 85%, 85% and 93%, or 93% and 97%.
# In other words, in logistic regression, a given percentage difference near 100%
# (or near 0%) "counts for more" than the same percentage difference near 50%.
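# As an aside, R's built-in qlogis() and plogis() functions convert between
# probabilities and log-odds, so we can check these numbers for ourselves:
qlogis(0.7) - qlogis(0.5) # about 0.85 log-odds between 50% and 70%
qlogis(0.85) - qlogis(0.7) # about 0.89 - a similar distance
plogis(0.85) # back from 0.85 log-odds to a probability of about 0.70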
# To load the Gretna datafile, first enter "9" to go to the "Main Menu".
# If you have saved the Gretna file on your computer:
1 # load/save data
[Enter] # do not save current data
c # commas separate the columns
[choose file] # navigate to the file and choose it with the wizard
# If you have not saved the Gretna file on your computer, and have Internet access:
0 # to exit Rbrul
datafile <- read.csv("http://www.danielezrajohnson.com/gretna.csv")
rbrul() # to re-enter Rbrul
# We should now be at the main menu in Rbrul.
# The "Current data structure" shows:
## word (factor with 58 values): air barred bars beard car ...
## set (factor with 12 values): SQUARE START NEAR NORTH FORCE ...
## position (factor with 2 values): coda-final coda-internal
## r4 (factor with 4 values): approximant zero tap trill
## r2 (integer with 2 values): 1 0
## speaker (factor with 40 values): G15F G16F G17F G17F2 G18F ...
## age (integer with 24 values): 15 16 17 18 19 ...
## gender (factor with 2 values): female male
## Total tokens: 1597
# Our response is going to be r2, which is coded 0 for zero /r/ and 1 otherwise.
# We will do a regression to estimate the size of three "fixed effects":
# set - how does the preceding vowel affect coda /r/?
# age - how does speaker age affect coda /r/?
# gender - how does speaker gender affect coda /r/?
# We may not be specifically interested in individual words and speakers,
# but we must take them into account when the data is grouped by word and speaker.
# That is, the data has multiple observations of each word and of each speaker.
# Word is nested within set (each word is always in the same lexical set).
# Similarly, speaker is nested within age and gender.
# This matters because if there is a lot of by-speaker or by-word variation,
# we might see what looks like large predictor effects, but they are spurious.
# We control for this by using two "random effects" for speaker and word.
# Random effects usually have many levels, and nest within predictors of interest.
# From the Rbrul main menu, enter the following commands (manually):
5 # modeling menu
1 # choose variables
5 # choose r2 as the response
[Enter] # r2 is a binary variable
2 # we will model what favors coda /r/, not what disfavors it
7 # choose predictor age
8 # choose predictor gender
2 # choose predictor set
1 # choose predictor word
6 # choose predictor speaker
[Enter] # finish choosing predictors
# Rbrul should now be asking "Are any predictors continuous?" Age is continuous.
7 # mark age as continuous
[Enter] # finish (the other predictors are categorical)
# Rbrul now asks if there are any grouping factors. We select speaker and word.
# (These are also called random intercepts, because each speaker and each word
# will receive their own pseudo-coefficient!)
1 # mark word as random intercept
6 # mark speaker as random intercept
[Enter] # finish (the other predictors are fixed effects)
# Rbrul now asks about interactions. We will skip this by pressing Enter/Return.
[Enter] # no interactions in this model
# Rbrul now takes us back to the modeling menu.
# We will choose "2" for a one-level model.
# A one-level model is like a step-down model except it only does the first step.
2 # choose one-level model
[Enter]
# After a moment, a long output results, including estimates for word and speaker.
# Scroll back to the top, where we see coefficients for age, gender, and set.
## ONE-LEVEL ANALYSIS WITH word (random) and speaker (random) and set (7.1e-05) +
## age (0.000392) + gender (0.736)
##
## $gender
## factor logodds tokens 1/1+0 centered factor weight
## male 0.131 795 0.630 0.533
## female -0.131 802 0.587 0.467
##
## $set
## factor logodds tokens 1/1+0 centered factor weight
## FIR 1.111 119 0.748 0.752
## OUR 0.861 7 0.857 0.703
## NURSE 0.598 225 0.689 0.645
## SQUARE 0.413 73 0.671 0.602
## NORTH 0.239 125 0.672 0.559
## HEARD 0.000 152 0.625 0.5
## FORCE -0.236 180 0.617 0.441
## NEAR -0.245 86 0.616 0.439
## START -0.440 155 0.606 0.392
## CURE -0.482 112 0.562 0.382
## FIRE -0.599 12 0.667 0.355
## LETTER -1.219 351 0.470 0.228
##
## $age
## continuous logodds
## +1 0.059
# First look at the top line. It says we have word and speaker as random effects.
# These random effect terms are not tested for significance.
# Then it gives (lexical) set, with a p-value of 7.1e-05, which means 0.000071.
# Lexical set, then, is a highly significant predictor of coda /r/.
# Age is given a p-value of 0.000392, which is also highly significant.
# Gender is not significant: p = 0.736.
# Turning to the estimates, we see that the first column of numbers is in log-odds.
# Log-odds is the standard unit for reporting the results of logistic regressions.
# We just remember that a given change in log-odds has its biggest effect near 50%,
# and that 0% and 100% are infinitely far away in log-odds terms.
# The predictor age is given a coefficient of 0.059.
# This means that for every year older a speaker is, the model says their log-odds
# of producing coda /r/ increases by 0.059 - so the odds of /r/ go up by 6.1%.
exp(0.059) # 1.061
# The predictor set has 12 coefficients, one for each lexical set.
# However, it is important to understand that these coefficients add up to zero.
# Each one represents the difference from the mean for that lexical set.
# We see that the FIR set most favors coda /r/, with a coefficient of 1.111.
# We see that the LETTER set least favors coda /r/, with a coefficient of -1.219.
# For factors, Rbrul also gives the number of tokens, the raw proportion of /r/
# ("1/1+0"), and the VARBRUL factor weight for each level.
# Note how positive log-odds correspond to factor weights over 0.5,
# negative log-odds correspond to factor weights under 0.5,
# and zero log-odds (for example, for the HEARD lexical set) corresponds to 0.5.
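# In fact, a factor weight is just the log-odds run through the inverse logit,
# which we can verify with plogis() (an extra check, not part of the Rbrul session):
plogis(0.131) # about 0.533, the centered factor weight for "male"
plogis(-0.131) # about 0.467, the factor weight for "female"
plogis(0) # exactly 0.5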
# At the bottom of the output, we see the results for the random effects.
# First we will discuss word. Here are the top five and bottom five words.
# $word
# random logodds tokens 1/1+0 centered factor weight std dev
# for 0.381 39 0.769 0.594 0.392
# Kerr 0.375 39 0.744 0.593 0.392
# fur 0.373 39 0.795 0.592 0.392
# moored 0.324 32 0.656 0.581 0.392
# car 0.322 40 0.675 0.58 0.392
# ... ... ... ... ... ...
# Kirsty -0.347 38 0.658 0.414 0.392
# daughter -0.384 33 0.333 0.405 0.392
# turtle -0.473 40 0.575 0.384 0.392
# Kirriemuir -0.484 40 0.425 0.381 0.392
# person -0.641 40 0.450 0.345 0.392
# Remember that we have a predictor for lexical set. After taking set into account,
# the word effects show which individual words favored or disfavored coda /r/.
# They show the deviations of each word from the prediction for their set.
# The word that behaved the most differently from its set's average was "person".
# Under set, we see that overall, the HEARD lexical set contained 62.5% coda /r/.
# But under word, we note that "person" contained /r/ only 45% of the time.
# So "person" receives a negative random effect coefficient: -0.641.
# In the other direction, "for" had 77% /r/ while the whole NORTH set had 67%.
# So "for" receives a positive random effect coefficient: 0.381.
# The right-hand column says that based on this data, the population of words
# is estimated to vary with a standard deviation of 0.392.
# Under speaker, we see a much higher number in this column: 2.339.
# This tells us that speakers are much more variable, or idiosyncratic, than words.
# In fact, this is a very large speaker standard deviation.
# It suggests that we may need to find other between-speaker predictors,
# besides age, that condition the use of coda /r/.
# _________________________________________________________________________________
# EXERCISE R3
# Using the same gretna data set, fit a model with:
# r2 as the binary response (1 as the application value)
# speaker and word as random effects
# age, set, and position as fixed effects
# Does the position variable (e.g. "car" vs. "card") come out as significant?
# If so, which favors coda /r/ more, words like "car" or words like "card"?
# Which speaker favors coda /r/ the most for their age?
# Who favors it least for their age?
# _________________________________________________________________________________
# MIXED-EFFECTS REGRESSION IN R
# Let's enter "0" to exit Rbrul and return to R.
# We'll then perform the same regressions outside Rbrul that we did inside Rbrul.
# First, let's make a copy of Rbrul's datafile, and call it "gretna".
gretna <- datafile
# If we never used Rbrul, we might load the gretna data with a command like this:
# datafile <- read.csv("http://www.danielezrajohnson.com/gretna.csv")
# Next, we want to perform a regression with r2 as the response.
# We want to estimate the fixed effects of age, gender, and lexical set.
# We want to include random intercepts for speaker and word.
# We'll use the glmer() function, from the lme4 package, to do this regression.
# Observe how we build the following regression formula:
glmer(r2 ~ age + gender + set + (1|speaker) + (1|word), gretna, family = binomial)
# The response goes at the left, followed by "~".
# The predictors come after "~". We list the variable names, connected by "+".
# A random intercept for z is specified by "(1|z)".
# We then enter the name of the data frame in which these variables are defined.
# You can enter "data = gretna" but just putting "gretna" is sufficient.
# And "family = binomial" sets glmer() up for a logistic, not linear, regression.
# (For a linear regression, we can omit the default "family = gaussian" argument.)
# We'll focus on the fixed effects section of the output.
# Fixed effects:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.42953 0.92566 -2.625 0.008674 **
# age 0.05863 0.01541 3.805 0.000142 ***
# gendermale 0.26253 0.77097 0.341 0.733468
# setFIR 1.59319 0.52620 3.028 0.002464 **
# setFIRE -0.11684 0.82755 -0.141 0.887726
# setFORCE 0.24663 0.43345 0.569 0.569357
# setHEARD 0.48257 0.46041 1.048 0.294577
# setLETTER -0.73712 0.39686 -1.857 0.063254 .
# setNEAR 0.23708 0.51627 0.459 0.646079
# setNORTH 0.72072 0.48125 1.498 0.134242
# setNURSE 1.08000 0.43945 2.458 0.013986 *
# setOUR 1.34269 1.27786 1.051 0.293381
# setSQUARE 0.89527 0.57639 1.553 0.120365
# setSTART 0.04202 0.44436 0.095 0.924656
# We see some similarities between this and the Rbrul output.
# The estimate for age is the same, and has the same interpretation.
# The estimate for gender is twice as large as it was in the Rbrul model.
# This is because R uses "treatment contrasts" (by default).
# Rbrul (by default) uses "sum contrasts".
# So while Rbrul tells us males' and females' difference from the mean,
# R tells us males' difference from females.
# ("female" is the default baseline group because it comes first alphabetically.)
# We see the same thing at work when we look at the lexical sets.
# FIR still has the highest coefficient (1.59) and LETTER the lowest (-0.74).
# However, in the R model, these coefficients are all with respect to CURE,
# which comes first in alphabetical order.
# In situations like this, where there is no logical baseline level, we may prefer
# sum contrasts. To switch, try entering the following and fitting the model again:
options(contrasts = c("contr.sum", "contr.poly"))
glmer(r2 ~ age + gender + set + (1|speaker) + (1|word), gretna, family = binomial)
# However, in this case you have to figure out which numbered level is which.
# And you have to fill in the coefficient for the last level of each factor,
# using the "zero-sum principle".
# To return to fitting models using treatment contrasts, we would type:
options(contrasts = c("contr.treatment", "contr.poly"))
# In the glmer() output, we see some p-values. These are called Wald tests.
# They are useful because there is one for each level of a factor.
# But they are less reliable than the overall p-values we get from anova().
# Also, hypothesis testing with anova() is not affected by the choice of contrasts.
m.full <- glmer(r2 ~ age + gender + set + (1|speaker) + (1|word),
gretna, family = binomial)
m.noage <- glmer(r2 ~ gender + set + (1|speaker) + (1|word),
gretna, family = binomial)
m.nogender <- glmer(r2 ~ age + set + (1|speaker) + (1|word),
gretna, family = binomial)
m.noset <- glmer(r2 ~ age + gender + (1|speaker) + (1|word),
gretna, family = binomial)
# Above, we fit four models. One has all three predictors (the full model).
# Each of the other three is lacking one of the terms that we wish to test.
anova(m.full, m.noage) # this tests the significance of age: p = 0.0003917
anova(m.full, m.nogender) # this tests the significance of gender: p = 0.7357
anova(m.full, m.noset) # this tests the significance of set: p = 7.1e-05
# We obtain the same p-values as we did previously in Rbrul.
# This is no coincidence. Behind the scenes, Rbrul runs commands like these.
# Note that glmer() does not give the details of the random effects.
# It only gives the overall standard deviation for speakers and for words:
m.full
# ...
# Random effects:
# Groups Name Variance Std.Dev.
# word (Intercept) 0.15398 0.3924
# speaker (Intercept) 5.46910 2.3386
# ...
# These same standard deviations formed the rightmost column of the Rbrul output.
# If we also want to see the individual random effects, we use ranef().
ranef(m.full)
# Now we see a log-odds pseudo*-coefficient for each individual word and speaker.
# (The true random effect parameters are the std. dev.'s for word and speaker.)
# $word
# (Intercept)
# 0.153561149
# -0.345508931
# 0.038428662
# ...
# $speaker
# (Intercept)
# -1.39248543
# 1.98544866
# -3.30211402
# ...
# If the names of the levels (i.e. of the individual words/speakers) are missing,
# as they are here, you need to use the row.names() function to extract them.
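# For example, to line up speakers with their random intercepts, we might do
# something like the following (a sketch; the column binding is ours, not Rbrul's):
re <- ranef(m.full)
data.frame(speaker = row.names(re$speaker), intercept = re$speaker[, 1])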
# Things like this are why a user-friendlier interface like Rbrul may be useful.
# _________________________________________________________________________________
# EXERCISE R4
# Load the Orkney data file (http://www.danielezrajohnson.com/orkney.csv).
# If you have a local copy and there is no Internet, load the local copy.
# Call this object orkney.
# We want to see if the GOOSE vowels in Orkney are fronting over time.
# That is, we want to see if there is a positive slope of F2.norm ~ year of birth.
# Make a subset of the data as follows.
goose <- c("boot", "food", "fuel", "mule", "mute", "pool", "refute", "stew",
"stewed", "sure", "through", "too", "tune")
orkney.goose <- subset(orkney, word %in% goose)
# Do a mixed-effects regression in R, using orkney.goose as the data frame.
# The response is called F2.norm. This is a continuous variable.
# The predictor is called year. This is also a continuous variable.
# We want to include random intercepts for speaker and word.
# This is a linear regression, not logistic, so don't use "family = binomial".
# Is the age (year of birth) effect statistically significant?
# To answer this, fit another model without the fixed effect of "year".
# Then compare the two models using anova().
# What is the size of the age (year of birth) effect?
# To answer this, inspect the "year" coefficient in the model with "year".
# Remember that the units will be Z-scores per year.
# _________________________________________________________________________________
# This is the end of the regression workshop. Thank you for participating!
# If you have any questions, feel free to email me at rbrul.list@gmail.com.
# SOLUTIONS
# _________________________________________________________________________________
# SOLUTION R1
# In the model m0, what is the value of the intercept?
coef(m0)[1] # the intercept is 4.4
# _________________________________________________________________________________
# _________________________________________________________________________________
# SOLUTION R2
# This solution contains some more advanced R code.
# (You could also just highlight the Exercise R2 commands and run them repeatedly
# by hand. In one set of 20 manual runs, the p-value was below 0.05 exactly once.)
p <- vector() # set up a vector to hold the p-values
for (i in 1:1000) # set up to loop 1000 times
{
x <- 1:10 # x = 1, 2, ..., 10
y <- sample(1:10) # y = 1, 2, ..., 10 in random order
m1 <- lm(y ~ x)
m0 <- lm(y ~ 1)
p[i] <- anova(m0, m1)[6][2,1] # captures the p-value from the anova() object
}
# How often is the p-value below 0.05?
# It should be below 0.05, on average, 5% of the time.
mean(p < 0.05) # 0.049
# _________________________________________________________________________________
# _________________________________________________________________________________
# SOLUTION R3
# Using the same gretna data set, fit a model with:
# r2 as the binary response (1 as the application value)
# age, set, and position as fixed effects
# speaker and word as random effects
1 # choose variables
[Enter] # keep r2 as the response
7 # choose age as a predictor
2 # choose set as a predictor
3 # choose position as a predictor
1 # choose word as a predictor
6 # choose speaker as a predictor
[Enter] # finish choosing predictors
7 # mark age as continuous
[Enter] # finish
1 # mark word as random intercept
6 # mark speaker as random intercept
[Enter] # finish
[Enter] # no interaction terms in the model
2 # one-level model
# Does the position variable (e.g. "car" vs. "card") come out as significant?
# The predictor for position in the coda gets a p-value of 0.734.
# Not significant!
# If so, which favors coda /r/ more, words like "car" or words like "card"?
# Since the predictor is not significant, we shouldn't really answer this.
#
# Which speaker favors coda /r/ the most for their age?
# Speaker G19F2 favors coda /r/ the most for her age. She produces 90% coda /r/.
#
# Who favors it least for their age?
# Speaker G69F favors coda /r/ the least for her age. She produces 0% coda /r/.
# _________________________________________________________________________________
# _________________________________________________________________________________
# SOLUTION R4
# Load the Orkney data file (http://www.danielezrajohnson.com/orkney.csv).
# If you have a local copy and there is no Internet, load the local copy.
# Call this object orkney.
orkney <- read.csv("http://www.danielezrajohnson.com/orkney.csv")
# We want to see if the GOOSE vowels in Orkney are fronting over time.
# That is, we want to see if there is a positive slope of F2.norm ~ year of birth.
# Make a subset of the data as follows.
goose <- c("boot", "food", "fuel", "mule", "mute", "pool", "refute", "stew",
"stewed", "sure", "through", "too", "tune")
orkney.goose <- subset(orkney, word %in% goose)
# Do a mixed-effects regression in R, using orkney.goose as the data frame.
# The response is called F2.norm. This is a continuous variable.
# The predictor is called year. This is also a continuous variable.
# We want to include random intercepts for speaker and word.
# This is a linear regression, not logistic, so don't use "family = binomial".
model.full <- lmer(F2.norm ~ year + (1|speaker) + (1|word), orkney.goose) # lmer() fits linear mixed models
# Is the age (year of birth) effect statistically significant?
# To answer this, fit another model without the fixed effect of "year".
model.noyear <- lmer(F2.norm ~ (1|speaker) + (1|word), orkney.goose)
# Then compare the two models using anova().
anova(model.full, model.noyear) # p = 0.06912
# What is the size of the age (year of birth) effect?
# To answer this, inspect the "year" coefficient in model.full.
# Remember that the units will be Z-scores per year.
fixef(model.full) # returns the fixed effect coefficients from the model
# We have an increase of 0.0064 Z-scores for each year increase in date of birth.
# _________________________________________________________________________________