by Daniel Ezra Johnson

(comments/questions to rbrul.list@gmail.com - in fact, this manual is a bit incomplete and out-of-date so you may want to write instead)

Rbrul is a program for analyzing linguistic data using -- but without quite having to use -- R.

It is inspired by D. Sankoff's original variable rule program VARBRUL and its successor Goldvarb, as well as by Paolillo's R-Varb.

Rbrul's first goal is to do everything Goldvarb can do, but do it better (and faster).

- Both Goldvarb and Rbrul...
- carry out multiple regression (one-level, step-up & step-down) with unbalanced binary data
- perform cross-tabulations

- Rbrul...
- reads common file formats (as well as token files)
- understands plain-text labels for groups and factors (as well as one-character codes)
- combines and excludes factors easily
- reports effects in log-odds (as well as factor weights)
- can adjust the significance threshold for building models
- can handle knock-outs, usually without recoding
- reports R-squared or pseudo-R-squared for certain models
- has no limit to the number of factor groups or factors per group

- Rbrul...
- supports continuous predictors (independent variables) like age or lexical frequency
- supports continuous responses (dependent variables) like vowel formant measurements
- fits mixed models (with random effects)
- to take into account by-speaker and by-item correlations
- to estimate between-group effects (like gender) at the same time as within-group effects (like individual speaker)

Another goal of Rbrul is to encourage its users to become comfortable in the R environment, so that they can modify Rbrul themselves, or work outside it entirely.

Rbrul has a modular architecture. It consists of a number of functions, or sub-programs. The parent function

More experienced R users may wish to update settings like

This manual is intended to get the user started with Rbrul. To emphasize the similarity with Goldvarb, it begins by analyzing the famous department store study from Labov (1966), made available in token file format by John Paolillo. As well as being familiar, this data has an unusual property among sociolinguistic data sets. With many speakers and a small, balanced number of observations per speaker, its structure does not call for the mixed-effects models that more typical sociolinguistic data requires.

The second section of the manual, not yet written, will walk through an analysis using more typical sociolinguistic data, where mixed model analyses are generally appropriate.

If R is not already installed, download it from the R Project website.

The Rbrul program is a plain text file named

You are now ready to run Rbrul with the command:

> rbrul()

The first thing Rbrul does is load several bundles of functions that it needs to support its activities. In R, these are called packages.

If you are running Rbrul for the first time, it may take a moment for it to install the packages.

The most important package it loads is Bates et al.'s lme4, which is used for fitting mixed-effects models.

Note that Rbrul installs the required packages, but it does not update them after that. It is a good idea to periodically run the command

to make sure you have the latest versions. If doing this causes Rbrul to stop working properly, check to see if a new version of Rbrul is available.
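The update command referred to here is presumably R's built-in update.packages(), run from the R prompt:

```r
# Update all installed packages to their current versions;
# ask = FALSE accepts each available update without prompting
update.packages(ask = FALSE)
```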

Note: Rbrul has been tested using

If you are having trouble installing Rbrul, try the following steps:

1. Download and install the latest version of R for Mac or Windows.

2. Under Packages, choose "Package Installer" [Mac] or "Install package(s)" [Windows]. Choose a mirror near you [Windows]. Hold down Command [Mac] or Ctrl [Windows] and select the following four packages: "boot", "Hmisc", "lattice", and "lme4".

3. Run the following four commands:

> library(Hmisc)

> library(lattice)

> library(lme4)

> rbrul()

The Main Menu displays the name of the currently loaded data file, if any, and a summary of its structure. If there is no R object called

MAIN MENU

1-load/save data 2-adjust data

4-crosstabs 5-modeling 6-plotting

8-restore 9-reset 0-exit

The Main Menu options do the following things:

We will select option

If there is a current datafile, Rbrul prompts the user to save it under a new filename. Then it displays a numbered listing of all the files in the current directory. You can ascend through the directory structure by entering 0 for the parent directory.

Rbrul supports several file formats, but does not detect the format automatically. Once you have selected a file, it will ask what separates the columns: commas, tabs, or nothing (i.e. a Goldvarb token file).

If commas (as in Microsoft Excel .csv files) or tabs (as in tab-delimited text files) are chosen, Rbrul expects that the data file has a header row containing the names of the variables, and that the data in the remaining rows matches the header row. Though creating an Excel .csv file is the recommended way to get your data into R, be sure to remove any "blank" rows or columns to the right or at the bottom of the file, as these can cause problems when they are not actually blank. It is also probably a good idea not to have any unusual characters in your data file. If your file contains spaces or strange characters, they may be converted to periods in Rbrul's output. The character
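For reference, a comma-separated file of this kind can also be read into R directly, outside Rbrul; the filename below is only an example:

```r
# Read a .csv file with a header row into a data frame
d <- read.csv("mydata.csv")  # hypothetical filename
str(d)                       # inspect the column names and types that were read
```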

The third option is designed for reading in Goldvarb token files. If this option is chosen, Rbrul will display the first few lines of the token file, and prompt the user to name the columns. It will consider all lines beginning with

To load the department store datafile, assuming we have saved it under its original name of

[20] "Rbrul_15.R"

[21] "ds.tok"

[22] "fig10 script.R"

...

No data loaded.

Type number of directory or data file to load?

(press 0 for parent directory, or Enter to keep current file)

1:

What separates the columns? (c-commas t-tabs tf-token file)

1:

1 2 3 4 5 6 ...

1 S n 4 NA NA ...

1 S n 4 NA NA ...

Enter labels for the columns (press Enter twice when done)

1:

2:

3:

4:

5:

Current data file is: /Users/dej/ds.tok

Current data structure:

r (integer with 2 values): 1 0

store (factor with 3 values): S M K

emphasis (factor with 2 values): n e

word (factor with 2 values): 4 F

We enter

We next enter

All these functions operate in a similar manner. For the department store data, we will use

1:

Factor(s) of store to recode together? (1-K 2-M 3-S Enter-done)

1:

2:

Recode K as what?

1:

Factor(s) of store to recode together? (1-Klein's 2-M 3-S Enter-done)

1:

2:

Recode M as what?

1:

Factor(s) of store to recode together? (1-Klein's 2-Macy's 3-S Enter-done)

1:

2:

Recode S as what?

1:

Factor(s) of store to recode together? (1-Klein's 2-Macy's 3-Saks Enter-done)

1:

Recode to new column? (Yes-type new column name No-press Enter)

1:

We would then repeat this process (not shown) for the variable

r (integer with 2 values): 1 0

store (factor with 3 values): Saks Macy's Klein's

emphasis (factor with 2 values): normal emphatic

word (factor with 2 values): fouRth flooR

As in Goldvarb, it is always a good idea to cross-tabulate your data by some of the factors of interest, before performing any kind of multiple regression. By calling the R function

Since we have not chosen a response variable yet, we will be prompted to do so, and to pick the application value(s) in the case of a factor. In this case, we will choose

The example below cross-tabulates by word in the columns and

1:

Cross-tab: factor for rows? (1-r 2-store 3-emphasis Enter-none)

1:

Cross-tab: factor for 'pages'? (1-r 3-emphasis Enter-none)

1:

Cross-tab: variable for cells? (1-response proportion/mean Enter-counts)

1:

counts

          word
store     fouRth  flooR  total
Klein's      112    104    216
Macy's       175    161    336
Saks          95     82    177
total        382    347    729

...

Cross-tab: variable for cells? (1-response proportion/mean Enter-counts)

1:

Choose response (dependent variable) by number (1-r 2-store 3-emphasis 4-word)

1:

Type of response? (1-continuous Enter-binary)

1:

Choose application value(s) by number? (1-0 2-1)

1:

proportion of r = 1

          word
store     fouRth  flooR  total
Klein's    0.080  0.115  0.097
Macy's     0.263  0.491  0.372
Saks       0.337  0.634  0.475
total      0.228  0.412  0.316
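For comparison, equivalent tables can be produced in plain R; the sketch below assumes the department store data is in a data frame called ds, with r coded 0/1:

```r
# Token counts by store and word
xtabs(~ store + word, data = ds)

# Proportion of r = 1 in each store-by-word cell
with(ds, tapply(r, list(store, word), mean))
```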

The first crosstab allows us to see that this dataset is fairly well-balanced across the cells formed by store and word.

The second crosstab shows the cell proportions, where we observe the well-known regular patterns: increased use of (r-1) going up the social scale from Klein's to Macy's to Saks, and more (r-1) in stressed, word-final flooR than in fouRth.

We also observe evidence of an interaction here, as social stratification seems to be somewhat greater for the word flooR.

Choosing

This function allows us to decide which variables will be considered for inclusion in a regression model, and in what way. First it asks for a single response (dependent) variable, and determines its type (binary or continuous). Then it asks for a list of the predictor variables (explanatory, or independent variables), and asks us to specify their type: whether they are categorical or continuous, and whether they should be treated as fixed or random effects.

If the response is binary, it sets Rbrul up to perform logistic regression; otherwise, linear regression. If random effects are included, then mixed models will be fit using the lme4 package.

In general, if your data includes a large number of potential independent variables, it is not always the best idea to throw them all into a stepwise regression. This is especially true when some of the variables are correlated. In the simplified department store data, there are only three independent variables, and we have every reason to consider them all from the outset. All variables are categorical (factors), and there is no candidate for a random effect, when we remember that each observation at a given level of

We enter

1:

Type of response? (1-continuous Enter-binary)

1:

Choose application value(s) by number? (1-0 2-1)

1:

Choose predictors (independent variables) by number (2-store 3-emphasis 4-word)

1:

2:

3:

Are any predictors continuous? (2-store 3-emphasis 4-word Enter-none)

1:

Any grouping factors (random effects)? (2-store 3-emphasis 4-word Enter-none)

1:

No model loaded.

Current variables are:

response.binary: r

fixed.categorical: store emphasis word

We see at the bottom of this output that we have still not loaded any model (that comes next) but that the variables have been assigned correctly.

We are now ready to run a regression analysis: to build a model that shows the effect of our predictors on our response variable. Note that there are statistical assumptions behind any type of regression, and if these are not met, the results will be more or less invalid. More information about regression assumptions can be found here (for linear regression) and here (for logistic regression).

On a related note, it is recommended to make graphical plots of your data in conjunction with, if not before, conducting regression analysis. This will enable you to judge whether regression is appropriate, and to make better sense of its results.

The four modeling functions,

The

All these model-building variations have their place. The single most useful one is probably the step-down, but the step-up procedure can be easier to follow, so we will now walk through what happens if we select option

STEP 0 - null model

$misc

deviance  df  intercept   mean
 908.963   1     -0.775  0.316

The

The deviance is -2 times the log-likelihood. If two logistic models are nested -- fit to the same data, with one model a subset or 'special case' of the other -- then the difference in deviance can be tested against a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters of the two models. This is called a likelihood-ratio test. For example, if we try collapsing two factors into one within a factor group, this represents one degree of freedom. If the difference in deviance is greater than 3.84, then the simpler model can be rejected with p < 0.05. The R command to obtain the p-value for a deviance difference of 3.84 on one degree of freedom is:

> pchisq(3.84,1,lower.tail=F)

[1] 0.05004352
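The same likelihood-ratio comparison can be carried out directly in R by fitting two nested models with glm() and comparing them with anova(); the sketch below assumes the department store data is in a data frame called ds:

```r
# Likelihood-ratio test for nested logistic regressions
m0 <- glm(r ~ store, data = ds, family = binomial)
m1 <- glm(r ~ store + word, data = ds, family = binomial)
anova(m0, m1, test = "Chisq")  # reports the deviance difference and its p-value
```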

For linear models, F-tests are preferred. In general, the software estimates the significance of a predictor by assessing the improvement it brings to model fit, while deducting points for how much it increases model complexity. We see this in action in Step 1, where each of the three predictors is added to the model in turn.

Trying store...

$store

fixed     logodds  tokens  1/1+0  uncentered weight
Saks        0.849     177  0.475              0.695
Macy's      0.428     336  0.372              0.599
Klein's    -1.277     216  0.097              0.214

$misc

deviance  df  intercept   mean  uncentered input prob
 826.233   3     -0.951  0.316                  0.284

Run 1 (above) with null + store better than null with p = 1.08e-18

Trying emphasis...

$emphasis

fixed     logodds  tokens  1/1+0  uncentered weight
emphatic    0.115     271  0.347              0.536
normal     -0.115     458  0.297              0.479

$misc

deviance  df  intercept   mean  uncentered input prob
 907.011   2     -0.747  0.316                  0.315

Run 2 (above) with null + emphasis better than null with p = 0.162

Trying word...

$word

fixed     logodds  tokens  1/1+0  uncentered weight
flooR       0.433     347  0.412              0.612
fouRth     -0.433     382  0.228              0.398

$misc

deviance  df  intercept   mean  uncentered input prob
 880.183   2     -0.788  0.316                  0.308

Run 3 (above) with null + word better than null with p = 8.11e-08

Adding store...

Here

Statistical note: since this example is logistic regression (with a binary response), the p-values come from a likelihood-ratio chi-squared test. The same test is used as the default for the fixed effects in mixed models, although more accurate results may be obtained with a simulation procedure, which can be selected under

Returning to the output of Step 1, we see that for each run, several columns of numbers are printed in the output:

Under the default setting of zero-sum contrasts, the column of

Another number, reported under

The uncentered input probability has been adjusted up or down to compensate for the adjustments to the uncentered factor weights. All in all, it is recommended to use the centered factor weights, or to look at the log-odds coefficients directly. This is mandatory if continuous predictors are in the model, because no meaningful factor weights can be assigned to them. Note that the magnitude of log-odds coefficients can be compared more fruitfully than can factor weights. The premise of logistic regression is that effects are additive on the log-odds scale, so looking at that scale directly, rather than a translation of it into factor weight probabilities, is generally to be preferred.
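Under zero-sum contrasts, a centered factor weight is simply the inverse logit of the corresponding log-odds coefficient, so the two scales can be converted back and forth with R's plogis() and qlogis():

```r
plogis(0.849)  # log-odds to factor weight: about 0.70
qlogis(0.700)  # factor weight back to log-odds: about 0.85
```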

We noted above that a small number of

Note that multicollinearity is a separate issue from interactions, which arise when the effect of one factor depends on the level of another. Currently, Rbrul does not provide very much relief in either arena.

For models without random effects, Rbrul reports the R-squared for linear models (continuous responses) as

We are now done inspecting the results of the step-up analysis. If you were curious how it turned out,

Run 6 (above) with store + word + emphasis better than store + word with p = 0.0661

BEST STEP-UP MODEL IS WITH store (1.08e-18) + word (8.18e-09) [A]

Current model file is: unsaved model

store (1.08e-18) + word (8.18e-09) [A]

Current variables are:

response.binary: r

fixed.categorical: store emphasis word

MODELING MENU

...

With this option, Rbrul allows you to save models to disk and retrieve them. The function works similarly to Load/Save Data. It first prompts you to save the current model in the current directory using a filename of your choice. Then, regardless of whether you have saved the model, it prompts you to load a saved model from any directory. R modeling objects usually include (invisibly) the data from which the model was fit. This means that in theory, if you load a saved model, you will be able to make plots without having to reload the data separately.

Under this menu option users can adjust several global settings, most of which apply to the modeling functions. The function asks about each setting in turn; the default option can always be selected by pressing Enter.

If

Treatment contrasts appear quite different, although they are really just a different way of conveying the same information about the different effects of factors on a response variable. With treatment contrasts, one level of each factor (one factor in each factor group) is chosen as the baseline. The effects of the other factors are expressed in terms of their difference from the baseline. So if
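For illustration, treatment contrasts with a chosen baseline level can also be set up in plain R with relevel(); the sketch below assumes the department store data in a data frame called ds:

```r
# Make Klein's the baseline (reference) level for store
ds$store <- relevel(ds$store, ref = "Klein's")
m <- glm(r ~ store + word, data = ds, family = binomial)
coef(m)  # storeMacy's and storeSaks are now differences from Klein's
```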

Since going through the entire

...

8-restore 9-reset 0-exit

1:

> decimals <- 5

> rbrul()

...

MAIN MENU

All Rbrul settings and loaded objects should remain intact when exiting and re-entering the program. Depending on the settings of your R application and whether you save your workspace, they may persist even longer than that. If you have adjusted your data and wish to restore the original version, choose

A picture is worth a thousand words. And a picture is usually worth more than a table of numbers, too. Rbrul's graphics capabilities will be the focus of future development. At present, the Plotting Menu has only one option,

Plotting variables can be chosen from

r (factor with 2 values): 1 0

store (factor with 3 values): Saks Macy's Klein's

emphasis (factor with 2 values): normal emphatic

word (factor with 2 values): fouRth flooR

Data variables: 1-r 2-store 3-emphasis 4-word

Current model file is: unsaved model

store (1.08e-18) + word (8.18e-09) [A]

Model variables: 5-r 6-store 7-emphasis 8-word 9-predictions

In this case, the only difference between the

Choose variable for y-axis?

1:

Choose variable for x-axis?

1:

Separate (and color) data according to the values of which variable? (press Enter to skip)

1:

Also show data (in black) averaged over all values of emphasis? (1-yes Enter-no)

1:

Split data into horizontal panels according to which variable? (press Enter to skip)

1:

Split data into vertical panels according to which variable? (press Enter to skip)

1:

Type of points to plot? (raw points not recommended for binary data)

(0-no points 1-raw points Enter-mean points)

1:

Scale points according to the number of observations?

(enter size factor between 0.1 and 10, or 0 to not scale points)

1:

Type of lines to plot (raw lines not recommended for binary data)?

(0-no lines 1-raw lines Enter-mean lines)

1:

Add a reference line? (1-diagonal [y=x] 2-horizontal [y=0] Enter-none)

1:

In the above dialogue,

A plot like this is more informative than a cross-tabulation of the same data. The two main effects of

Choose variable for y-axis?

1:

Choose variable for x-axis?

1:

Separate (and color) data according to the values of which variable? (press Enter to skip)

1:

Also show data (in black) averaged over all values of word? (1-yes Enter-no)

1:

Split data into horizontal panels according to which variable? (press Enter to skip)

1:

Split data into vertical panels according to which variable? (press Enter to skip)

1:

Type of points to plot? (raw points not recommended for binary data)

(0-no points 1-raw points Enter-mean points)

1:

Scale points according to the number of observations?

(enter size factor between 0.1 and 10, or 0 to not scale points)

1:

Type of lines to plot (raw lines not recommended for binary data)?

(0-no lines 1-raw lines Enter-mean lines)

1:

Add a reference line? (1-diagonal [y=x] 2-horizontal [y=0] Enter-none)

1:

Here we plot the model predictions on the x-axis, and the observed means for

Currently Rbrul does not handle interactions in a completely integrated way. However, if a model has been created (or loaded), a new option

...

11-test interactions

1:

Significance of interaction between store and word: p = 0.344

The Goldvarb software will not perform logistic regression if for any factor in any factor group, the data is invariant (i.e. the response proportion is 0% or 100%). I am not sure whether this restriction was originally computational or philosophical. Certainly it makes sense that if there truly are invariant contexts, they should be excluded from an analysis of variation. But when token numbers are small and probabilities are low or high, we will often observe cells that are unlikely to be "structurally" invariant, but do happen to be "knockouts".

Now it is true that it is difficult to accurately assess the response probability in a cell showing 0/10 or 10/10. But Goldvarb's reaction - that it not only cannot estimate the probability of such cells, but also will not estimate the probability of other cells of the same factor and, while it's at it, will not provide estimates for any other factor groups in the analysis - is certainly extreme.

For whatever reasons (meaning that I do not understand the reasons), the modern regression algorithms in

Computationally at least, this is an advance over Goldvarb, and we should not react suspiciously to the suggestion that computational statistics has made progress in the past 35 years or so. However, Guy (1988) argues that it is good practice to exclude invariant (and even near-invariant) contexts, no matter what. Rbrul allows this, but it also usually allows an analysis to be conducted on all of the available data, if desired.

Because Labov's department store data is so familiar, we might overlook that it has a very unusual structure for a sociolinguistic dataset. At each setting of the independent variables (say,

Because there are so many speakers, it is valid to make inferences about the populations (roughly, different social classes) that the speakers at the three stores represent. This would not be true if Labov had repeatedly queried only two or three people in each department store. In that case we would rightly wonder whether we were looking at a social class effect or "only" individual-speaker variation.

The opposite point can be made regarding the words in the study. Here we would be rather imprudent if we assumed that the behavior of

Coming from a kind of experiment, the department store tokens are also fairly well balanced across each independent variable, another feature which distinguishes this data from the majority of interview-based datasets. In real life, certain linguistic environments are less common than others and will always be underrepresented in the data.

In the last ten years, statistical methods have been developed which are appropriate for the kind of data sociolinguists usually work with: unbalanced data sets with internal correlations deriving from the tokens' being grouped by the two crossed factors of speaker and lexical item. Fitting regression models of this type is computationally difficult. Indeed, it has only recently become practical, with the release of Bates et al.'s lme4 package for R.

Psycholinguists have begun to take advantage of mixed models (see the

To understand why this is so, it is helpful to make a three-way division among the predictors which may affect our response. This applies to both continuous and categorical predictors (and to both continuous and binary responses).

- External (between-speaker) predictors. For any given speaker, each of these takes only one value. For
example:
- age
- gender
- department store

- Internal (between-word) predictors. For any given word, each of these takes only one value. For example:
- following consonant (within the word, as in a study of vowel formants)
- lexical frequency
- whether post-vocalic /r/ is word-internal or final

- Other predictors. For these, the data for a single speaker, or a single word, can show more than one value.
For example:
- following environment (across words, as in a study of /t,d/-deletion)
- style
- emphasis

Even when the data is balanced:

- by-speaker variation affects the significance estimates of external predictors
- by-word variation affects the significance estimates of internal predictors

Also, when the data is unbalanced or skewed (as it usually will be):

- by-speaker variation affects the coefficient estimates of external predictors
- by-word variation affects the coefficient estimates of internal predictors

The above points apply to both linear models (continuous response) and logistic models (binary response). In addition, in logistic models:

- by-speaker variation affects the coefficient estimates of internal and other predictors
- by-word variation affects the coefficient estimates of external and other predictors

With a binary response, the situation is different. Even if speakers all have the same size style effect -- for example, reading passage favors post-vocalic /r/ by a certain amount (in log-odds) over conversational speech -- if speakers differ in their overall rhoticity rate, then a fixed-effect model will underestimate the size of the style effect. In Goldvarb terms, if we did a separate run for each speaker's data, the factor weights for style would be the same in each; only the input probabilities would differ. But if we combined the data into a single analysis, the style factor weights would change, becoming less extreme and understating the true effect size.

This is because the log-odds difference between, say, 50% and 68% is roughly the same as that between 90% and 95%. So a speaker who style-shifts from 50% to 68% /r/ is exhibiting the same size effect as another speaker who shifts from 90% to 95%: 0.75 log-odds. But if these speakers' data were combined (assuming balanced token numbers) we would have an overall shift from 70% to 82%, which is only 0.64 log-odds.
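This arithmetic can be checked with qlogis(), which converts proportions to log-odds:

```r
qlogis(0.68) - qlogis(0.50)   # about 0.75 log-odds
qlogis(0.95) - qlogis(0.90)   # also about 0.75 log-odds
qlogis(0.815) - qlogis(0.70)  # the combined shift: only about 0.64 log-odds
```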

Currently Rbrul can fit mixed-effect models with crossed random effects (e.g. for speaker and word), providing far more accurate estimates than Goldvarb, which would typically have to ignore these sources of variation.
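In lme4 notation, crossed random intercepts for speaker and word are specified with two (1 | ...) terms; a sketch with illustrative variable names (d, style, and age are hypothetical):

```r
library(lme4)
# Mixed logistic regression with crossed random intercepts
# for speaker and word
m <- glmer(r ~ style + age + (1 | speaker) + (1 | word),
           data = d, family = binomial)
summary(m)
```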

A more advanced type of mixed-effect modeling involves not only random intercepts, as above, but random slopes. These capture the possibility that speakers (or words) might differ not just in their overall tendency regarding the response (different input probabilities), but also in their constraints regarding the predictors (different "grammars"). Further development of Rbrul will integrate random slopes, which can be thought of as interactions between fixed and random effects. The reader is referred to Pinheiro & Bates (2000) for further information about mixed models.
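In lme4 notation, the difference between random intercepts and random slopes is a change in the random-effect term; again a sketch with illustrative variable names:

```r
library(lme4)
# Random intercept only: speakers differ in overall rate
#   (1 | speaker)
# Random intercept and slope: speakers also differ in their style effect
#   (1 + style | speaker)
m <- glmer(r ~ style + (1 + style | speaker) + (1 | word),
           data = d, family = binomial)
```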

Guy, Gregory R. 1988. Advanced VARBRUL analysis. In Kathleen Ferrara et al. (eds.), Linguistic Change and Contact. Austin: University of Texas, Department of Linguistics.

Jaeger, T. Florian and Laura Staum. 2005.

Labov, William. 1966. The Social Stratification of English in New York City. Washington, DC: Center for Applied Linguistics.

Paolillo, John. 2002. Analyzing Linguistic Variation: Statistical Models and Methods. Stanford: CSLI Publications.

Pierrehumbert, Janet B. 2002. Word-specific phonetics. In Carlos Gussenhoven & Natasha Warner (eds.), Laboratory Phonology 7. Berlin: Mouton de Gruyter.

Pinheiro, José C. and Douglas M. Bates. 2000. Mixed-Effects Models in S and S-PLUS. New York: Springer.

Thank you for trying Rbrul.

Please report any bugs, unusual behavior, questions, suggestions for improvement, and/or topics for discussion to rbrul.list@gmail.com.
