Data Analysis and Statistics Minicourse
From Statscourse
Please bring a laptop!
Preassignments: Install R; read chapter 1 of Baayen book and do exercises; look over the paper by Kazanina (just the methods section). See below for links.
Important: You should install the "languageR" package before the first day (and you need it before doing the exercises). This is not completely trivial, and there are some complications on Macs running the latest version of R (i.e., if you downloaded it in the last six months), so you should make sure to take care of this in advance. See below for information.
Required reading: Baayen 2008, Analyzing Linguistic Data (ALD). PDF is available here, but there are some discrepancies between the printed version and this PDF.
Recommended texts: Rice 2007, Mathematical Statistics and Data Analysis. Kutner et al. 2004, Applied Linear Regression Models. Gelman and Hill 2006, Data Analysis Using Regression and Multilevel/Hierarchical Models.
Room: SBS S328
Times: Friday, Oct. 1st: 4:00-7:00pm; Saturday, Oct. 2nd, 11:00am-2:00pm
Contents |
Goals
- To become familiar with the types of questions that are asked in the analysis of experimental data and with basic principles of experimental design
- To understand why statistical analysis is necessary and the basic principles underlying it
- To understand two of the most common statistical tests, t-tests and the analysis of variance (ANOVA), and understand when they are appropriate and when they are inappropriate
- To develop a basic familiarity with the R environment and learn a few important functions
Using R
Download and install R by going to the main R website. Before class, read and do the exercises in Chapter 1 of ALD.
Important: download and install the "languageR" package. Instructions are in Chapter one of Baayen's book, replicated here:
install.packages(c("rpart", "chron", "Hmisc", "Design", "Matrix",
"lme4", "coda", "e1071", "zipfR", "ape", "languageR"),
repos = "http://cran.r-project.org")
More important: You may not be able to do this following the instructions in Baayen's book (that is, the above command). If you are running OS X with the latest version of R, one of the important dependencies that needs to be installed before languageR will load, lme4 (for fitting mixed effects models), has no working binary package.
You need to first download this substitute package http://www.ling.umd.edu/~ewand/lme4_0.999375-35.tgz
Then install it manually by doing the following from inside R:
1. Go to the Packages & Data menu, and select "Package Installer".
2. From the dropdown box in the upper left corner, select "Local Binary Package".
3. Click the bottom right "Install..." button, and find the folder where you downloaded it, if it's not already showing. Select the file "lme4_0.999375-35.tgz" and click the "Open" button.
Caution: This file is an archive. It is a single compressed file with the extension .tgz. R can read packages with the extension .tgz, but do not extract this file to a .tar file. R will not be ablte to install from it. If your web browser automatically extracts it for you, then you need to further extract it to a folder. This will give you a folder called "lme4". Then, rather than selecting "Local Binary Package", you must select "Local Package Directory". Rather than selecting the ".tgz" file, you must select the "lme4" folder by double-clicking it, then selecting "Open."
Getting a feel for experimental data
Reading: Kazanina et al. 2008 (read the Methods section)
Notes and cautions about using continuous statistics for ordinal data like 7-point scales
Preparing your data for analysis
An inappropriately simple Python script for converting DMDX output to CSV files: dmdx_to_csv.py
Identity priming experiment data file: miniexpt_resps.dat
Identity priming experiment items file: miniexpt_items.dat
Reading data files into R:
resps <- read.table("miniexpt_resps.dat", sep=",", header=T)
Getting help on the read.table command:
?read.table
Printing the first ten rows of a data frame:
resps[1:10,]
Printing the first ten elements in the fourth column of a data frame:
resps[1:10,4]
Printing the first ten elements of the column called RT:
resps$RT[1:10]
Printing the elements of the column called RT in the rows where Trial equals 1 and Accuracy equals 0:
resps[resps$Trial==1 & resps$Accuracy==0,"RT"]
Detailed explanation of why the command above works: R Language Definition: Indexing by vectors
Reading an items file:
items <- read.table("miniexpt_items.dat",sep=",",header=T)
Merging two data frames:
data <- merge(items,resps,all=T)
Printing the first five rows of a data frame:
data[1:5,]
Getting help on the merge command:
?merge
Finding the first ten elements of the Target column that are not NAs:
data[!is.na(data$Target),"Target"][1:10]
Try to find the number of characters of the elements of Target (incorrect):
nchar(data[!is.na(data$Target),"Target"][1:10])
Determine whether Target is a character vector:
is.character(data$Target)
Find the storage mode of Target:
mode(data$Target)
Determine whether Target is a numeric vector:
is.numeric(data$Target)
Determine whether Target is a factor vector:
is.factor(data$Target)
Save the length of the elements of Target in a new column:
data$Length <- nchar(as.character(data$Target))
Remove the misleading Length values (NA elements of Target):
data$Length[is.na(data$Target)] <- NA
Make a scatterplot of the data:
plot(data$RT)
Make a scatterplot by subject:
plot(RT ~ Subject, data=data)
Compute by-subject percent accuracy using mean of by-trial accuracy:
aggregate(data$Accuracy, list(Subject=data$Subject), FUN=mean)
Plot by-subject percent accuracies:
plot(aggregate(data$Accuracy, list(Subject=data$Subject), FUN=mean))
Save the original data:
data.raw <- data
Remove Subject 3:
data <- data[data$Subject != 3,]
Remove RTs above 3000ms:
data <- data[data$RT <= 3000,]
Remove practice trials:
data <- data[!is.na(data$Condition),]
Plot by subject with a custom x-axis:
plot(RT ~ Subject, data=data, xaxt="n")
axis(1, at=2:30, cex.axis=0.7)
Obtain the mean for subject 14:
mean(data[data$Subject == 14,"RT"])
Obtain the mean for subject 22:
mean(data[data$Subject == 22,"RT"])
Obtain the standard deviation for subject 14:
sd(data[data$Subject == 14,"RT"])
Obtain the standard deviation for subject 22:
sd(data[data$Subject == 22,"RT"])
Statistical tests cheat sheet
Run a two sample t-test on the two priming conditions in the sample experiment:
pr <- data[data$Condition=="ident","RT"]
np <- data[data$Condition=="dsimp","RT"]
t.test(pr, np)
Welch Two Sample t-test data: pr and np t = -2.8834, df = 2081.836, p-value = 0.003975 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -65.77045 -12.52085 sample estimates: mean of x mean of y 717.5907 756.7363
Run an ANOVA on a 2x2 factorial design:
library(languageR)
summary(aov(RTlexdec ~ WordCategory*Voice, data=english))
Df Sum Sq Mean Sq F value Pr(>F)
WordCategory 1 0.173 0.173131 7.0438 0.007982 **
Voice 1 0.090 0.089802 3.6536 0.056011 .
WordCategory:Voice 1 0.013 0.013222 0.5379 0.463330
Residuals 4564 112.179 0.024579
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Run a linear regression on another data set from the languageR package:
summary(lm(LogRT ~ LogWordFreq*LogAffixFreq, data=danish))
Call:
lm(formula = LogRT ~ LogWordFreq * LogAffixFreq, data = danish)
Residuals:
Min 1Q Median 3Q Max
-0.65123 -0.12809 -0.01966 0.10139 0.99649
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.148690 0.074041 96.550 < 2e-16 ***
LogWordFreq -0.069570 0.015553 -4.473 7.96e-06 ***
LogAffixFreq -0.027247 0.006328 -4.306 1.71e-05 ***
LogWordFreq:LogAffixFreq 0.004816 0.001273 3.783 0.000158 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2011 on 3322 degrees of freedom
Multiple R-squared: 0.02082, Adjusted R-squared: 0.01994
F-statistic: 23.55 on 3 and 3322 DF, p-value: 4.469e-15
Run a simple binomial test on count data from the languageR package:
binom.test(xtabs(~RealizationOfRec, data=verbs))
Exact binomial test
data: xtabs(~RealizationOfRec, data = verbs)
number of successes = 555, number of trials = 903, p-value = 5.782e-12
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5819887 0.6464968
sample estimates:
probability of success
0.614618
Run a logistic regression on the same data set, including three explanatory variables (two categorical, and one continuous):
summary(glm(RealizationOfRec ~ AnimacyOfRec+AnimacyOfTheme+LengthOfTheme, data=verbs, family=binomial))
Call:
glm(formula = RealizationOfRec ~ AnimacyOfRec + AnimacyOfTheme +
LengthOfTheme, family = binomial, data = verbs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8283 -0.8955 -0.5788 1.0744 2.0273
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.01976 1.14353 0.017 0.9862
AnimacyOfRecinanimate 0.49402 0.25443 1.942 0.0522 .
AnimacyOfThemeinanimate 0.94931 1.13574 0.836 0.4032
LengthOfTheme -1.04129 0.10050 -10.361 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1203.9 on 902 degrees of freedom
Residual deviance: 1059.4 on 899 degrees of freedom
AIC: 1067.4
Number of Fisher Scoring iterations: 4
Fundamentals of statistics
Construct an arbitrary distribution:
p <- c(0.1, 0.3, 0.2, 0.1, 0.3)
Plot the probability mass function:
plot(p, ylim=c(0,1), type="h")
Take a single sample from this distribution:
sample(1:5, 1, prob=p)
Take 100 samples from this distribution:
sample(1:5, 100, replace=TRUE, prob=p)
Use the xtabs command to get counts:
s <- sample(1:5, 100, replace=TRUE, prob=p)
counts <- xtabs(~s)
Find relative frequency by dividing by N:
f <- counts/100
Overplot the relative frequency:
points(f, type="p", col="red", pch="x")
Sample 10 points from a binomial distribution with a somewhat tricky to understand n and p:
s <- rbinom(10, 32, 0.2)
Find the sample mean:
mean(s)
Check that it's right:
sum(s)/10
Find the median:
median(s)
Find the variance:
var(s)
Sample 10 points from a binomial distribution with an easy to understand n and p:
s <- rbinom(10, 1, 0.5)
Find the sample mean:
mean(s)
Print the number of successes (heads):
sum(s)
Find out how probable that was by using a different binomial distribution, and observing that they're closely related:
dbinom(sum(s), 10, 0.5)
Take a sample from a standard normal distribution:
s <- rnorm(10000, 0, 1)
Get a summary of the data, including the median:
summary(s)
Plot a histogram of the data:
hist(s, xlim=c(-5,5), freq=F)
Overplot the normal density this sample was drawn from:
curve(dnorm(x,0,1), from=-5, to=5, add=T)
Find values of the cumulative distribution function:
pnorm(0, 0, 1)
pnorm(-1, 0, 1)
Find the median by using the inverse CDF:
qnorm(0.5, 0, 1)
Compare the sample median:
median(s)
Find the density of the sample median, if we really believe that it's drawn from a perfect normal distribution:
dnorm(median(s), 0, 1)
Find the probability of getting a point no larger than that:
pnorm(median(s), 0, 1)
Testing out the central limit theorem (advanced):
sums <- replicate(1000,sum(runif(50)))
hist(sums)
qqnorm(sums)
t-test
Make sure the languageR package is attached:
library(languageR)
Run a t-test on the nasal duration data against the hypothesis that the location is 0.053:
t.test(durationsOnt$DurationPrefixNasal, mu=0.053)
One Sample t-test data: durationsOnt$DurationPrefixNasal t = -1.5038, df = 101, p-value = 0.1358 alternative hypothesis: true mean is not equal to 0.053 95 percent confidence interval: 0.04561370 0.05401646 sample estimates: mean of x 0.04981508
Plot the relevant t distribution:
curve(dt(x, 101), from=-4, to=4)
Find the cumulative probability at or below the t statistic:
pt(-1.5038, 101)
Multiply this by two to obtain the p-value:
2*pt(-1.5038, 101)
Show a quantile-quantile plot:
qqnorm(durationsOnt$DurationPrefixNasal)
Run a Shapiro-Wilk test against the null hypothesis of a normal distribution:
shapiro.test(durationsOnt$DurationPrefixNasal)
Run a one-sample Wilcoxon test against the null hypothesis of a symmetric distribution centered at 0.053:
wilcox.test(durationsOnt$DurationPrefixNasal, mu=0.053)
Run a two-sample t test for a difference between plant and animal word frequencies:
simplex <- ratings[ratings$Complex == "simplex",]
freqAnimals <- simplex[simplex$Class == "animal",]$Frequency
freqPlants <- simplex[simplex$Class == "plant",]$Frequency
t.test(freqAnimals, freqPlants)
Welch Two Sample t-test data: freqAnimals and freqPlants t = 2.674, df = 57.545, p-value = 0.009739 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.1931830 1.3443152 sample estimates: mean of x mean of y 5.208494 4.439745
Run a paired t-test on the rating data:
t.test(ratings$meanWeightRating, ratings$meanSizeRating, paired=T)
Paired t-test
data: ratings$meanWeightRating and ratings$meanSizeRating
t = -36.0408, df = 80, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3566737 -0.3193460
sample estimates:
mean of the differences
-0.3380099
Demonstrate that the paired t-test is the same as running a single sample t-test on the differences:
t.test(ratings$meanWeightRating - ratings$meanSizeRating)
One Sample t-test data: ratings$meanWeightRating - ratings$meanSizeRating t = -36.0408, df = 80, p-value < 2.2e-16 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: -0.3566737 -0.3193460 sample estimates: mean of x -0.3380099
ANOVA
Find out about the warpbreaks data:
?warpbreaks
Summarize the warpbreaks data:
summary(warpbreaks)
breaks wool tension
Min. :10.00 A:27 L:18
1st Qu.:18.25 B:27 M:18
Median :26.00 H:18
Mean :28.15
3rd Qu.:34.00
Max. :70.00
Run a one-way ANOVA on the warpbreaks data:
aov(breaks ~ tension, data=warpbreaks)
Call:
aov(formula = breaks ~ tension, data = warpbreaks)
Terms:
tension Residuals
Sum of Squares 2034.259 7198.556
Deg. of Freedom 2 51
Residual standard error: 11.88058
Estimated effects may be unbalanced
Show the results of the significance test:
summary(aov(breaks ~ tension, data=warpbreaks))
Df Sum Sq Mean Sq F value Pr(>F)
tension 2 2034.3 1017.13 7.2061 0.001753 **
Residuals 51 7198.6 141.15
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Check the assumptions of normality:
shapiro.test(warpbreaks[warpbreaks$tension=="H",]$breaks)
shapiro.test(warpbreaks[warpbreaks$tension=="M",]$breaks)
shapiro.test(warpbreaks[warpbreaks$tension=="L",]$breaks)
Run a two way ANOVA:
summary(aov(len~supp*doseLevel,data=ToothGrowth))
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.35 205.35 15.572 0.0002312 ***
doseLevel 2 2426.43 1213.22 92.000 < 2.2e-16 ***
supp:doseLevel 2 108.32 54.16 4.107 0.0218603 *
Residuals 54 712.11 13.19
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’