1 Independent-samples experimental design

Researchers interested in the factors that affect memory encoding conducted a study to test the following hypothesis:

People presented as cheaters will attract greater attention, and will therefore be more memorable than people who are presented as trustworthy.

Design and Procedure: One hundred twenty college students were recruited for a study described as being about attractiveness, which concealed the true purpose of the investigation. In the first session, participants rated the attractiveness of people in 10 mock newspaper articles, each of which included a photo and a brief written description. Prior to the start of the session, participants were randomly assigned to one of three conditions (X): 1) cheating description, 2) neutral description, or 3) trustworthy description. A week later, participants returned and were tested on how many of the people they remembered seeing (from a mix of new and previously viewed images). Their accuracy on the memory task was the primary dependent measure (Y).
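
Random assignment of this kind can be sketched in base R. This is an illustration only, not the original researchers' procedure: it assumes equal groups of 40 (the group size reported in the descriptive summary later in this document), and the seed, the `participant` IDs, and the `assignment.df` name are hypothetical.

```r
# Hypothetical sketch of balanced random assignment (not the original study's script)
set.seed(2024)                                   # arbitrary seed so the shuffle is reproducible
conditions <- rep(c("C", "N", "T"), each = 40)   # 40 slots per condition, 120 in total
assignment.df <- data.frame(
  participant = 1:120,                           # hypothetical participant IDs
  Cond.X = sample(conditions))                   # shuffle the condition slots
table(assignment.df$Cond.X)                      # counts per condition
```

Shuffling a fixed set of slots, rather than drawing a condition independently for each participant, guarantees equal group sizes.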




Load and view the data in the console:

memory.df = read.csv("https://andrewebrandt.github.io/object/datasets/memory.csv", header = TRUE)
head(memory.df)
##   Cond.X Correct.Y
## 1      N         7
## 2      N         8
## 3      N         7
## 4      N         8
## 5      N         8
## 6      N         8


This example is based on Mealey, Daood, and Krage (1996).

2 Data wrangling

Provide clear condition labels in memory.df and change Cond.X from character to factor data type. See more on R data types in this tutorial.

library(dplyr)
memory.df <- memory.df %>%                 # overwrite data set
     mutate(Cond.X = case_when(            # change condition labels
            Cond.X == "N" ~ "Neutral",    
            Cond.X == "T" ~ "Trust",
            Cond.X == "C" ~ "Cheat")) %>%
     mutate(Cond.X = as.factor(Cond.X))    # change Cond.X from character to factor
head(memory.df)
##    Cond.X Correct.Y
## 1 Neutral         7
## 2 Neutral         8
## 3 Neutral         7
## 4 Neutral         8
## 5 Neutral         8
## 6 Neutral         8
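
As an alternative sketch, base R's factor() can relabel the conditions and set their order in one call. The small `wrangle.df` data frame below is made up for illustration and is not the study data.

```r
# Base R alternative: relabel and order the levels in a single factor() call
wrangle.df <- data.frame(Cond.X = c("N", "T", "C", "N"),   # made-up condition codes
                         Correct.Y = c(7, 8, 9, 8))         # made-up scores
wrangle.df$Cond.X <- factor(wrangle.df$Cond.X,
                            levels = c("N", "T", "C"),                 # codes in the desired order
                            labels = c("Neutral", "Trust", "Cheat"))   # readable labels
levels(wrangle.df$Cond.X)
```

Setting the level order this way changes the order in which later functions list the groups, so output produced from such a factor would be arranged differently from the alphabetical ordering shown in the rest of this document.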



3 Describe and plot the independent-samples data

Generate a descriptive summary of memory scores in the cheat, neutral, and trustworthy conditions:

library(psych)
Describe.df <- describeBy(
  memory.df[2],               # data frame, Y scores in the second column
  memory.df$Cond.X,           # grouping variable
  mat = TRUE,                 # matrix format
  digits = 2)                 # round values to 2 digits
Describe.df                   # show descriptive
##            item  group1 vars  n mean   sd median trimmed  mad min max range
## Correct.Y1    1   Cheat    1 40 9.10 0.96      9    9.22 1.48   7  10     3
## Correct.Y2    2 Neutral    1 40 7.85 0.83      8    7.81 1.48   6  10     4
## Correct.Y3    3   Trust    1 40 7.92 0.83      8    7.91 1.48   6  10     4
##             skew kurtosis   se
## Correct.Y1 -0.71    -0.61 0.15
## Correct.Y2  0.27    -0.30 0.13
## Correct.Y3  0.13    -0.29 0.13
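
The "se" column is the standard error of the mean, sd / sqrt(n). As a quick check, the sketch below recomputes it from the rounded sd values printed above, so small rounding differences from the raw data are possible.

```r
# Standard error of the mean: se = sd / sqrt(n), using the rounded values printed above
sd.y <- c(Cheat = 0.96, Neutral = 0.83, Trust = 0.83)
n <- 40                                  # participants per condition
se.y <- round(sd.y / sqrt(n), 2)
se.y                                     # compare with the "se" column
```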


Plot the means and SEMs across the cheat, neutral, and trustworthy conditions:

library(ggplot2)
ggplot(Describe.df, aes(x = group1, y = mean)) + # plot X and Y
  geom_col(                                         # bar graph
    width = 0.5,
    color = "black",
    fill = hsv(0.6, 0.4, 0.7)) +
  geom_errorbar(aes(ymin = mean-se, ymax = mean+se),# error bars span mean +/- 1 SEM
                color = "black",                    # error bar color
                width = .1) +                       # width of the error bar caps
  scale_x_discrete(name = "Information Condition",  # reorder conditions
    limits = c("Neutral", "Trust", "Cheat")) +
  scale_y_continuous(name = "Mean Recall Score", limits = c(0, 10)) +
  ggtitle("Recall across information conditions") +
  theme_minimal()



4 Statistical significance test

  1. State the null and alternative hypotheses
    • \(H_0:\mu_{cheat} = \mu_{neutral} = \mu_{trust}\)
    • \(H_A:\) not all of \(\mu_{cheat}\), \(\mu_{neutral}\), and \(\mu_{trust}\) are equal (at least one mean differs)
  2. Define the critical region for the statistical decision
    • \(\alpha\) = .05
  3. Collect sample data and calculate statistics
    • F-statistic
    • Followup tests
  4. Make a statistical decision
    • If \(\text{p-value} \le .05\), reject the null; otherwise, fail to reject the null
    • Remember that p is the probability of obtaining a statistic at least as extreme as the one observed if the null hypothesis is true
  5. Determine which, if any, pairwise differences are significant
    • Use Tukey’s HSD to test all pairwise hypotheses, controlling for familywise error, and construct simultaneous confidence intervals
    • In some cases, Bonferroni, Scheffe, Games-Howell, or other tests may offer statistical advantages over Tukey’s HSD
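
The critical region in step 2 can also be expressed as a critical F value. The sketch below assumes a = 3 conditions and N = 120 participants, as in the design described above; the variable names are illustrative.

```r
# Critical F value for alpha = .05 (upper tail of the F distribution)
alpha <- 0.05
df.between <- 3 - 1          # a - 1, with a = 3 conditions
df.within  <- 120 - 3        # N - a, with N = 120 participants
F.crit <- qf(1 - alpha, df.between, df.within)
F.crit                       # an observed F above this value falls in the critical region
```

Comparing the observed F to this critical value leads to the same decision as comparing the p-value to \(\alpha\).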


4.1 Independent-samples F-statistic

The F-statistic is a test of mean equivalence based on the ratio of treatment effect (\(MS_A\)) to error (\(MS_E\)). In the ANOVA source table printed by R, these values appear in the “Mean Sq” column.


\(\displaystyle F = \frac{\text{Treatment Effect}}{\text{Error}} = \frac{MS_A}{MS_E} = \frac{\text{Mean Sq Cond.X}}{\text{Mean Sq Residual}}\)


Find the F-statistic using the aov() function:

aov.mem <- aov(Correct.Y ~ Cond.X, data = memory.df)  # outcome variable ~ predictor variable
summary(aov.mem)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Cond.X        2  39.32  19.658   25.71 5.57e-10 ***
## Residuals   117  89.48   0.765                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The ANOVA source table shows that \(\text{p} \le .05\) so:

  • Reject the null hypothesis, \(H_0:\mu_{cheat} = \mu_{neutral} = \mu_{trust}\)
  • Accept the alternative hypothesis, \(H_A:\) at least one population mean differs from the others
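
The F value and p-value in the source table can be reproduced by hand from the "Sum Sq" and "Df" columns. This sketch uses the rounded values printed in the table, so the results may differ slightly from the aov() output in later decimals.

```r
# Rebuild the F-statistic from the ANOVA source table (rounded table values)
SS.A <- 39.32; df.A <- 2       # Cond.X row
SS.E <- 89.48; df.E <- 117     # Residuals row
MS.A <- SS.A / df.A            # mean square for the treatment effect
MS.E <- SS.E / df.E            # mean square error
F.value <- MS.A / MS.E         # ratio of treatment effect to error
p.value <- pf(F.value, df.A, df.E, lower.tail = FALSE)   # upper-tail probability
round(c(MS.A = MS.A, MS.E = MS.E, F = F.value), 3)
p.value
```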


4.2 Measure effect size with Cohen’s f

Cohen’s f is a standardized measure of effect size, where \(a\) is the number of conditions and \(n\) is the number of participants per condition:


\(\displaystyle \hat{f} = \sqrt{\frac{(a - 1)(MS_A - MS_E)}{anMS_E}}\)


Find Cohen’s f:

# find Cohen's f
library(effectsize)
effectsize(aov.mem, type = "f")
## For one-way between subjects designs, partial eta squared is equivalent to eta squared.
## Returning eta squared.
## # Effect Size for ANOVA
## 
## Parameter | Cohen's f |      95% CI
## -----------------------------------
## Cond.X    |      0.66 | [0.49, Inf]
## 
## - One-sided CIs: upper bound fixed at [Inf].
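
Plugging the source-table values into the formula above gives a slightly smaller value than the 0.66 reported by effectsize(). The sketch below, which uses rounded table values, suggests why: the reported value matches \(\sqrt{\eta^2 / (1 - \eta^2)}\), consistent with the "Returning eta squared" message, whereas the formula above subtracts \(MS_E\) in the numerator to adjust for sampling error.

```r
# Two ways to compute Cohen's f from the ANOVA source table (rounded values)
a <- 3; n <- 40                      # conditions and participants per condition
MS.A <- 19.658; MS.E <- 0.765        # "Mean Sq" column
SS.A <- 39.32;  SS.E <- 89.48        # "Sum Sq" column

f.hat <- sqrt((a - 1) * (MS.A - MS.E) / (a * n * MS.E))   # formula shown above
eta2  <- SS.A / (SS.A + SS.E)                              # eta squared
f.eta <- sqrt(eta2 / (1 - eta2))                           # f based on eta squared

round(c(f.hat = f.hat, f.eta = f.eta), 2)
```

Either value indicates a large effect by Cohen's conventional benchmarks; what matters when reporting is to state which estimator was used.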


4.3 Assumptions of the F-statistic

Researchers may want to check for extreme outliers and for violations of the three assumptions about the populations from which the sample data were collected.

  1. Independence: Scores in the population are independently distributed
  2. Normality: Populations have normally distributed scores
  3. Homogeneity of variance: Populations have equal variance


Check for outliers (“is.outlier”) and extreme outliers (“is.extreme”):

library(rstatix)
memory.df %>%
  group_by(Cond.X) %>%
  identify_outliers(Correct.Y)
## # A tibble: 2 × 4
##   Cond.X  Correct.Y is.outlier is.extreme
##   <fct>       <int> <lgl>      <lgl>     
## 1 Neutral        10 TRUE       FALSE     
## 2 Trust          10 TRUE       FALSE
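
identify_outliers() applies the boxplot rule: scores beyond 1.5 × IQR from the quartiles are flagged as outliers, and scores beyond 3 × IQR as extreme. A minimal base R sketch of that rule follows, using a made-up `scores` vector rather than the study data.

```r
# Boxplot rule for outliers, applied to made-up scores
scores <- c(7, 7, 7, 8, 8, 8, 8, 8, 8, 10)
q1  <- unname(quantile(scores, 0.25))      # first quartile
q3  <- unname(quantile(scores, 0.75))      # third quartile
iqr <- q3 - q1                             # interquartile range

is.outlier <- scores < q1 - 1.5 * iqr | scores > q3 + 1.5 * iqr
is.extreme <- scores < q1 - 3 * iqr   | scores > q3 + 3 * iqr
data.frame(scores, is.outlier, is.extreme)
```

In this toy vector the score of 10 is flagged as an outlier but not as an extreme point, which mirrors the pattern in the output above.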


Use the Shapiro-Wilk test to assess normality (p < .001 suggests the sample has been drawn from a non-normal population distribution):

library(dplyr)
memory.df %>%
   group_by(Cond.X) %>%
   summarise(statistic = shapiro.test(Correct.Y)$statistic,
             p.value = shapiro.test(Correct.Y)$p.value)
## # A tibble: 3 × 3
##   Cond.X  statistic   p.value
##   <fct>       <dbl>     <dbl>
## 1 Cheat       0.817 0.0000154
## 2 Neutral     0.877 0.000424 
## 3 Trust       0.881 0.000552


Use Levene’s test to assess homogeneity of variance (p < .05 suggests heterogeneous variance):

library(car)
leveneTest(Correct.Y ~ Cond.X, data = memory.df)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  1.0257 0.3618
##       117
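
With center = median, Levene's test amounts to a one-way ANOVA on the absolute deviations of each score from its group median. The sketch below shows that idea on simulated data; `lev.df`, the group labels, and the seed are made up for illustration.

```r
# Levene's test (median-centered) as an ANOVA on absolute deviations
set.seed(1)                                          # arbitrary seed for the made-up data
lev.df <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 10)),  # three made-up groups
  y = rnorm(30, mean = 8, sd = 1))                   # simulated scores
group.median <- ave(lev.df$y, lev.df$group, FUN = median)   # each row's group median
lev.df$abs.dev <- abs(lev.df$y - group.median)               # absolute deviations
levene.aov <- aov(abs.dev ~ group, data = lev.df)
summary(levene.aov)                                  # F and p for equality of spread
```

Running car::leveneTest(y ~ group, data = lev.df) on the same data should give the same F value and p-value.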


4.4 Tukey’s HSD test

“Tukey’s honestly significant difference (HSD) test controls the familywise error for the set of all possible pairwise comparisons. It is a simultaneous method for testing hypotheses or constructing confidence intervals, meaning that a single critical value is used to evaluate all contrasts in a set” (Myers, Well, & Lorch, 2010, p. 252).

With some “follow-up tests” like Tukey’s HSD, a significant F-statistic is not a prerequisite (Myers, Well, & Lorch, 2010). For other comparison options, see the DescTools package.

Use tukey_hsd() from the rstatix package to run Tukey’s HSD test on recall scores (Correct.Y) between all possible pairs of conditions (Cond.X):

# Tukey's tests on all pairwise comparisons
library(rstatix)
q.mem <- memory.df %>% 
  tukey_hsd(Correct.Y ~ Cond.X)
q.mem
## # A tibble: 3 × 9
##   term   group1  group2  null.value estimate conf.low conf.high    p.adj p.adj…¹
## * <chr>  <chr>   <chr>        <dbl>    <dbl>    <dbl>     <dbl>    <dbl> <chr>  
## 1 Cond.X Cheat   Neutral          0  -1.25     -1.71     -0.786  1.04e-8 ****   
## 2 Cond.X Cheat   Trust            0  -1.18     -1.64     -0.711  6.49e-8 ****   
## 3 Cond.X Neutral Trust            0   0.0750   -0.389     0.539  9.22e-1 ns     
## # … with abbreviated variable name ¹​p.adj.signif


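The output table shows the mean difference (“estimate”), confidence interval (“conf.low”, “conf.high”), and adjusted p-value (“p.adj”) for each pairwise comparison. Each estimate is the group2 mean minus the group1 mean, so the negative estimates in the first two rows mean that the cheat condition had the higher mean. Taken together, these tests indicate that recall scores were significantly higher in the cheat condition than in the neutral or trust conditions: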

  • Reject the null hypotheses, \(H_0:\mu_{cheat} = \mu_{neutral}\) and \(H_0:\mu_{cheat} = \mu_{trust}\)
  • Accept the alternative hypotheses, \(H_A:\mu_{cheat} \ne \mu_{neutral}\) and \(H_A:\mu_{cheat} \ne \mu_{trust}\)
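
The "single critical value" behind these comparisons can be sketched with the studentized range functions in base R. This assumes equal group sizes (n = 40) and uses the rounded \(MS_E\) from the ANOVA source table, so the results are approximate.

```r
# Tukey's HSD by hand for the Cheat vs. Neutral comparison (rounded inputs)
a <- 3; n <- 40; df.E <- 117; MS.E <- 0.765
se.q <- sqrt(MS.E / n)                               # standard error for the studentized range
hsd  <- qtukey(0.95, nmeans = a, df = df.E) * se.q   # minimum significant difference

diff.cn <- -1.25                                     # "estimate": Neutral minus Cheat
ci      <- diff.cn + c(-1, 1) * hsd                  # 95% simultaneous confidence interval
p.adj   <- ptukey(abs(diff.cn) / se.q, nmeans = a, df = df.E,
                  lower.tail = FALSE)                # adjusted p-value

round(hsd, 3)
round(ci, 2)
p.adj
```

Any pair of means that differs by more than `hsd` is significant at the familywise .05 level, and the same half-width is added to and subtracted from each estimate to form the intervals in the output above.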


4.5 Effect size

Cohen’s d for any pairwise comparison can be calculated by dividing the mean difference by the pooled standard deviation from the ANOVA model:


\(\displaystyle d = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{MS_E}}\)


Using the following two-step method, calculate and save the d values in a new data frame, cohen.df, then use kable() to show the results in a table.

# New data frame with comparison names in the first column and d values in the second column
# d = absolute mean difference (Tukey "estimate") / sqrt(MS_E), with MS_E = 0.765 from the ANOVA table
library(knitr)                                  # provides kable()
cohen.df <- data.frame(
     compare = factor(c("Cheat - Neutral", "Cheat - Trust", "Neutral - Trust")), 
     d = c(1.25/(sqrt(0.765)), 1.18/(sqrt(0.765)), 0.075/(sqrt(0.765))))

# show comparison names and d values in a table
cohen.kbl <- kable(cohen.df,                    # makes a simple table
      format = "html",                          # .html format
      table.attr = "style='width:40%;'",        # css control for table width
      digits = 2,                               # round values to 2 decimal places
      caption = "Effect size (d)",              # table title
      col.names = c("Comparison","Cohen's d"))  # set column names
cohen.kbl                                       # show the table

Effect size (d)

Comparison         Cohen’s d
Cheat - Neutral    1.43
Cheat - Trust      1.35
Neutral - Trust    0.09



5 Reporting the results

In an APA style paper, the results could be reported as follows:

An independent-samples F-test showed that recall scores differed significantly across the cheat, neutral, and trust information conditions, F(2, 117) = 25.71, p < .001, \(\hat{f}\) = 0.66. Tukey’s tests showed that recall scores were significantly higher in the cheat condition (M = 9.10, SD = 0.96) than in the neutral condition (M = 7.85, SD = 0.83; p < .001, d = 1.43) and the trust condition (M = 7.92, SD = 0.83; p < .001, d = 1.35).
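
When writing up results like these, p-values are conventionally reported without a leading zero and as "p < .001" when very small. The helper below is a hypothetical convenience function (`apa_p` is not part of any package used above) that sketches that formatting.

```r
# Hypothetical helper: format a p-value in APA style
apa_p <- function(p) {
  if (p < .001) {
    "p < .001"                                        # very small values are reported as a bound
  } else {
    paste0("p = ", sub("^0", "", sprintf("%.3f", p))) # three decimals, no leading zero
  }
}
apa_p(5.57e-10)   # ANOVA p-value from the source table
apa_p(0.3618)     # Levene's test p-value
```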