1 Normal distributions

The simulations in this tutorial start with the set.seed() function, which sets the seed of R’s random number generator. Doing this ensures that you get the exact same distribution of pseudo-random scores each time you call a random process, like rnorm(), dnorm(), or sample(). If you wish to generate a different distribution of scores, you can change the seed number or omit the set.seed() command.

1.1 Simulate a normal distribution

The scores obtained from any intelligence test are standardized so we know that in the population \(\mu\) = 100 and \(\sigma\) = 15. Simulate a normal population distribution of N = 100,000 IQ scores with a mean = 100 and standard deviation = 15.

set.seed(3)                                       # set seed
Y <- round(rnorm(100000, mean = 100, sd = 15),0)  # generate whole random numbers; save scores in variable Y
IQ.df <- as.data.frame(Y)                         # save scores in new data frame
head(IQ.df)                                       # show the top of the data frame

##     Y
## 1  86
## 2  96
## 3 104
## 4  83
## 5 103
## 6 100

Create a descriptive statistics summary table for the population IQ scores using the describe() function in the psych package.

library(psych)
describe(IQ.df, skew = FALSE)

##    vars     n   mean    sd min max range   se
## X1    1 1e+05 100.01 15.06  28 166   138 0.05

1.2 Plot the frequency distribution

Plot the population IQ scores in a frequency distribution graph (histogram).

library(ggplot2)
ggplot(IQ.df, aes(x = Y)) +  # data frame, x = scores to plot
  geom_histogram(
    bins = 200,             # bins on X-Axis
    color = "darkblue",     # line color
    fill = "white") +       # fill color
  labs(x = "IQ score",      # X-axis label
       y = "Frequency")     # Y-axis label

Appendix A shows how to simulate normal distributions for multiple groups with different means and SDs, e.g., female and male IQ scores.

2 Null hypothesis significance testing

Research summary: In a hypothetical research study, 36 normal human adult participants volunteered to receive once-a-week brain stimulation therapy that had been shown to safely increase cognitive functioning in several non-human animal species, including primates. After 12 therapy sessions, all participants completed a standardized IQ test administered by a licensed psychometrician.

2.1 Purpose

Null hypothesis significance testing (NHST) is a framework, or set of rules, for using sample data to make an inference about unknown population data.

Inference: Drawing a conclusion or making a logical judgment on the basis of evidence and prior conclusions rather than on the basis of direct observation.

Although we can accurately simulate population scores for many psychological variables of interest, like IQ, in most cases we are not able to directly observe all the actual scores in a population. Instead, psychological researchers sample scores from the population of interest and apply the NHST rules to make an inference about the unknown population scores.

2.2 NHST framework

State the null hypothesis: People in the population who receive 12 weeks of brain stimulation therapy will have a mean IQ = 100 (i.e., the therapy will not affect IQ). This idea can be expressed in two ways:
- \(H_0:\mu_{therapy} = 100\)
- \(H_0:\mu_{NoTherapy} - \mu_{Therapy} = 0\)
Define the critical region for the statistical decision: \(\alpha\) = .05
Collect sample data and calculate statistics
Make a statistical decision: Reject or fail to reject the null hypothesis

2.3 Simulate and plot sample data

Instead of starting with a hypothetical data set, let’s randomly select a sample of n = 36 IQ scores from the N = 100,000 untreated population we created above (IQ.df). Doing so assumes there really is no effect of the therapy, but it’s also a way to demonstrate sampling error, which is a discrepancy between a sample and the population it was drawn from.

set.seed(3)                                            # set seed
Ysample <- sample(IQ.df$Y, size = 36, replace = FALSE) # randomly select 36 scores from IQ.df 
IQsample.df <- as.data.frame(Ysample)                  # save the 36 scores in a new data frame
head(IQsample.df)                                      # show the top of the data frame

##   Ysample
## 1      86
## 2     102
## 3     119
## 4     120
## 5     114
## 6      91

Create a descriptive statistics summary table for the IQ scores using the describe() function in the psych package. Note how the IQ scores compare to the population IQ mean = 100 and SD = 15.

library(psych)
describe(IQsample.df)

##    vars  n   mean   sd median trimmed   mad min max range skew kurtosis   se
## X1    1 36 101.86 16.6    101   101.7 20.02  72 141    69 0.14    -0.81 2.77

Plot the sample IQ scores in a frequency distribution.

library(ggplot2)
ggplot(IQsample.df, aes(x = Ysample)) + 
  geom_histogram(bins = 10, color = "darkblue", fill = "white") + 
  labs(x = "IQ", y = "Frequency")

2.4 Calculate the t-statistic and make a statistical decision

Use the t.test() function to run the one-sample t-test, which also produces a p-value and confidence intervals.

testIQ <- t.test(IQsample.df$Ysample, mu = 100)  # one-sample t-test, mu = population mean
testIQ                                           # show the results

## 
##  One Sample t-test
## 
## data:  IQsample.df$Ysample
## t = 0.67263, df = 35, p-value = 0.5056
## alternative hypothesis: true mean is not equal to 100
## 95 percent confidence interval:
##   96.24398 107.47824
## sample estimates:
## mean of x 
##  101.8611

Statistical decision rule for the t-statistic:

\(\text{If } p \le \text{.05, reject the null hypothesis, otherwise, fail to reject the null}\)

For t(35) = 0.67, the p-value = .5056, which is greater than \(\alpha\) = .05, so we fail to reject the null hypothesis. In other words, the sample scores did not differ significantly from the untreated population.

2.5 Reporting the results in APA-style

The following statement shows how the results may be reported in an APA-style report.

Among a sample of 36 normal adult participants, 12 weeks of brain stimulation therapy showed no significant effect on IQ (M = 101.86, SD = 16.6) compared to the general population, t(35) = 0.67, p = .51, 95% CI [96.24, 107.48].

3 Understanding t and the statistical decision

If we selected all possible samples with n = 36, the mean IQ from each of those samples would form the sampling distribution of the mean (see the Monte Carlo simulation in Appendix B). The mean of this sampling distribution will equal the population mean and its standard deviation is the standard error of the mean (SEM). The SEM indicates the average difference between the population mean and a sample mean when n = 36.

\(\displaystyle \sigma_\bar{Y} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{36}} = 2.5\)

Appendix B shows a Monte Carlo simulation that can be used to test how changes in sample size the number of samples selected affects the standard error of the mean.

The t-statistic is a ratio between an obtained mean difference and an estimate of the SEM based on the sample scores, in other words, it indicates how far a sample mean falls from the population mean in terms of the estimated SEM (\(s_{\bar{Y}}\)). For example, a t-statistic = +1 indicates that the sample mean falls exactly one SEM above the population mean.

\(\displaystyle t = \frac{\bar{Y}-\mu}{s_{\bar{Y}}} \qquad s_{\bar{Y}} = \frac{s}{\sqrt{n}}\)

Entering the sample standard deviation = 16.6 and sample size = 36 into the estimated SEM (\(s_{\bar{Y}}\)) formula shows:

SEM <- round(16.6/sqrt(36),3)
SEM

## [1] 2.767

Entering the mean difference = 1.8611 and estimated SEM = 2.767 into the t formula shows:

t <- round((101.8611 - 100)/SEM, 3) 
t

## [1] 0.673

\(\displaystyle t = \frac{101.8611 - 100}{2.767} = \frac{1.8611}{2.767} = 0.673\)

So, the t calculation shows that the sample mean = 101.8611 IQ points is approximately 2/3s of one standard error above the population mean = 100.

For a t-test with 35 degrees of freedom, the mean difference must exceed 2.03 standard errors to fall in the 2.5% critical regions below or above the population mean. Because t = 0.673 falls well short of the positive critical region (red), we fail to reject the null hypothesis: The therapy has no effect on IQ

4 Appendix A: Simulate normal distributions by group

There is a long standing claim that male and female IQ scores have the same mean but differ in variability. Simulate a distribution of IQ scores for males and females, N = 10,000 each, that is consistent with this claim.

set.seed(123)
IQg.df <- data.frame(
  gender = factor(rep(c("Female", "Male"), each = 10000)),   # creates a categorical variable with labels
  Y = round(c(rnorm(10000, mean = 100, sd = 13), rnorm(10000, mean = 100, sd = 17)),0))
head(IQg.df)

##   gender   Y
## 1 Female  93
## 2 Female  97
## 3 Female 120
## 4 Female 101
## 5 Female 102
## 6 Female 122

Create a descriptive statistics summary table by category using the describeBy() function in the psych package to compare the male and female IQ distributions.

library(psych)                                            # load psych package
summary.df <- describeBy(Y ~ gender,                      # quantitative variable ~ categorical variable
    data = IQg.df, mat = TRUE, skew = FALSE, digits = 2)  # data frame, matrix format, omit skew and kurtosis, round
summary.df                                                # show the results

##     item group1 vars     n  mean    sd min max range   se
## X11    1 Female    1 10000 99.97 12.98  50 150   100 0.13
## X12    2   Male    1 10000 99.84 17.03  41 164   123 0.17

The male and female distributions can be plotted separately, but it’s easier to visually compare two distributions when they appear in the same plot along with a visual indication of their means and SDs. To make this more visually appealing, we will use density plots instead of histograms. In very simplistic terms, a density plot is a smoothed estimate of a frequency distribution.

ggplot(IQg.df) + 
  geom_density(aes(x = Y, color = gender, fill = gender), alpha = 0.4) +                            # plot density 
  scale_color_manual(values=c(hsv(0.5, 0.5, 0.5), hsv(0.7, 0.7, 0.7))) +                            # line color
  scale_fill_manual(values=c(hsv(0.5, 0.5, 0.5), hsv(0.7, 0.7, 0.7))) +                             # fill color
  geom_vline(data = summary.df, aes(xintercept = mean, color = group1), linetype = "dashed") +      # mean lines
  geom_vline(data = summary.df, aes(xintercept = mean + sd, color = group1), linetype = "dotted") + # SD + line
  geom_vline(data = summary.df, aes(xintercept = mean - sd, color = group1), linetype = "dotted") + # SD - line
  labs(x = "IQ score", y = "Density")                                                               # axis labels

5 Appendix B: Monte Carlo simulation for a sampling distribution of the mean

This Monte Carlo simulation is designed to randomly select samples with size n from a normally distributed population (the normal distribution, \(\mu\) = 0, \(\sigma\) = 1). In a real behavioral research situation, the population of interest may be IQ scores, neuroticism scores, or helpfulness scores, but assuming the scores are normally distributed, they will look like this:

curve(dnorm(x),
      xlim = c(-4, 4),
      main = "Hypothetical Population of Raw Scores (Y)",
      xlab = "z",
      ylab = "density",
      bty = "n", # removes borders
      lwd = 2)

Simulate a sampling distribution of the mean

A sampling distribution of the mean is produced by randomly selecting all the possible samples of size n from a population and calculating the sample mean for each. With the large populations and small samples that are typical in behavioral research, the number of repetitions needed to collect “all the possible samples” is very large, but approximating this process with a smaller population and far fewer repetitions will still reveal the important connection between sample size and sampling error.

The following Monte Carlo Simulation is “self-contained”, meaning you should be able to clear everything from your R work space, copy and paste the code into the console, and hit enter. The code for this and many other simulations can be found at econometrics-with-r.org.

# set sample size and number of samples
n <- 100
reps <- 1000
 
# perform random sampling; 100 x 1,000 sample matrix
samples <- replicate(reps, rnorm(n))

# compute sample means
sample.avgs <- colMeans(samples)

#Plot the density histogram
hist(sample.avgs, 
    xlim = c(-0.3,0.3),
    freq = FALSE, 
    xlab = paste("n = ", n, "   SEM = ", 1/sqrt(n)))
 
# overlay the theoretical distribution of sample averages on top of the histogram
curve(dnorm(x, sd = 1/sqrt(n)), 
     col = "red", 
     lwd = "2", 
     add = TRUE)

The first two lines of the code set the sample size n = 100 and the number of samples to be selected, 1,000. The third and fourth lines organize the scores into a matrix (one sample per column) and calculate the mean for each sample, \(\bar{Y}\). By inspecting the samples matrix in the “Environment” tab, you can confirm that each of the 1,000 samples is unique.

The last code segment plots the red line, which is the predicted sample mean distribution with a standard deviation equal to the standard error of the mean, sd = 1/sqrt(n), which is \(\pm\) 0.1 when n = 100. A simple visual inspection of the plots below will show that well over half (actually, a bit over 2/3) of the sample means fall within \(\pm\) 1 standard error of \(\mu\) = 0.

\(\displaystyle \sigma_\bar{Y} = \frac{\sigma}{\sqrt n}\)

To see how larger sample sizes reduce sampling error, or the discrepancy between \(\mu\) and \(\bar{Y}\), simply rerun the simulation over and over with different sample sizes. Below are sampling distributions of the mean where n = 100, n = 1,000 and n = 10,000. Remember that the hypothetical population from which the samples are drawn had \(\mu\) = 0, so sample means that fall above 0 are overestimates of the population mean and sample means that fall below 0 are underestimates.

Notice that the X-axis scale has been limited from -0.3 to +0.3 to show how the sample means get more and more concentrated around \(\mu\) = 0 with larger sample sizes. This concentration means that many more sample means are pilled up in this location, and as a result, the Y-axis scale gets larger with each increase in n even though each plot contains 1,000 sample mean scores. Another way to make the comparison is to overlay the first two prediction lines on the last plot (n = 100 is green and n = 1,000 is blue). Here, you can see how the 1,000 means are being squeezed into closer and closer proximity to \(\mu\) = 0, which is what you’d expect to see as your sample means become better and better estimates of \(\mu\).

Simulated Distributions and Hypothesis Testing

Andrew Brandt, PhD

Experimental Psychologist

User Experience Researcher