The partial data set below shows gender identity, typical morning wake-up time, and academic exam scores collected from 16 participants in a correlational research study. The researchers were interested in how academic performance (outcome variable) differs across gender and average wake-up time (predictor variables).
Participant | Gender (X) | Wake-up Time (X) | Academic Performance (Y) |
---|---|---|---|
1 | Male | 11 | 2.4 |
2 | Male | 9 | 3.6 |
3 | Female | 9 | 3.2 |
4 | Male | 11 | 2.2 |
5 | Female | 7 | 3.8 |
Data in this format (e.g., tables or spreadsheets) cannot be directly pasted into RStudio. The following sections cover a few ways to import data.
The tutorial data sets are available online in .csv format, which you can import directly into RStudio using its URL (no need to download the file). This process creates a new data frame in RStudio, which is just another name for a “2D” data set where each column is one variable and each row is one case (e.g., one participant’s score on each variable).
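To make the idea of a data frame concrete, the sketch below builds one by hand from the five rows shown in the table above. The object name demo.df is made up for this illustration and is not part of the tutorial data.

```r
# build a small data frame from the five rows in the table above
# (demo.df is a hypothetical name used only for this illustration)
demo.df <- data.frame(
  Partic = 1:5,
  Gender = c("Male", "Male", "Female", "Male", "Female"),
  WakeTime = c(11, 9, 9, 11, 7),
  AcdPerform.Y = c(2.4, 3.6, 3.2, 2.2, 3.8)
)
nrow(demo.df) # number of cases (rows)
ncol(demo.df) # number of variables (columns)
demo.df[3, ]  # every variable for one case (participant 3)
```

Each named argument to data.frame() becomes one column, and all columns must have the same number of scores.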
Use the read.csv() function to “read” the data into RStudio.
# create a new data frame called "wakeup.df"
wakeup.df <- read.csv("https://andrewebrandt.github.io/object/datasets/wakeup.csv", header = TRUE)
# show the data
wakeup.df
## Partic Gender WakeTime AcdPerform.Y
## 1 1 Male 11 2.4
## 2 2 Male 9 3.6
## 3 3 Female 9 3.2
## 4 4 Male 11 2.2
## 5 5 Female 7 3.8
## 6 6 Male 10 2.2
## 7 7 Female 10 2.9
## 8 8 Male 8 3.1
## 9 9 Female 9 3.5
## 10 10 Female 10 3.2
## 11 11 Female 11 3.1
## 12 12 Male 6 3.5
## 13 13 Male 8 3.1
## 14 14 Female 7 3.2
## 15 15 Female 9 2.4
## 16 16 Male 6 3.9
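The header = TRUE argument tells read.csv() to treat the first line of the file as variable names. The sketch below illustrates this without an internet connection by passing a few lines of comma-separated text through the text argument instead of a URL. The object names are made up, and the layout of the text is assumed to match the printed output above rather than copied from the online file.

```r
# a few comma-separated lines, assumed to mirror the layout of wakeup.csv
csv.text <- "Partic,Gender,WakeTime,AcdPerform.Y
1,Male,11,2.4
2,Male,9,3.6
3,Female,9,3.2"
demo.df <- read.csv(text = csv.text, header = TRUE) # first line becomes the variable names
names(demo.df) # variable names taken from the header line
str(demo.df)   # type of each variable (numeric vs. text)
```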
Important: After you create or import a new data frame, it will appear in the Environment panel of RStudio (upper right). Your data is not properly loaded if it does not appear in the Environment.
To examine a new data frame:
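Several base R functions are commonly used for this. The sketch below applies a few of them to a small stand-in data frame; demo.df is a hypothetical name, and in practice you would substitute wakeup.df.

```r
# small stand-in data frame (hypothetical; use wakeup.df in practice)
demo.df <- data.frame(
  Gender = c("Male", "Male", "Female", "Male", "Female"),
  WakeTime = c(11, 9, 9, 11, 7),
  AcdPerform.Y = c(2.4, 3.6, 3.2, 2.2, 3.8)
)
str(demo.df)     # variable names, types, and first few values
head(demo.df, 3) # first three rows
dim(demo.df)     # number of rows, then number of columns
summary(demo.df) # quick summary of each variable
# View(demo.df)  # opens a spreadsheet-style viewer in RStudio
```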
Use the write.csv() function to save the data frame to your computer. Be sure to change the file path (the text in quotation marks) to a location on your computer, and remove the leading # to run the line.
# write.csv(wakeup.df,"D:/My Drive/RStudio Working Directory/wakeup.csv", row.names = FALSE)
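To try the save step without choosing a folder, the sketch below writes a small stand-in data frame to R's temporary directory and reads it back. The names demo.df, demo.path, and reloaded.df are hypothetical; the temporary file is deleted when the R session ends, so use a real path for anything you want to keep.

```r
# small stand-in data frame (hypothetical; use wakeup.df in practice)
demo.df <- data.frame(
  Partic = c(1, 2, 3),
  Gender = c("Male", "Male", "Female"),
  AcdPerform.Y = c(2.4, 3.6, 3.2)
)
demo.path <- file.path(tempdir(), "demo.csv")    # temporary location for the file
write.csv(demo.df, demo.path, row.names = FALSE) # save without a row-number column
file.exists(demo.path)                           # confirm the file was written
reloaded.df <- read.csv(demo.path, header = TRUE) # read the saved file back in
reloaded.df
```

Setting row.names = FALSE keeps R from adding an extra column of row numbers to the saved file.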
If you already have a data set saved to a .csv file on your computer (like the wakeup.csv file you saved in the previous section), you can use the “Import Dataset” option in the Environment panel to import it: select “From text (base)”, then select the file, and click “Import”. The same “Import Dataset” feature also has built-in functions for importing Excel, SPSS, SAS, and Stata files.
Create a scatter plot with a best-fit regression line (and standard error) to visualize the relationship between wake-up time and academic performance. Base R has many plotting functions, but in most cases it’s easier to make a high-quality plot using the ggplot2 package.
Start by making sure you have the ggplot2 package installed.
library(ggplot2)
ggplot(wakeup.df, aes(x = WakeTime, y = AcdPerform.Y)) + # data frame, aesthetic(x,y)
geom_point(na.rm = TRUE) + # add points
geom_smooth(na.rm = TRUE, method = "lm", formula = "y ~ x") + # add best fit line and standard error
scale_x_continuous(name = "Wake-up Time") + # format x axis
scale_y_continuous(name = "Academic Performance") + # format y axis
theme_minimal() + # remove shading
theme(text = element_text(size = 16)) # adjust text size
Use Pearson’s r statistic (correlation) to describe the strength and direction of the linear relationship between wake-up time and academic performance. Notice that each variable is called by connecting the data frame name to the variable name using the dollar sign ($).
cor(wakeup.df$WakeTime, wakeup.df$AcdPerform.Y, # data frame name $ variable name
use = "complete.obs") # omit rows with missing data (NA)
## [1] -0.6948371
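To see what cor() is computing, the sketch below recalculates Pearson’s r as the covariance of the two variables divided by the product of their standard deviations. The scores are typed in from the data shown earlier; r.manual and r.builtin are names chosen for this illustration.

```r
# scores copied from the wakeup data shown earlier
WakeTime <- c(11, 9, 9, 11, 7, 10, 10, 8, 9, 10, 11, 6, 8, 7, 9, 6)
AcdPerform.Y <- c(2.4, 3.6, 3.2, 2.2, 3.8, 2.2, 2.9, 3.1, 3.5, 3.2, 3.1, 3.5, 3.1, 3.2, 2.4, 3.9)
# Pearson's r = covariance / (sd of X * sd of Y)
r.manual <- cov(WakeTime, AcdPerform.Y) / (sd(WakeTime) * sd(AcdPerform.Y))
r.builtin <- cor(WakeTime, AcdPerform.Y) # same statistic from the built-in function
c(manual = r.manual, builtin = r.builtin) # the two values should agree
```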
Create a descriptive statistics summary table for the quantitative variables, wake-up time and academic performance. The approach shown below uses the dplyr data manipulation language to connect (%>%) multiple functions and save the results in a table (data frame). The descriptive statistics are calculated using the describe() function in the psych package.
Start by making sure you have the dplyr and psych packages installed (see Tutorial 0 for help).
library(dplyr) # load dplyr package
library(psych) # load psych package
summary1.df <- wakeup.df %>% # create a new df from an existing df
  select(WakeTime, AcdPerform.Y) %>% # select variables
  describe(skew = FALSE) # apply the describe() function
# show the results
summary1.df
## vars n mean sd min max range se
## WakeTime 1 16 8.81 1.68 6.0 11.0 5.0 0.42
## AcdPerform.Y 2 16 3.08 0.54 2.2 3.9 1.7 0.13
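The columns in this table can be reproduced with base R, which may help clarify what each one means. The sketch below does this for wake-up time only, using scores typed in from the data shown earlier; the object names are chosen for this illustration. The standard error (se) is the standard deviation divided by the square root of the sample size.

```r
# wake-up times copied from the wakeup data shown earlier
WakeTime <- c(11, 9, 9, 11, 7, 10, 10, 8, 9, 10, 11, 6, 8, 7, 9, 6)
n <- length(WakeTime)         # sample size
m <- mean(WakeTime)           # mean
s <- sd(WakeTime)             # standard deviation
se <- s / sqrt(n)             # standard error of the mean
rng <- max(WakeTime) - min(WakeTime) # range
round(c(n = n, mean = m, sd = s, range = rng, se = se), 2) # compare with the WakeTime row above
```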
Create a strip chart to visualize the relationship between a qualitative, or categorical, predictor (X) variable and a quantitative outcome (Y) variable, like gender and academic performance. A strip chart displays the means (open circles) and standard errors (error bars), similar to a classic bar plot (see Appendix B), with the added benefit of showing all the raw scores in each category.
To reduce readability issues that can arise with too many overlapping data points, the points in this plot are slightly transparent (alpha) and the geom_jitter() function adds a small amount of horizontal offset within each category. See this article on Doing Better Data Visualization for more discussion (Advances in Methods and Practices in Psychological Science, 2021).
library(ggplot2)
ggplot(wakeup.df, aes(x = Gender, y = AcdPerform.Y, color = Gender)) + # data frame, aesthetic
geom_jitter(na.rm = TRUE, position = position_jitter(0.05), size = 2, alpha = 0.7) + # add data points
stat_summary(fun = mean, na.rm = TRUE, geom = "point", shape = 1, size = 2, color = "red") + # add mean
stat_summary(fun.data = mean_se, na.rm = TRUE, geom = "errorbar", width = .1, color = "red") + # add error bars
scale_x_discrete(name = "Gender") + # x axis label
scale_y_continuous(name = "Academic Performance") + # y axis label
theme_minimal() + # remove shading
theme(text = element_text(size = 16), legend.position = "none") + # adjust text size and remove legend
scale_color_brewer(palette="Dark2") # vary data point color by group
A classic box plot is another useful tool for examining a single variable or the relationship between a categorical predictor (X) and quantitative outcome (Y) variable. The figure below the box plot shows how to interpret each feature.
library(ggplot2)
ggplot(wakeup.df, aes(x = Gender, y = AcdPerform.Y, color = Gender)) + # color = grouping variable
geom_boxplot() + # add box
scale_y_continuous(name = "Academic Performance") + # format y axis
theme_minimal() + # remove shading
theme(text = element_text(size = 16),legend.position="none") + # adjust text size and remove legend
scale_color_brewer(palette="Dark2") + # vary box color by group
coord_flip() # flip x and y coordinates
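The main features of a box plot can also be calculated directly. The sketch below does this for the female participants’ scores, typed in from the data shown earlier, assuming the default rule in which the whiskers reach the most extreme scores within 1.5 × IQR of the box and anything beyond is drawn as an individual point. All object names are chosen for this illustration.

```r
# academic performance scores for the female participants, copied from the data
female.scores <- c(3.2, 3.8, 2.9, 3.5, 3.2, 3.1, 3.2, 2.4)
q <- quantile(female.scores, c(0.25, 0.5, 0.75)) # box edges (quartiles) and median line
iqr <- q[["75%"]] - q[["25%"]]                   # interquartile range = width of the box
lower.fence <- q[["25%"]] - 1.5 * iqr            # whiskers cannot extend below this value
upper.fence <- q[["75%"]] + 1.5 * iqr            # whiskers cannot extend above this value
# scores outside the fences would be plotted as individual outlier points
outliers <- female.scores[female.scores < lower.fence | female.scores > upper.fence]
q
c(lower = lower.fence, upper = upper.fence)
outliers
```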
Create a descriptive statistics summary table by category to see how a quantitative outcome (Y) variable, like academic performance, varies across levels of a categorical predictor (X) variable, like gender. In the approach shown below, the descriptive statistics are calculated using the describeBy() function in the psych package.
library(psych) # load psych package
summary2.df <- describeBy(AcdPerform.Y ~ Gender, # quantitative variable ~ categorical variable
  data = wakeup.df, mat = TRUE, skew = FALSE) # data frame, matrix format, omit skew and kurtosis
# show the results
summary2.df
## item group1 vars n mean sd min max range se
## AcdPerform.Y1 1 Female 1 8 3.1625 0.410357 2.4 3.8 1.4 0.1450831
## AcdPerform.Y2 2 Male 1 8 3.0000 0.663325 2.2 3.9 1.7 0.2345208
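As a cross-check that does not require the psych package, the sketch below computes the same group statistics with the base R function tapply(), which applies a function to one variable separately within each level of another. The scores are typed in from the data shown earlier, and the group.* names are chosen for this illustration.

```r
# scores copied from the wakeup data shown earlier
Gender <- c("Male", "Male", "Female", "Male", "Female", "Male", "Female", "Male",
            "Female", "Female", "Female", "Male", "Male", "Female", "Female", "Male")
AcdPerform.Y <- c(2.4, 3.6, 3.2, 2.2, 3.8, 2.2, 2.9, 3.1, 3.5, 3.2, 3.1, 3.5, 3.1, 3.2, 2.4, 3.9)
group.n <- tapply(AcdPerform.Y, Gender, length) # sample size in each group
group.mean <- tapply(AcdPerform.Y, Gender, mean) # mean in each group
group.sd <- tapply(AcdPerform.Y, Gender, sd)     # standard deviation in each group
group.se <- group.sd / sqrt(group.n)             # standard error in each group
round(cbind(n = group.n, mean = group.mean, sd = group.sd, se = group.se), 3)
```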
The following statements show how the results may be reported in an APA-style report.
The average participant wake-up time was approximately 8:45 am (M = 8.81, SD = 1.68) and the mean academic performance was just above a B (M = 3.08, SD = 0.54). The scatter plot shows that later wake-up times were associated with lower academic performance, r = -0.695.
The strip chart shows that academic performance among female participants was somewhat higher on average and less variable (M = 3.16, SD = 0.41) than among male participants (M = 3.00, SD = 0.66).
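One way to avoid copying errors when writing statements like these is to let R format the statistics. The sketch below is only an illustration of that idea, using sprintf() to round the wake-up time statistics to two decimals; apa.text is a made-up name and the wording of the sentence is still up to you.

```r
# wake-up times copied from the wakeup data shown earlier
WakeTime <- c(11, 9, 9, 11, 7, 10, 10, 8, 9, 10, 11, 6, 8, 7, 9, 6)
# %.2f inserts a number rounded to two decimal places
apa.text <- sprintf("M = %.2f, SD = %.2f", mean(WakeTime), sd(WakeTime))
apa.text
```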
With manual entry, start by choosing a new variable name (Partic, Gender, WakeTime, AcdPerform.Y) and “combine” all the scores for that variable in order using c():
# create 4 separate variables; assign data in order within each variable
Partic <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
Gender <- c("Male", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Female", "Female", "Male", "Male", "Female", "Female", "Male")
WakeTime <- c(11, 9, 9, 11, 7, 10, 10, 8, 9, 10, 11, 6, 8, 7, 9, 6)
AcdPerform.Y <- c(2.4, 3.6, 3.2, 2.2, 3.8, 2.2, 2.9, 3.1, 3.5, 3.2, 3.1, 3.5, 3.1, 3.2, 2.4, 3.9)
# now combine the four separate variables into one data frame
wakeup.df <- data.frame(Partic, Gender, WakeTime, AcdPerform.Y)
# show the data frame in the console
wakeup.df
## Partic Gender WakeTime AcdPerform.Y
## 1 1 Male 11 2.4
## 2 2 Male 9 3.6
## 3 3 Female 9 3.2
## 4 4 Male 11 2.2
## 5 5 Female 7 3.8
## 6 6 Male 10 2.2
## 7 7 Female 10 2.9
## 8 8 Male 8 3.1
## 9 9 Female 9 3.5
## 10 10 Female 10 3.2
## 11 11 Female 11 3.1
## 12 12 Male 6 3.5
## 13 13 Male 8 3.1
## 14 14 Female 7 3.2
## 15 15 Female 9 2.4
## 16 16 Male 6 3.9
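Manual entry is prone to typing mistakes, so it can help to run a quick check afterward. The sketch below is one possible check on the Gender variable, not a required step: it counts the scores and tabulates the category labels, which would expose a misspelled label or a missing entry. Gender.f is a made-up name for the factor version of the variable.

```r
# gender labels as entered above
Gender <- c("Male", "Male", "Female", "Male", "Female", "Male", "Female", "Male",
            "Female", "Female", "Female", "Male", "Male", "Female", "Female", "Male")
length(Gender)             # should equal the number of participants
table(Gender)              # count of cases per label; a typo would appear as an extra label
Gender.f <- factor(Gender) # store the labels as a categorical variable (factor)
levels(Gender.f)           # the distinct categories, in alphabetical order
```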
Create a classic bar plot using the mean and standard error from a descriptive summary table (summary2.df), not the raw data (wakeup.df). Examine summary2.df in the “Environment” panel to see that group1 is the categorical variable with labels “Female” and “Male”, the means are in the mean variable, and the standard errors are in the se variable.
library(ggplot2)
ggplot(summary2.df, aes(x = group1, y = mean)) + # data frame, aesthetic
geom_col(width = 0.5, fill = "steelblue") + # add bars, adjust width and fill color
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), # add error bars, aes(negative se,positive se)
width = .1) + # error bar width
scale_x_discrete(name = "Gender") + # format x axis
scale_y_continuous(name = "Mean Academic Performance") + # format y axis
theme_minimal() + # remove shading
theme(text = element_text(size = 16)) # adjust text size