1 Import data into RStudio

The partial data set below shows gender identity, typical morning wake-up time, and academic exam scores collected from 16 participants in a correlational research study. The researchers were interested in how academic performance (outcome variable) differs across gender and average wake-up time (predictor variables).

Participant	Gender (X)	Wake-up Time (X)	Academic Performance (Y)
1	Male	11	2.4
2	Male	9	3.6
3	Female	9	3.2
4	Male	11	2.2
5	Female	7	3.8

Data in this format (e.g., tables or spreadsheets) cannot be directly pasted into RStudio. The following sections cover a few ways to import data.

1.1 Import data via URL

The tutorial data sets are available online in .csv format, which you can import directly into RStudio using its URL (no need to download the file). This process creates a new data frame in RStudio, which is just another name for a “2D” data set where each column is one variable and each row is one case (e.g., one participant’s score on each variable).

Use the read.csv() function to “read” the data into RStudio.

# create a new data frame called "wakeup.df"
wakeup.df <- read.csv("https://andrewebrandt.github.io/object/datasets/wakeup.csv", header = TRUE)
wakeup.df   # show the data

##    Partic Gender WakeTime AcdPerform.Y
## 1       1   Male       11          2.4
## 2       2   Male        9          3.6
## 3       3 Female        9          3.2
## 4       4   Male       11          2.2
## 5       5 Female        7          3.8
## 6       6   Male       10          2.2
## 7       7 Female       10          2.9
## 8       8   Male        8          3.1
## 9       9 Female        9          3.5
## 10     10 Female       10          3.2
## 11     11 Female       11          3.1
## 12     12   Male        6          3.5
## 13     13   Male        8          3.1
## 14     14 Female        7          3.2
## 15     15 Female        9          2.4
## 16     16   Male        6          3.9

1.2 Examine a data frame

Important: After you create or import a new data frame, it will appear in the Environment panel of RStudio (upper right). Your data is not properly loaded if it does not appear in the Environment.

Data frame and variables in the environment panel

To examine a new data frame:

Click the blue button to the left of the data frame name to see information about each variable. Notice that R has assigned a type to each variable: “Gender” is a character variable (chr), “WakeTime” is an integer variable (int), and “AcdPerform.Y” is a numeric variable (num).
Click the spreadsheet icon to the right of the data frame name to view the entire data frame in spreadsheet format.
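
The same checks can be run from the console. The following is a minimal sketch that uses only base R. It rebuilds the first five rows of the table above under the hypothetical name demo.df (so wakeup.df is left untouched) and then inspects the result; the names demo.df and col.types are placeholders for this example.

```r
# Hypothetical stand-in for wakeup.df: the five rows shown in the table above
demo.df <- read.csv(text = "Partic,Gender,WakeTime,AcdPerform.Y
1,Male,11,2.4
2,Male,9,3.6
3,Female,9,3.2
4,Male,11,2.2
5,Female,7,3.8", stringsAsFactors = FALSE)

str(demo.df)                          # type of each variable (int, chr, int, num)
head(demo.df, n = 3)                  # first three rows
dim(demo.df)                          # number of rows and columns
col.types <- sapply(demo.df, class)   # named vector of column classes
col.types
```

Running str() on wakeup.df itself should report the same variable types that the Environment panel shows.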

1.3 Save a data frame

Use the write.csv() function to save the data frame to your computer. Be sure to change the file path (everything inside the quotation marks) to a location on your computer. The line below is commented out with #; delete the # to run it.

# write.csv(wakeup.df,"D:/My Drive/RStudio Working Directory/wakeup.csv", row.names = FALSE)
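
To confirm that a file was saved correctly, one option is to read it back in and compare it with the original. The sketch below assumes a temporary file created with tempfile() rather than a real folder on your computer, and it uses a hypothetical five-row data frame (demo.df) in place of wakeup.df.

```r
# Hypothetical five-row data frame (first rows of the tutorial table)
demo.df <- data.frame(Partic = 1:5,
                      Gender = c("Male", "Male", "Female", "Male", "Female"),
                      WakeTime = c(11L, 9L, 9L, 11L, 7L),
                      AcdPerform.Y = c(2.4, 3.6, 3.2, 2.2, 3.8),
                      stringsAsFactors = FALSE)

csv.path <- tempfile(fileext = ".csv")            # temporary location (assumption for this demo)
write.csv(demo.df, csv.path, row.names = FALSE)   # row.names = FALSE omits the row-number column
reloaded.df <- read.csv(csv.path, stringsAsFactors = FALSE)   # read the saved file back in
all.equal(demo.df, reloaded.df)                   # TRUE when the round trip preserved the data
```

Without row.names = FALSE, write.csv() would add an extra column of row numbers, which then shows up as an unwanted variable when the file is imported again.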

1.4 Import saved data

If you already have a data set saved as a .csv file on your computer (like the wakeup.csv file saved in the previous section), you can use the “Import Dataset” option in the Environment panel to import it: select “From text (base)”, select the file, and click “Import”. The same “Import Dataset” feature also has built-in options for importing Excel, SPSS, SAS, and Stata files.

2 Visualize and describe relationships

2.1 Quantitative variables

Create a scatter plot with a best-fit regression line (and a shaded 95% confidence band around it) to visualize the relationship between wake-up time and academic performance. Base R has many plotting functions, but in most cases it’s easier to make a high-quality plot using the ggplot2 package.

Start by making sure you have the ggplot2 package installed.

library(ggplot2)
ggplot(wakeup.df, aes(x = WakeTime, y = AcdPerform.Y)) +        # data frame, aesthetic(x, y)
  geom_point(na.rm = TRUE) +                                    # add points
  geom_smooth(na.rm = TRUE, method = "lm", formula = y ~ x) +   # add best-fit line and 95% confidence band
  scale_x_continuous(name = "Wake-up Time") +                   # format x axis
  scale_y_continuous(name = "Academic Performance") +           # format y axis
  theme_minimal() +                                             # remove shading
  theme(text = element_text(size = 16))                         # adjust text size
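
The line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit of Y on X. As a rough sketch of what it computes, the base R code below fits the same model with lm(). The vector names wake.time and acad.perf are placeholders holding the 16 scores listed in Appendix A.

```r
# Placeholder vectors: the 16 scores listed in Appendix A
wake.time <- c(11, 9, 9, 11, 7, 10, 10, 8, 9, 10, 11, 6, 8, 7, 9, 6)
acad.perf <- c(2.4, 3.6, 3.2, 2.2, 3.8, 2.2, 2.9, 3.1, 3.5, 3.2, 3.1, 3.5, 3.1, 3.2, 2.4, 3.9)

fit <- lm(acad.perf ~ wake.time)   # least-squares regression of Y on X
coef(fit)                          # intercept and slope of the best-fit line

# The slope equals the covariance of X and Y divided by the variance of X
slope.by.hand <- cov(wake.time, acad.perf) / var(wake.time)
slope.by.hand
```

A negative slope means the fitted line falls from left to right, that is, predicted academic performance decreases as wake-up time gets later.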

Use Pearson’s r statistic (correlation) to describe the strength and direction of the linear relationship between wake-up time and academic performance. Notice that each variable is called by connecting the data frame name to the variable name using the dollar sign ($).

cor(wakeup.df$WakeTime, wakeup.df$AcdPerform.Y,    # data frame name $ variable name
    use = "complete.obs")                          # omit rows with missing data (NA)

## [1] -0.6948371
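
To see where this number comes from, Pearson’s r can also be computed directly from its definition: the sum of the cross-products of the deviation scores, divided by the square root of the product of the two sums of squared deviations. The sketch below assumes the 16 scores are typed in as the placeholder vectors wake.time and acad.perf (values from Appendix A) and compares the result with cor().

```r
# Placeholder vectors: the 16 scores listed in Appendix A
wake.time <- c(11, 9, 9, 11, 7, 10, 10, 8, 9, 10, 11, 6, 8, 7, 9, 6)
acad.perf <- c(2.4, 3.6, 3.2, 2.2, 3.8, 2.2, 2.9, 3.1, 3.5, 3.2, 3.1, 3.5, 3.1, 3.2, 2.4, 3.9)

dev.x <- wake.time - mean(wake.time)   # deviation of each X score from the X mean
dev.y <- acad.perf - mean(acad.perf)   # deviation of each Y score from the Y mean

# sum of cross-products divided by the square root of the product of the sums of squares
r.by.hand <- sum(dev.x * dev.y) / sqrt(sum(dev.x^2) * sum(dev.y^2))
r.by.hand
all.equal(r.by.hand, cor(wake.time, acad.perf))   # TRUE if both approaches agree
```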

Create a descriptive statistics summary table for the quantitative variables, wake-up time and academic performance. The approach shown below uses the pipe operator (%>%) from the dplyr data manipulation package to chain multiple functions and save the results in a table (data frame). The descriptive statistics are calculated using the describe() function in the psych package.

Start by making sure you have the dplyr and psych packages installed (see Tutorial 0 for help).

library(dplyr)                          # load dplyr package
library(psych)                          # load psych package
summary1.df <- wakeup.df %>%            # create a new df from an existing df
  select(WakeTime, AcdPerform.Y) %>%    # select variables
  describe(skew = FALSE)                # apply the describe() function, omit skew and kurtosis
summary1.df                             # show the results

##              vars  n mean   sd min  max range   se
## WakeTime        1 16 8.81 1.68 6.0 11.0   5.0 0.42
## AcdPerform.Y    2 16 3.08 0.54 2.2  3.9   1.7 0.13
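
Each value in this table can be reproduced with base R functions, which may help clarify what describe() is reporting. The sketch below is one way to do that for wake-up time. It assumes the scores are stored in the placeholder vector wake.time (values from Appendix A), and the rounded results should match the WakeTime row of the table above.

```r
# Placeholder vector: the 16 wake-up times listed in Appendix A
wake.time <- c(11, 9, 9, 11, 7, 10, 10, 8, 9, 10, 11, 6, 8, 7, 9, 6)

n  <- length(wake.time)   # number of scores
m  <- mean(wake.time)     # mean
s  <- sd(wake.time)       # sample standard deviation (divides by n - 1)
se <- s / sqrt(n)         # standard error of the mean

stats.wake <- round(c(n = n, mean = m, sd = s,
                      min = min(wake.time), max = max(wake.time),
                      range = diff(range(wake.time)), se = se), 2)
stats.wake
```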

2.2 Qualitative and quantitative variables

Create a strip chart to visualize the relationship between a qualitative (categorical) predictor (X) variable and a quantitative outcome (Y) variable, like gender and academic performance. A strip chart displays the means (open circles) and standard errors (error bars), similar to a classic bar plot (see Appendix B), with the added benefit of showing all the raw scores in each category.

To reduce readability issues that can arise with too many overlapping data points, the points in this plot are slightly transparent (alpha) and the geom_jitter() function adds a small amount of horizontal offset within each category. See this article on Doing Better Data Visualization for more discussion (Advances in Methods and Practices in Psychological Science, 2021).

library(ggplot2)
ggplot(wakeup.df, aes(x = Gender, y = AcdPerform.Y, color = Gender)) +                  # data frame, aesthetic
  geom_jitter(na.rm = TRUE, position = position_jitter(width = 0.05, height = 0),       # add data points with
              size = 2, alpha = 0.7) +                                                  # horizontal jitter only
  stat_summary(fun = mean, na.rm = TRUE, geom = "point", shape = 1, size = 2, color = "red") +     # add mean
  stat_summary(fun.data = mean_se, na.rm = TRUE, geom = "errorbar", width = 0.1, color = "red") +  # add error bars
  scale_x_discrete(name = "Gender") +                                                   # x axis label
  scale_y_continuous(name = "Academic Performance") +                                   # y axis label
  theme_minimal() +                                                                     # remove shading
  theme(text = element_text(size = 16), legend.position = "none") +                     # adjust text size and remove legend
  scale_color_brewer(palette = "Dark2")                                                 # vary data point color by group

A classic box plot is another useful tool for examining a single variable or the relationship between a categorical predictor (X) and quantitative outcome (Y) variable. The figure below the box plot shows how to interpret each feature.

library(ggplot2)
ggplot(wakeup.df, aes(x = Gender, y = AcdPerform.Y, color = Gender)) +   # color = grouping variable
  geom_boxplot() +                                                       # add box
  scale_y_continuous(name = "Academic Performance") +                    # format y axis
  theme_minimal() +                                                      # remove shading
  theme(text = element_text(size = 16), legend.position = "none") +      # adjust text size and remove legend
  scale_color_brewer(palette = "Dark2") +                                # vary box color by group
  coord_flip()                                                           # flip x and y coordinates

Features of a box plot

Create a descriptive statistics summary table by category to see how a quantitative outcome (Y) variable, like academic performance, varies across levels of a categorical predictor (X) variable, like gender. In the approach shown below, the descriptive statistics are calculated using the describeBy() function in the psych package.

library(psych)                                   # load psych package
summary2.df <- describeBy(AcdPerform.Y ~ Gender, # quantitative variable ~ categorical variable
    data = wakeup.df, mat = TRUE, skew = FALSE)  # data frame, matrix format, omit skew and kurtosis
summary2.df                                      # show the results

##               item group1 vars n   mean       sd min max range        se
## AcdPerform.Y1    1 Female    1 8 3.1625 0.410357 2.4 3.8   1.4 0.1450831
## AcdPerform.Y2    2   Male    1 8 3.0000 0.663325 2.2 3.9   1.7 0.2345208
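
If the psych package is not available, a similar by-group summary can be sketched in base R by splitting the outcome scores by category and applying a function to each group. The example below is an illustration and not the tutorial’s method. The names gender, acad.perf, and summary.by.hand are placeholders, and the values come from Appendix A.

```r
# Placeholder vectors: the 16 cases listed in Appendix A
gender <- c("Male", "Male", "Female", "Male", "Female", "Male", "Female", "Male",
            "Female", "Female", "Female", "Male", "Male", "Female", "Female", "Male")
acad.perf <- c(2.4, 3.6, 3.2, 2.2, 3.8, 2.2, 2.9, 3.1, 3.5, 3.2, 3.1, 3.5, 3.1, 3.2, 2.4, 3.9)

groups <- split(acad.perf, gender)      # list with one vector of scores per category

group.n    <- sapply(groups, length)    # group sizes
group.mean <- sapply(groups, mean)      # group means
group.sd   <- sapply(groups, sd)        # group standard deviations
group.se   <- group.sd / sqrt(group.n)  # group standard errors

summary.by.hand <- data.frame(n = group.n, mean = group.mean, sd = group.sd, se = group.se)
summary.by.hand
```

The n, mean, sd, and se columns should agree with the corresponding columns of summary2.df.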

2.3 Reporting the results in APA style

The following statements show how the results may be reported in an APA-style report.

The average participant wake-up time was approximately 8:50 a.m. (M = 8.81, SD = 1.68), and the mean academic performance was just above a B (M = 3.08, SD = 0.54). The scatter plot shows that later wake-up times were associated with lower academic performance, r = -.69.

The strip chart shows that academic performance among female participants was somewhat higher on average and less variable (M = 3.16, SD = 0.41) than among male participants (M = 3.00, SD = 0.66).
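
The statistics in these statements can be formatted in R before they are pasted into a report. The sketch below assumes two common APA conventions (two decimal places, and no leading zero for statistics such as r that cannot exceed 1 in absolute value); check the current APA manual for your own reports. The names wake.txt and r.txt are placeholders, and the data come from Appendix A.

```r
# Placeholder vectors: the 16 scores listed in Appendix A
wake.time <- c(11, 9, 9, 11, 7, 10, 10, 8, 9, 10, 11, 6, 8, 7, 9, 6)
acad.perf <- c(2.4, 3.6, 3.2, 2.2, 3.8, 2.2, 2.9, 3.1, 3.5, 3.2, 3.1, 3.5, 3.1, 3.2, 2.4, 3.9)

# Mean and standard deviation, each rounded to two decimal places
wake.txt <- sprintf("M = %.2f, SD = %.2f", mean(wake.time), sd(wake.time))

# Correlation rounded to two decimal places, then the leading zero is removed
r.txt <- sub("0.", ".", sprintf("r = %.2f", cor(wake.time, acad.perf)), fixed = TRUE)

wake.txt
r.txt
```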

3 Appendices

3.1 Appendix A: Create a data frame

To enter data manually, start by choosing a name for each new variable (Partic, Gender, WakeTime, AcdPerform.Y) and “combine” all the scores for that variable, in participant order, using c():

# create 4 separate variables; assign data in order within each variable
Partic <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
Gender <- c("Male", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Female", "Female", "Male", "Male", "Female", "Female", "Male")
WakeTime <- c(11, 9, 9, 11, 7, 10, 10, 8, 9, 10, 11, 6, 8, 7, 9, 6)
AcdPerform.Y <- c(2.4, 3.6, 3.2, 2.2, 3.8, 2.2, 2.9, 3.1, 3.5, 3.2, 3.1, 3.5, 3.1, 3.2, 2.4, 3.9)

# now combine the four separate variables into one data frame
wakeup.df <- data.frame(Partic, Gender, WakeTime, AcdPerform.Y)

# show the data frame in the console
wakeup.df

##    Partic Gender WakeTime AcdPerform.Y
## 1       1   Male       11          2.4
## 2       2   Male        9          3.6
## 3       3 Female        9          3.2
## 4       4   Male       11          2.2
## 5       5 Female        7          3.8
## 6       6   Male       10          2.2
## 7       7 Female       10          2.9
## 8       8   Male        8          3.1
## 9       9 Female        9          3.5
## 10     10 Female       10          3.2
## 11     11 Female       11          3.1
## 12     12   Male        6          3.5
## 13     13   Male        8          3.1
## 14     14 Female        7          3.2
## 15     15 Female        9          2.4
## 16     16   Male        6          3.9

3.2 Appendix B: Bar plots

Create a classic bar plot using the mean and standard error from a descriptive summary table (summary2.df), not the raw data (wakeup.df). Examine summary2.df in the “Environment” panel to see that group1 is the categorical variable with labels “Female” and “Male”, the means are in the mean variable, and the standard errors are in the se variable.

library(ggplot2)
ggplot(summary2.df, aes(x = group1, y = mean)) +             # data frame, aesthetic
  geom_col(width = 0.5, fill = "steelblue") +                # add bars, adjust width and fill color
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),     # add error bars, aes(negative se,positive se)
                width = .1) +                                # error bar width
  scale_x_discrete(name = "Gender") +                        # format x axis
  scale_y_continuous(name = "Mean Academic Performance") +   # format y axis
  theme_minimal() +                                          # remove shading
  theme(text = element_text(size = 16))                      # adjust text size

Data Frames, Descriptive Statistics, and Data Visualization

Andrew Brandt, PhD

Experimental Psychologist

User Experience Researcher