1 Import Your Data

The scores in the LifeSatis.csv data file below come from a correlational research study conducted by students in a behavioral research methods course. Participants were introductory psychology students (n = 62) who volunteered to complete an online survey.

Outcome and Predictor Variables: The researchers were interested in developing a model of life satisfaction (outcome variable) based on participants’ sleep, attachment to their phone, and social media use (predictor variables). See Appendix A for the full text of each survey item.

Hypotheses: Higher life satisfaction was expected to be associated with getting enough sleep, being comfortable without a phone, and having less engagement with social media.


1.1 Import and save a new data frame

After data collection for this study had ended, the participant scores were exported from Qualtrics to a comma-separated text file (.csv). Whether the data are saved on your computer, on a shared drive, or on a server, use the read.csv() function to read the data and save the values in a new data frame.

In this command, a new data frame named LifeSat.df is created from the data in the LifeSatis.csv file located on a server.

LifeSat.df  <- read.csv("https://andrewebrandt.github.io/object/datasets/LifeSatis.csv", header = TRUE)
head(LifeSat.df, 10)
##     P Gender LS1 LS2 LS3 LS4 LS5 PhoneWO SleepEn SocialMedia
## 1   1      2   6   6   6   5   6       4       4           2
## 2   2      2   2   1   2   2   2       2       1           3
## 3   3      1   7   7   7   7   7       5       4           1
## 4   4      1   7   5   7   6   7       5      NA           2
## 5   5      2   6   6   6   6   6       4       4           2
## 6   6      1   5   4   6   4   2       5       2           4
## 7   7      1   3   4   2   2   2       3       2           2
## 8   8      1   5   5   5   5   3       2       2           2
## 9   9      1   5   6   5   6   6       2       4           1
## 10 10      2   7   7   7   7   6       4       2           1


Use the same read.csv() function to import .csv data saved on your computer; just be sure to replace the URL with the correct file path on your machine. Alternatively, use the “Import Dataset” option in the Environment panel in RStudio (use “From text (base)” for .csv files).
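After importing, take a moment to confirm the file was read as expected. A minimal sanity-check sketch using base R:

str(LifeSat.df)    # confirm variable names and types
nrow(LifeSat.df)   # confirm the number of rows matches the expected n = 62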

Use the write.csv() function to save a data frame to your computer as a .csv file; just be sure to change the file path to a location on your machine.

# write.csv(LifeSat.df,"D:/My Drive/RStudio Working Directory/LifeSat.csv", row.names = FALSE)


1.2 Examine the new data frame

Look at the first 10 rows of the LifeSat.df data frame above to see that the data are in a simple “2D” format where:

  1. Each column is one variable
  2. Each row is one observation, i.e., one participant’s scores on each variable
  3. Each cell (intersection of one column and one row) contains one value or no value (NA)

Next, carefully inspect the new data frame in the Environment panel in RStudio. Be sure you can identify all the variables in the data frame.

In some cases, you may want to create a smaller data frame with only a subset of the variables (see Appendix B) or rename one or more variables to make them easier to recognize (see Appendix C).


2 Wrangle and Explore Your Data




The LifeSat data set is relatively small and simple compared to many data sets you will encounter as a researcher. Still, a small number of changes are needed to transform this raw data into a tidy, usable form, a process called data wrangling. It is very important that you complete this process before you attempt to explore, visualize, or analyze the data.

Depending on the data set, data wrangling may be simple and fast, or it may present many challenges and require a lot of your time. This tutorial is designed to cover a small number of basic wrangling processes, but there are many more functions in the Tidy Data family that you will likely need to learn as you develop your analysis skills. Tidy Data Cheatsheet


2.1 Calculate scale scores

For any psychological scale with multiple items, like the life satisfaction scale (LS1 through LS5), you will need to apply scoring rules to generate a full scale score or subscale scores. For most small scales, you can simply find the mean score of all the items.

Use the dplyr package to mutate() a new variable for the mean life satisfaction score (average of LS1 through LS5):

library(dplyr)
LifeSat.df <- LifeSat.df %>%
  mutate(LSmean.Y = rowMeans(select(LifeSat.df, LS1:LS5), na.rm = TRUE))
head(LifeSat.df$LSmean.Y)
## [1] 5.8 1.8 7.0 6.4 6.0 4.2


Note, the na.rm = TRUE argument tells the rowMeans() function to exclude NA (empty) cells from the mean calculation on each row of LS scores. In other words, a life satisfaction mean will be calculated for each participant even if that participant skipped some LS items. If you wish to find the mean life satisfaction score only for those participants who completed all five LS items, change TRUE to FALSE.
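To see exactly what na.rm changes, consider this minimal sketch with one hypothetical participant who skipped LS3:

x <- c(7, 6, NA, 5, 6)   # hypothetical LS1-LS5 scores with one skipped item
mean(x, na.rm = TRUE)    # returns 6, the mean of the four completed items
mean(x, na.rm = FALSE)   # returns NA because one item is missing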


2.2 Reverse score

Some psychological scales are designed with a mix of forward and reverse scored items, and one set will need to be reverse scored before being combined with the other. In other cases, a researcher may inadvertently scale an individual item in the opposite direction from the other items, in which case they may choose to reverse score the item prior to any analysis to reduce the chance of confusion (just be sure to reverse the scale text as well).

Review the full survey items in Appendix A and notice that all but one of the Likert items were scaled in the same direction, with higher scores indicating stronger agreement with the statement (Strongly Agree = 5). In contrast, higher scores on the social media use item indicated stronger disagreement with the statement (Strongly Disagree = 5).

Use mutate() to create a new variable for reversed social media scores.

library(dplyr)
LifeSat.df <-                                          # update existing data frame
  mutate(LifeSat.df, SocialMediaRev =                  # create a new variable, SocialMediaRev
                     case_when(SocialMedia == 1 ~ 5,   # old score ~ new score
                               SocialMedia == 2 ~ 4,
                               SocialMedia == 3 ~ 3,
                               SocialMedia == 4 ~ 2,
                               SocialMedia == 5 ~ 1))
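Because the item uses a numeric 1-5 scale, the same reversal can be written as simple arithmetic; this sketch is equivalent to the case_when() above:

# LifeSat.df <- mutate(LifeSat.df, SocialMediaRev = 6 - SocialMedia)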


2.3 Rename qualitative (categorical) scores

Examine the Gender variable in the “Environment” tab and notice that it was imported as an integer type, rather than a categorical type variable, due to the numeric coding of each level.

Use mutate() to change the Gender variable to a categorical type, and replace the numerical values with category names.

library(dplyr)
LifeSat.df <- LifeSat.df %>%                     # update data set
     mutate(Gender = as.character(Gender)) %>%   # change Gender from integer to character
     mutate(Gender = case_when(                  # change condition labels
            Gender == "1" ~ "Female",    
            Gender == "2" ~ "Male",
            Gender == "3" ~ "NonBinary",
            Gender == "4" ~ "Other"))
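As a one-step alternative (a sketch, run on the freshly imported data instead of the two mutate() calls above), factor() can map the numeric codes directly to labels:

# LifeSat.df <- mutate(LifeSat.df,
#   Gender = factor(Gender, levels = 1:4,
#                   labels = c("Female", "Male", "NonBinary", "Other")))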


2.4 Descriptive summary tables

Now that the data have been wrangled into a more usable form, you can examine and summarize the scores within each variable and assess the relationships between variables.

Create a descriptive summary table for one or more quantitative variables. This example introduces the psych package, which contains the describe() and describeBy() functions for finding descriptive statistics.

library(dplyr)                                              # load dplyr package
library(psych)                                              # load psych package
summary.quant <- LifeSat.df %>%                             # create a new df from an existing df
  select(PhoneWO, SleepEn, SocialMediaRev, LSmean.Y) %>%    # select variables
  describe(skew = FALSE)                                    # apply the describe() function
summary.quant                                               # show the results
##                vars  n mean   sd min max range   se
## PhoneWO           1 61 3.18 1.19   1   5     4 0.15
## SleepEn           2 61 2.46 1.06   1   4     3 0.14
## SocialMediaRev    3 61 3.69 1.12   1   5     4 0.14
## LSmean.Y          4 62 4.10 1.50   1   7     6 0.19


Create a descriptive summary table for the outcome variable scores (mean life satisfaction) across levels of a categorical predictor variable (gender).

library(psych)
summary.gender <- describeBy(
  LSmean.Y ~ Gender,            # outcome variable ~ grouping variable
  data = LifeSat.df,            # data frame
  skew = FALSE, range = FALSE,  # omit skew and range info
  mat = TRUE,                   # matrix format
  digits = 2)                   # round values to 2 digits
summary.gender                  # show descriptives
##           item    group1 vars  n mean   sd   se
## LSmean.Y1    1    Female    1 39 4.28 1.34 0.21
## LSmean.Y2    2      Male    1 18 4.28 1.54 0.36
## LSmean.Y3    3 NonBinary    1  4 2.10 1.36 0.68
## LSmean.Y4    4     Other    1  1 1.80   NA   NA


2.5 Remove scores

The regression coefficients (betas) calculated for the relationship between an outcome variable (Y) and a categorical predictor variable (X; e.g., gender) reflect mean differences on Y across levels of X. The extent to which these values are meaningful, or appropriate for inclusion in an inferential test on the slope (beta), therefore depends on the number of scores in each level (e.g., Female, Male, Non-binary, Other).

For this project, we will consider 10 to be the minimum number of scores that a level must have to be included in the analysis (note, this is a very low cutoff, which should be adjusted higher for most research projects). In the descriptive summary table, notice that the “NonBinary” and “Other” groups have only 4 scores and 1 score, respectively; unfortunately, these groups are far too small to be considered for further analysis.

Use mutate() to remove specific scores from a variable (this will set the scores to NA).

library(dplyr)
LifeSat.df <- LifeSat.df %>% 
  mutate(Gender = na_if(Gender, "NonBinary")) %>% 
  mutate(Gender = na_if(Gender, "Other"))
head(LifeSat.df$Gender)
## [1] "Male"   "Male"   "Female" "Female" "Male"   "Female"

Now, rerun the descriptive summary table to update the values.

library(psych)
summary.gender <- describeBy(
  LSmean.Y ~ Gender,            # outcome variable ~ grouping variable
  data = LifeSat.df,            # data frame
  skew = FALSE, range = FALSE,  # omit skew and range info
  mat = TRUE,                   # matrix format
  digits = 2)                   # round values to 2 digits
summary.gender                  # show descriptives
##           item group1 vars  n mean   sd   se
## LSmean.Y1    1 Female    1 39 4.28 1.34 0.21
## LSmean.Y2    2   Male    1 18 4.28 1.54 0.36


2.6 Explore the relationships in your data

Use the pairs.panels() function in the psych package to examine the relationships of interest. This function gives a histogram for each variable as well as a scatter plot and correlation value for each bivariate relationship. Note, this function is used for exploratory purposes only.

library(dplyr)
library(psych)
LifeSat.df %>% 
  select(PhoneWO, SleepEn, SocialMediaRev, LSmean.Y) %>% 
  pairs.panels()


3 Visualize the Relationships in Your Data

Base R has many built-in plotting functions, but in most cases it will be easier to make a high quality data visualization using a dedicated plotting package, like ggplot2. Start by making sure you have the ggplot2 package installed.

To avoid readability issues that can occur with overlapping data points, all the plots in this section have slightly transparent points (alpha < 1) with a small amount of horizontal offset (jitter). Remove the jitter if you need each data point to be plotted at its exact x-y coordinate. See this article on Doing Better Data Visualization for more discussion (Advances in Methods and Practices in Psychological Science, 2021).


3.1 Scatter plots for quantitative variables

Create a scatter plot with a best-fit regression line and standard error to visualize the relationship between a quantitative predictor (X) and outcome (Y) variable.

Life Satisfaction X Long Time Without Phone

library(ggplot2)                                              # contains plotting functions
ggplot(LifeSat.df, aes(x = PhoneWO, y = LSmean.Y)) +          # data frame, aesthetic
  geom_point(position = position_jitter(0.1), na.rm = TRUE,   # plot points
             size = 3, color = "darkblue", alpha = 0.5) +     # size, color, and transparency of points
  geom_smooth(method = "lm", formula = "y ~ x", se = TRUE, na.rm = TRUE) +  # plot line and se
  scale_x_continuous(name = "Long Time Without Phone") +      # x axis label
  scale_y_continuous(name = "Life Satisfaction") +            # y axis label
  theme_minimal() +                                           # visual format
  theme(text = element_text(size = 16))                       # text size


Life Satisfaction X Social Media Use

library(ggplot2)
ggplot(LifeSat.df, aes(x = SocialMediaRev, y = LSmean.Y)) +
  geom_point(position = position_jitter(0.1), na.rm = TRUE,
             size = 3, color = "darkblue", alpha = 0.5) +
  geom_smooth(method = "lm", formula = "y ~ x", se = TRUE, na.rm = TRUE) +
  scale_x_continuous(name = "Social Media Use") +
  scale_y_continuous(name = "Life Satisfaction") +
  theme_minimal() +
  theme(text = element_text(size = 16))


Life Satisfaction X Get Enough Sleep

library(ggplot2)
ggplot(LifeSat.df, aes(x = SleepEn, y = LSmean.Y)) +
  geom_point(position = position_jitter(0.1), na.rm = TRUE,
             size = 3, color = "darkblue", alpha = 0.5) +
  geom_smooth(method = "lm", formula = "y ~ x", se = TRUE, na.rm = TRUE) +
  scale_x_continuous(name = "Get Enough Sleep") +
  scale_y_continuous(name = "Life Satisfaction") +
  theme_minimal() +
  theme(text = element_text(size = 16))


3.2 Scatter plot for quantitative and categorical variables

Add a color element in the aesthetic to create a scatter plot that shows the relationship between a quantitative predictor (X) and outcome (Y) variable across levels of a categorical predictor variable.

library(ggplot2)                                                  
ggplot(LifeSat.df, aes(x = SleepEn, y = LSmean.Y, color = Gender)) +    # color by category
  geom_point(position = position_jitter(0.1), na.rm = TRUE, 
             size = 3, alpha = 0.5) +   
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, na.rm = TRUE) +          
  scale_x_continuous(name = "Get Enough Sleep") +                         
  scale_y_continuous(name = "Life Satisfaction") +                        
  scale_color_manual(values = c("skyblue4", "darkorange3"),             # group colors
                     limits = c("Female", "Male"),                      # groups names
                     na.translate = FALSE) +                            # omit NAs
  theme_minimal() +                                                       
  theme(text = element_text(size = 16))                         


3.3 Strip chart for categorical predictor variables

Create a strip chart to visualize the relationship between a categorical predictor variable (Gender) and quantitative outcome variable (life satisfaction). A strip chart displays the means (diamond) and standard errors (error bars), similar to a classic bar plot, with the added benefit of showing all the raw scores in each category.

library(ggplot2)
ggplot(LifeSat.df, aes(x = Gender, y = LSmean.Y, color = Gender)) +     # color by category
  geom_jitter(na.rm = TRUE, position = position_jitter(0.1), 
              size = 3, alpha = 0.7) +
  stat_summary(na.rm = TRUE, fun = mean, geom = "point",                # add mean
               shape = 18, size = 3, color = "black") +  
  stat_summary(na.rm = TRUE, fun.data = mean_se, geom = "errorbar",     # add error bars
               width = .1, color = "black") + 
  scale_x_discrete(name = "Gender", limits = c("Female", "Male")) +     # group names
  scale_y_continuous(name = "Life Satisfaction") +                      # y axis label
  theme_minimal() +                                                   
  theme(text = element_text(size = 18), legend.position = "none") +   
  scale_color_brewer(palette="Dark2")                                   # group colors


4 Build Behavioral Models from Your Data

The pairs panel and scatter plots above show that going without a phone, getting enough sleep, and social media use were all positively related to life satisfaction (Y). In this section, we will build a regression model for all three predictor variables.

4.1 Population models

The relationship between life satisfaction (Y) and one predictor variable (X) in the population can be expressed in a linear model:

\(\displaystyle Y = \beta_0 + \beta_{1}X_{1} + \epsilon\)

Where \(\beta_0\) = the Y intercept, \(\beta_{1}\) = the regression coefficient, or slope of the best-fit relationship, and \(\epsilon\) = error. In most cases, \(\beta_{1}\) is the main value of interest; when \(\beta_{1} = 0\) there is no relationship, and the more \(\beta_{1}\) differs from 0 in the negative or positive direction, the more the Y scores change with changes in X.

Additional predictor variables can be entered into the model as \(\beta X\) pairs:

\(\displaystyle Y = \beta_0 + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{3} + \epsilon\)


4.2 Sample models

When researchers do not have access to the population parameters, estimates for the regression coefficients and Y-intercept can be derived from sample data to create a sample regression model.

\(\displaystyle \hat{Y} = b_0 + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3}\)

Where \(\hat{Y}\) is the predicted value of \(Y\), \(b_0\) is the estimated Y intercept, and \(b_{1}\), \(b_{2}\), and \(b_{3}\) are the estimated regression coefficients for each predictor variable.

Interpreting a regression coefficient: Let’s imagine we find that \(b_{1}\) = 0.5. This value indicates that for every 1 unit increase in the X score, \(X_{1}\), there is a 0.5 point increase in the predicted Y score, \(\hat{Y}\), assuming the other predictor variables, \(X_{2}\) and \(X_{3}\), remain unchanged (are held constant).
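To make the arithmetic concrete, here is a sketch with hypothetical values for the intercept, coefficients, and predictor scores:

b0 <- 1.0; b1 <- 0.5; b2 <- 0.2; b3 <- -0.3   # hypothetical estimates
X1 <- 3; X2 <- 4; X3 <- 2                     # one participant's predictor scores
Yhat <- b0 + b1*X1 + b2*X2 + b3*X3            # 1.0 + 1.5 + 0.8 - 0.6 = 2.7
Yhat                                          # increasing X1 by 1 would raise Yhat to 3.2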


4.3 Null hypothesis significance testing

Null hypothesis significance testing (NHST) is a framework, or set of rules, for using sample data to make an inference about unknown population data.

Inference: Drawing a conclusion or making a logical judgment on the basis of evidence and prior conclusions rather than on the basis of direct observation.

  1. State the null hypothesis
  2. Define the critical region for the statistical decision: \(\alpha\) = .05
  3. Collect sample data and calculate statistics
  4. Make a statistical decision: Reject or fail to reject the null hypothesis


The analysis of multiple regression produces two types of hypothesis test.

  1. The F-statistic tests an omnibus null hypothesis stating that all predictors (X) and the outcome (Y) are not related in the population, that is, all regression coefficients are equal to zero: \(H_0: \beta_{phone} = \beta_{sleep} = \beta_{media} = 0\)

\(\displaystyle F = MS_{Regression} / MS_{Residual}\)


  2. The t-statistic tests a null hypothesis stating that one predictor (X) and the outcome (Y) are not related in the population, that is, the regression coefficient is equal to zero:
    \(H_0: \beta_{phone} = 0\)
    \(H_0: \beta_{sleep} = 0\)
    \(H_0: \beta_{media} = 0\)

\(\displaystyle t = b_1 / SE_{b_1}\)


Make a statistical decision

\(\text{If } p \le \text{.05, reject the null hypothesis, otherwise, fail to reject the null}\)
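Expressed as code (a sketch with an assumed p-value):

p <- .0276                                     # p-value from a hypothetical test
ifelse(p <= .05, "Reject the null hypothesis",
                 "Fail to reject the null hypothesis")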


Remember that p is the probability of obtaining a statistic value at least as extreme as the one observed if the null hypothesis is true.

4.4 Multiple regression analysis

Use the lm() function to conduct an analysis of multiple regression on life satisfaction scores (Y) with three predictor variables, going without a phone, getting enough sleep, and social media use:

model.ls <- lm(formula = LSmean.Y ~ SleepEn + PhoneWO + SocialMediaRev, data = LifeSat.df)
summary(model.ls)
## 
## Call:
## lm(formula = LSmean.Y ~ SleepEn + PhoneWO + SocialMediaRev, data = LifeSat.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3738 -1.0873  0.1369  0.6909  2.7858 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      0.2869     0.8634   0.332  0.74096   
## SleepEn          0.5136     0.1628   3.155  0.00261 **
## PhoneWO          0.3879     0.1507   2.574  0.01279 * 
## SocialMediaRev   0.3595     0.1588   2.263  0.02758 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.221 on 55 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.3287, Adjusted R-squared:  0.2921 
## F-statistic: 8.977 on 3 and 55 DF,  p-value: 6.187e-05
# show the analysis of regression source table
# library(car)
# Anova(model.ls)


The F-test on the omnibus null hypothesis shows the model significantly predicts life satisfaction scores, \(F(3,55) = 8.977\), \(p = .00006187\). Also shown is the residual standard error = 1.221 (standard error of estimate) and the adjusted R squared = 0.2921 (coefficient of determination adjusted for the number of predictors).

The null hypothesis test for each regression coefficient is also shown. The slopes for the predictors, \(b_{phone} = 0.3879\), \(b_{sleep} = 0.5136\), and \(b_{media} = 0.3595\), are shown in the “Estimate” column. Each regression coefficient divided by its standard error (“Std. Error” column) gives the t-test value (“t value” column). The p-value for the t-test on each coefficient is shown in the “Pr(>|t|)” column. One or more asterisks at the end of a row indicates statistical significance.
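As a quick check of the t = b / SE relationship, divide the SleepEn estimate by its standard error (values taken from the summary table above):

0.5136 / 0.1628   # = 3.155, matching the "t value" column for SleepEn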

4.5 Model assumptions and considerations

Linearity and homogeneity of variance: Plot the residuals to identify possible deviations from the assumptions about linearity and homogeneity of variance. Use this brief guide on assessing residual plots: Link

list.residual <- resid(model.ls)        # get list of residuals
plot(fitted(model.ls), list.residual)   # produce residual plot
abline(0, 0)                            # add horizontal line at Y = 0



Normality: Create a quantile-quantile (QQ) plot to identify possible deviations from the normality assumption. Use this brief guide on assessing QQ plots: Link

list.stdres <- rstandard(model.ls)   # compute the standardized residuals
qqnorm(list.stdres,                  # plot residuals
       main = "QQ Plot")             # plot title
qqline(list.stdres)                  # add QQ line



Multicollinearity is the degree of intercorrelation between predictor variables in a multiple regression model. Some relatedness between predictors is expected, but high intercorrelation indicates that one or more predictor variables is adding redundant information to the model. When redundant predictors are identified, factors may be removed, combined, or centered as a way to remedy the problem.

Check for redundant predictor variables in the model using variance inflation factors (VIF). A VIF = 1 indicates that a factor is uncorrelated with the other factors; higher values indicate greater correlation (VIF > 2 may indicate a problem). Use this brief guide on assessing variance inflation factors: Link

library(car)
vif(model.ls)
##        SleepEn        PhoneWO SocialMediaRev 
##       1.113098       1.171663       1.219343
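To see where these values come from, the VIF for one predictor is 1 / (1 - R²) from regressing that predictor on the remaining predictors. A sketch for SleepEn (small discrepancies are possible if the two models drop different missing cases):

# r2 <- summary(lm(SleepEn ~ PhoneWO + SocialMediaRev, data = LifeSat.df))$r.squared
# 1 / (1 - r2)   # should be close to the SleepEn value from vif(model.ls)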


4.6 Communicate your results

Here’s an example of how the multiple regression results could be reported in an APA-style report.

An analysis of multiple linear regression showed that going without a phone, getting enough sleep, and social media use predicted a significant amount of variance in life satisfaction, \(F(3,55) = 8.977\), \(p < .001\), \(R_{adj.}^2 = 0.2921\). Participants reported significantly greater life satisfaction as going without a phone increased, \(b = 0.388\), \(SE = 0.151\), \(p = .013\), as sleeping enough increased, \(b = 0.514\), \(SE = 0.163\), \(p = .003\), and as social media use increased, \(b = 0.360\), \(SE = 0.159\), \(p = .028\).



5 Build Models with Moderating or Mediating Effects

Unlike data collected from experimental research designs, the correlational data collected in this survey project do not allow us to draw causal inferences about the unknown population. However, we study these kinds of relationships because we assume the variables really are causally connected in a specific temporal order. So, we build “explanatory” statistical models with correlational data based on our assumptions about the causal connections, not because of how the data were collected. When we test null hypotheses about a model and identify significant relationships, it is important to remember that the results are still correlational and should not be used to draw causal conclusions.


5.1 Moderating effects

It is possible that the relationship between a predictor (X1) and outcome (Y) variable is affected by one or more moderating variables (X2). For example, comfort without a phone (X2) might moderate the degree to which social media use (X1) predicts life satisfaction (Y).


A simple moderation model


5.1.1 Interaction tests in the regression model

One or more tests for interaction effects can be added to a multiple regression model to identify moderating variables. Interaction terms can be added using the * or : symbol between two or more predictors (* enters the individual predictors along with their interactions, while : enters only the interaction term). For example,

lm(formula = LSmean.Y ~ SleepEn*PhoneWO*SocialMediaRev, data = LifeSat.df)
lm(formula = LSmean.Y ~ SleepEn:PhoneWO:SocialMediaRev, data = LifeSat.df)

Multicollinearity may be high when raw scores are entered into a model like this because the interaction terms will tend to be highly correlated with one or more predictor variables.

Mean centered predictor scores will result in far lower intercorrelations (see Iacobucci et al., 2016). Transforming raw scores to mean centered scores is a simple process, just subtract the variable mean from each raw score, \(X - \bar{X}\).
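A minimal sketch of that formula for one variable, equivalent to the scale(x, scale = FALSE) call used below:

# mcPhoneWO.check <- LifeSat.df$PhoneWO - mean(LifeSat.df$PhoneWO, na.rm = TRUE)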

Use the mutate() function in the dplyr package and the scale() function to create a new mean centered predictor variable from each of the original predictor variables.

library(dplyr)
LifeSat.df <- LifeSat.df %>%
  mutate(mcPhoneWO = scale(PhoneWO, scale = FALSE)) %>%
  mutate(mcSleepEn = scale(SleepEn, scale = FALSE)) %>%
  mutate(mcSocMedRev = scale(SocialMediaRev, scale = FALSE))


Build a model that includes one or more interaction terms of interest using the new mean centered predictor variables. Note that each additional test entered into an analysis of multiple regression adds complexity and costs degrees of freedom, so in general, simpler models should be preferred over more complicated models, unless there is meaningful information gained from the additional test(s).

In this example, we’ll use the lm() function to model the individual effects on LSmean.Y of mcSleepEn, mcPhoneWO, and mcSocMedRev, as well as one interaction term for mcPhoneWO x mcSocMedRev.

model.mod <- lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO*mcSocMedRev, data = LifeSat.df)
summary(model.mod)
## 
## Call:
## lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO * mcSocMedRev, 
##     data = LifeSat.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.06043 -0.99322  0.08039  0.63128  2.73957 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.2190     0.1597  26.420  < 2e-16 ***
## mcSleepEn               0.4622     0.1579   2.927  0.00500 ** 
## mcPhoneWO               0.3941     0.1448   2.721  0.00874 ** 
## mcSocMedRev             0.2521     0.1592   1.583  0.11923    
## mcPhoneWO:mcSocMedRev   0.2654     0.1123   2.364  0.02171 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.173 on 54 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.3917, Adjusted R-squared:  0.3466 
## F-statistic: 8.691 on 4 and 54 DF,  p-value: 1.72e-05


Check for redundant predictor variables in the model using variance inflation factors (VIF).

library(car)
vif(model.mod)
##             mcSleepEn             mcPhoneWO           mcSocMedRev 
##              1.134577              1.172044              1.327503 
## mcPhoneWO:mcSocMedRev 
##              1.156021


5.1.2 Breaking down an interaction effect

Compared to the regression model without any interaction terms (model.ls), model.mod shows that social media use is no longer a significant predictor of life satisfaction, but its interaction with going without a phone is significant. When a moderation effect like this is detected, describing the effect requires “breaking down the interaction,” which involves plotting and testing the relationship between social media use (X1) and life satisfaction (Y) at several levels, or categories, of going without a phone (X2).

There are a few ways we could categorize the mcPhoneWO data. In some cases, simply splitting the scores into low and high categories is sufficient, but in other cases, creating three or more categories may provide more insight into the interaction effect.

  1. Use the median to split one predictor variable into Low and High categories.
library(dplyr)
median <- median(LifeSat.df$mcPhoneWO, na.rm = TRUE) # calculate and save the median value
LifeSat.df <- LifeSat.df %>%
     mutate(mcPhoneWOmedian = case_when(             # create a new variable, mcPhoneWOmedian
            mcPhoneWO <= median ~ "Low",             # label participants "Low" if their score is at or below the median
            mcPhoneWO > median ~ "High"))            # label participants "High" if their score is above the median


Use describeBy() in the psych package to create a descriptive summary table for life satisfaction scores across the Low and High categories.

library(psych)
summary.mcPhoneWOmedian <- describeBy(
  LSmean.Y ~ mcPhoneWOmedian,
  data = LifeSat.df,   
  skew = FALSE, 
  mat = TRUE, 
  digits = 2) 
summary.mcPhoneWOmedian  
##           item group1 vars  n mean   sd min max range   se
## LSmean.Y1    1   High    1 28 4.67 1.50 1.5 7.0   5.5 0.28
## LSmean.Y2    2    Low    1 33 3.71 1.29 1.0 5.8   4.8 0.22


Use ggplot to plot the relationship between social media use (X1) and life satisfaction (Y) across Low and High levels of being without a phone (X2).

library(ggplot2)                                                  
ggplot(LifeSat.df, aes(x = mcSocMedRev, y = LSmean.Y, color = mcPhoneWOmedian)) +
  geom_point(position = position_jitter(0.1), na.rm = TRUE, 
             size = 3, alpha = 0.5) +   
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, na.rm = TRUE) +          
  scale_x_continuous(name = "Social Media (mean centered)") +                         
  scale_y_continuous(name = "Life Satisfaction") +                        
  scale_color_manual(name = "Phone W/O \n(mean centered)", 
                     values = c("skyblue4", "darkorange3"),
                     limits = c("Low", "High"),                 # select levels, and set legend order
                     na.translate = FALSE) + 
  theme_minimal() +                                                       
  theme(text = element_text(size = 16))                         


  2. Split one predictor variable into three or more categories using its standard deviation.
library(dplyr)
sd <- sd(LifeSat.df$mcPhoneWO, na.rm = TRUE)                           # calculate and save the sd value
LifeSat.df <- LifeSat.df %>%
     mutate(mcPhoneWOsd = case_when(                                   # create a new variable
            (mcPhoneWO <= sd*-.5) ~ "Low",                             # "Low" if less/equal to -.5 sd
            (mcPhoneWO > sd*-.5) & (mcPhoneWO < sd*.5) ~ "Moderate",   # "Moderate" if between -.5 and +.5 sd
            (mcPhoneWO >= sd*.5) ~ "High"))                            # "High" if greater/equal to .5 sd


Use describeBy() in the psych package to create a descriptive summary table for life satisfaction scores across the Low, Moderate, and High categories.

library(psych)
summary.mcPhoneWOsd <- describeBy(
  LSmean.Y ~ mcPhoneWOsd,
  data = LifeSat.df,   
  skew = FALSE, 
  mat = TRUE, 
  digits = 2) 
summary.mcPhoneWOsd  
##           item   group1 vars  n mean   sd min max range   se
## LSmean.Y1    1     High    1 28 4.67 1.50 1.5 7.0   5.5 0.28
## LSmean.Y2    2      Low    1 20 3.67 1.37 1.0 5.8   4.8 0.31
## LSmean.Y3    3 Moderate    1 13 3.77 1.21 1.8 5.6   3.8 0.34


Use ggplot to plot the relationship between social media use (X1) and life satisfaction (Y) across the three SD levels of being without a phone (X2).

library(ggplot2)                                                  
ggplot(LifeSat.df, aes(x = mcSocMedRev, y = LSmean.Y, color = mcPhoneWOsd)) + 
  geom_point(position = position_jitter(0.1), na.rm = TRUE, 
             size = 3, alpha = 0.5) +   
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, na.rm = TRUE) +          
  scale_x_continuous(name = "Social Media (mean centered)") +                         
  scale_y_continuous(name = "Life Satisfaction") +                        
  scale_color_brewer(name = "Phone W/O \n(mean centered)", 
                     palette="Dark2",
                     limits = c("Low", "Moderate", "High"),      # select levels, and set legend order
                     na.translate = FALSE) + 
  theme_minimal() +                                                       
  theme(text = element_text(size = 16))                         


5.1.3 Simple effects test

Now that we can see how the relationship between social media use and life satisfaction changes across levels of being without a phone, we can test whether each slope differs significantly from 0. From the descriptive summary tables, you can see that sample sizes can get quite low (n = 13) the more scores are split up, so we will use the median split data in the analysis of simple effects.

model.mod.slopes <- lm(formula = LSmean.Y ~ mcPhoneWOmedian:mcSocMedRev, data = LifeSat.df)
summary(model.mod.slopes)
## 
## Call:
## lm(formula = LSmean.Y ~ mcPhoneWOmedian:mcSocMedRev, data = LifeSat.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2126 -0.9558 -0.0126  1.1874  2.2573 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      4.23450    0.18395  23.020  < 2e-16 ***
## mcPhoneWOmedianHigh:mcSocMedRev  0.64658    0.21498   3.008  0.00391 ** 
## mcPhoneWOmedianLow:mcSocMedRev  -0.07031    0.25873  -0.272  0.78679    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.388 on 57 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.1375, Adjusted R-squared:  0.1072 
## F-statistic: 4.543 on 2 and 57 DF,  p-value: 0.01477



5.1.4 Communicate your results

Here’s an example of how these multiple regression results could be reported in an APA-style report.

An analysis of linear regression showed that mean centered scores on going without a phone, getting enough sleep, social media use, and an interaction term between going without a phone and social media use predicted a significant amount of variance in life satisfaction, \(F(4,54) = 8.691\), \(p < .001\), \(R_{adj.}^2 = 0.3466\). Participants reported significantly greater life satisfaction as sleeping enough increased, \(b = 0.462\), \(SE = 0.158\), \(p = .005\). A follow-up regression analysis and interaction plot of life satisfaction by social media use, across participants below (Low) and above (High) the median score for going without a phone, showed that agreement with the statement “Social media is part of your everyday activity” predicts greater life satisfaction, but only among those participants who also reported higher agreement with the statement “I can go a long time without looking at my phone,” \(b = 0.647\), \(SE = 0.215\), \(p = .004\). This finding suggests that being active on social media may be part of a healthy lifestyle for those people who are less susceptible to feeling dependent on their phone.


A note about causal language: It’s common to see researchers use terms like “effect” and “impact” when describing a relationship among correlational variables, e.g., “The impact of social media use on life satisfaction depends on how a person feels about being away from their phone.” The use of these terms is not meant to suggest that a causal connection can actually be drawn, rather, these terms come from a common vocabulary that researchers use to describe relationships among variables that are believed to be causally connected.



5.2 Mediating Effects

It is possible that a predictor variable (X1) has a direct influence on an outcome variable (Y), and/or an indirect influence through a mediating variable (X2). For example, daily social media use (X1) may influence life satisfaction (Y) directly (path c), but it might also influence sleeping enough (X2) (path a), which in turn influences life satisfaction (path b).


Paths in a simple mediation model


A note about mediation: The following example provides a very simple guide for applying a mediation analysis to survey data, but there is much about this topic that is not discussed here because it goes well beyond what’s covered in a typical undergraduate course on behavioral statistics or research methods. If you’re looking for a more technical discussion about conducting mediation analyses in R, this tutorial on the lavaan package is a good place to start.

A note about exploring data: The following mediation example is suitable in situations where past research or sound reasoning has led you to predict a specific mediation effect. If it is used for post-hoc exploratory analyses, be sure to inform your reader.

A note about scores: In the last section, we said that moderation models benefit from using mean centered scores, rather than raw scores, because it reduces potential problems that arise from multicollinearity. A model of moderated mediation would benefit from mean centered scores for the same reason. However, the problem does not apply to a simple mediation model, so we will use raw scores in the following example.

5.2.1 Model, test, and map a mediation effect

Mediation effects can be identified and tested using a two-part model.

  • LSmean.Y ~ SocialMediaRev + SleepEn: This first model looks like others we’ve built so far, where life satisfaction (Y) is regressed on two individual predictor variables, daily social media use (X1) and getting enough sleep (X2, mediator). This part of the model tests for relationships b and c in the mediation model above.
  • SleepEn ~ SocialMediaRev: This second model tests relationship a, where getting enough sleep (X2, mediator) is regressed on daily social media use (X1).

1. Save the two models in a new variable.

mediation.model <- "
  LSmean.Y ~ SocialMediaRev + SleepEn
  SleepEn ~ SocialMediaRev
"


2. Use the sem() function in the lavaan package to test the model.

library(lavaan)
mediation.fit <- sem(mediation.model, data = LifeSat.df)
summary(mediation.fit)
## lavaan 0.6-12 ended normally after 1 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##                                                   Used       Total
##   Number of observations                            60          62
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   LSmean.Y ~                                          
##     SocialMediaRev    0.245    0.151    1.628    0.104
##     SleepEn           0.638    0.162    3.935    0.000
##   SleepEn ~                                           
##     SocialMediaRev    0.243    0.116    2.100    0.036
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .LSmean.Y          1.586    0.290    5.477    0.000
##    .SleepEn           1.005    0.183    5.477    0.000


3. Map the paths (relationships) in the model.

library(semPlot)
semPaths(mediation.fit, "par",
             sizeMan = 15, sizeInt = 15, sizeLat = 15,
             edge.label.cex=1.5,
             fade=FALSE)
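
Optionally, lavaan’s model syntax lets you label the paths and define the indirect effect (a × b) with the := operator, so sem() will also estimate and test the indirect path. A sketch:

mediation.model.ab <- "
  LSmean.Y ~ c*SocialMediaRev + b*SleepEn
  SleepEn ~ a*SocialMediaRev
  indirect := a*b
"
# mediation.fit.ab <- sem(mediation.model.ab, data = LifeSat.df)
# summary(mediation.fit.ab)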



5.2.2 Communicate Your Results

Here’s an example of how these path analysis results could be reported in an APA-style report.

We used a path analysis to test the direct effects of social media use on life satisfaction as well as its indirect effects through a mediating variable, getting enough sleep. The results showed that social media use did not have a direct influence on life satisfaction, however, higher social media use was a significant predictor of getting enough sleep, \(b = 0.243\), \(SE = 0.116\), \(p = .036\), and getting enough sleep was a significant predictor of higher life satisfaction, \(b = 0.638\), \(SE = 0.162\), \(p < .001\).



6 Model Comparisons

6.1 Regression models built with the lm() function

As we have seen in the sections above, the summary output of a linear regression analysis (lm) includes inferential statistical tests (t and F) and descriptive statistics, such as adjusted R-squared and residual standard error.



Regression results for LSmean.Y ~ SleepEn + PhoneWO + SocMedRev


Roughly speaking, adjusted R-squared (\(R_{adj}^2\)) indicates how well your model explains (fits) the observed data when adjusted for the number of predictors in the model, whereas residual standard error (RSE) indicates how well the model predicts the outcome variable when adjusted for the number of predictors in the model. When comparing two or more models of an outcome variable, it’s not uncommon to see model selection based on the highest \(R_{adj}^2\) or lowest RSE; however, these values tend to favor models with too many predictor variables (bias) because they under-penalize each predictor added to the model.

The Bayesian Information Criterion (BIC) is a model selection tool that uses the obtained data to estimate how well the model would perform on a new data set. It applies an increasingly large total penalty as predictors are added to a model, so it yields a less biased estimate of how the model would fit unseen data.

1. Create multiple models of your outcome variable and calculate the BIC: The first two models below replicate two of our previous linear models of life satisfaction, one without an interaction between the predictors and one with a two-way interaction. The third model is new and has a three-way interaction between predictors.

With each model, the glance() function in the broom package is used to calculate the model’s BIC.

Model 1

model.1 <- lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO + mcSocMedRev, data = LifeSat.df)
summary(model.1)
## 
## Call:
## lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO + mcSocMedRev, 
##     data = LifeSat.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3738 -1.0873  0.1369  0.6909  2.7858 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.1094     0.1591  25.835  < 2e-16 ***
## mcSleepEn     0.5136     0.1628   3.155  0.00261 ** 
## mcPhoneWO     0.3879     0.1507   2.574  0.01279 *  
## mcSocMedRev   0.3595     0.1588   2.263  0.02758 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.221 on 55 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.3287, Adjusted R-squared:  0.2921 
## F-statistic: 8.977 on 3 and 55 DF,  p-value: 6.187e-05
library(broom)
glance(model.1)
## # A tibble: 1 × 12
##   r.squ…¹ adj.r…² sigma stati…³ p.value    df logLik   AIC   BIC devia…⁴ df.re…⁵
##     <dbl>   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>   <int>
## 1   0.329   0.292  1.22    8.98 6.19e-5     3  -93.4  197.  207.    82.0      55
## # … with 1 more variable: nobs <int>, and abbreviated variable names
## #   ¹​r.squared, ²​adj.r.squared, ³​statistic, ⁴​deviance, ⁵​df.residual
## # ℹ Use `colnames()` to see all variable names


Model 2

model.2 <- lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO*mcSocMedRev, data = LifeSat.df)
summary(model.2)
## 
## Call:
## lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO * mcSocMedRev, 
##     data = LifeSat.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.06043 -0.99322  0.08039  0.63128  2.73957 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.2190     0.1597  26.420  < 2e-16 ***
## mcSleepEn               0.4622     0.1579   2.927  0.00500 ** 
## mcPhoneWO               0.3941     0.1448   2.721  0.00874 ** 
## mcSocMedRev             0.2521     0.1592   1.583  0.11923    
## mcPhoneWO:mcSocMedRev   0.2654     0.1123   2.364  0.02171 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.173 on 54 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.3917, Adjusted R-squared:  0.3466 
## F-statistic: 8.691 on 4 and 54 DF,  p-value: 1.72e-05
library(broom)
glance(model.2)
## # A tibble: 1 × 12
##   r.squ…¹ adj.r…² sigma stati…³ p.value    df logLik   AIC   BIC devia…⁴ df.re…⁵
##     <dbl>   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>   <int>
## 1   0.392   0.347  1.17    8.69 1.72e-5     4  -90.5  193.  206.    74.4      54
## # … with 1 more variable: nobs <int>, and abbreviated variable names
## #   ¹​r.squared, ²​adj.r.squared, ³​statistic, ⁴​deviance, ⁵​df.residual
## # ℹ Use `colnames()` to see all variable names


Model 3

model.3 <- lm(formula = LSmean.Y ~ mcSleepEn*mcPhoneWO*mcSocMedRev, data = LifeSat.df)
summary(model.3)
## 
## Call:
## lm(formula = LSmean.Y ~ mcSleepEn * mcPhoneWO * mcSocMedRev, 
##     data = LifeSat.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0545 -0.9749  0.1459  0.7049  2.7455 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.208014   0.174425  24.125   <2e-16 ***
## mcSleepEn                       0.462880   0.173856   2.662   0.0104 *  
## mcPhoneWO                       0.375234   0.164898   2.276   0.0271 *  
## mcSocMedRev                     0.262791   0.171552   1.532   0.1317    
## mcSleepEn:mcPhoneWO             0.002235   0.165974   0.013   0.9893    
## mcSleepEn:mcSocMedRev           0.054906   0.174936   0.314   0.7549    
## mcPhoneWO:mcSocMedRev           0.292061   0.147520   1.980   0.0531 .  
## mcSleepEn:mcPhoneWO:mcSocMedRev 0.021219   0.136057   0.156   0.8767    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.206 on 51 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.3935, Adjusted R-squared:  0.3102 
## F-statistic: 4.726 on 7 and 51 DF,  p-value: 0.0003805
library(broom)
glance(model.3)
## # A tibble: 1 × 12
##   r.squ…¹ adj.r…² sigma stati…³ p.value    df logLik   AIC   BIC devia…⁴ df.re…⁵
##     <dbl>   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>   <int>
## 1   0.393   0.310  1.21    4.73 3.81e-4     7  -90.5  199.  218.    74.1      51
## # … with 1 more variable: nobs <int>, and abbreviated variable names
## #   ¹​r.squared, ²​adj.r.squared, ³​statistic, ⁴​deviance, ⁵​df.residual
## # ℹ Use `colnames()` to see all variable names


2. Evaluate the models: Give preference to simpler models and lower BIC scores.
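Base R’s BIC() function can also collect these values into one table for a side-by-side comparison (a sketch; the values should match those reported by glance()):

BIC(model.1, model.2, model.3)   # returns df and BIC for each model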


6.2 Path models built with the sem() function in the lavaan package

1. Create multiple models of your outcome variable and calculate the BIC: The first model below is identical to our previous mediation model. The predictor and mediator have been switched in the second mediation model.

With each model, the fitMeasures() function in the lavaan package is used to calculate the model’s BIC.

Model 1

medMod.1 <- "
  LSmean.Y ~ SocialMediaRev + SleepEn
  SleepEn ~ SocialMediaRev
"


library(lavaan)
mediation.1 <- sem(medMod.1, data = LifeSat.df)
summary(mediation.1)
## lavaan 0.6-12 ended normally after 1 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##                                                   Used       Total
##   Number of observations                            60          62
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   LSmean.Y ~                                          
##     SocialMediaRev    0.245    0.151    1.628    0.104
##     SleepEn           0.638    0.162    3.935    0.000
##   SleepEn ~                                           
##     SocialMediaRev    0.243    0.116    2.100    0.036
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .LSmean.Y          1.586    0.290    5.477    0.000
##    .SleepEn           1.005    0.183    5.477    0.000
fitMeasures(mediation.1, "bic")                # calculate BIC
##     bic 
## 389.007


Model 2

medMod.2 <- "
  LSmean.Y ~ SocialMediaRev + SleepEn
  SocialMediaRev ~ SleepEn
"


library(lavaan)
mediation.2 <- sem(medMod.2, data = LifeSat.df)
summary(mediation.2)
## lavaan 0.6-12 ended normally after 1 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##                                                   Used       Total
##   Number of observations                            60          62
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   LSmean.Y ~                                          
##     SocialMediaRev    0.245    0.151    1.628    0.104
##     SleepEn           0.638    0.162    3.935    0.000
##   SocialMediaRev ~                                    
##     SleepEn           0.282    0.134    2.100    0.036
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .LSmean.Y          1.586    0.290    5.477    0.000
##    .SocialMediaRev    1.164    0.213    5.477    0.000
fitMeasures(mediation.2, "bic")                # calculate BIC
##     bic 
## 397.826


2. Evaluate the models: Give preference to simpler models and lower BIC scores.



7 Appendix A: Survey items


Gender affiliation

  1. Female
  2. Male
  3. Non-binary
  4. Other

I feel I get enough sleep

  1. Strongly Disagree
  2. Disagree
  3. Neither Agree nor Disagree
  4. Agree
  5. Strongly Agree

I can go a long time without looking at my phone

  1. Strongly Disagree
  2. Disagree
  3. Neither Agree nor Disagree
  4. Agree
  5. Strongly Agree

Social media is part of your everyday activity [note the section on reverse scoring]

  1. Strongly Agree
  2. Agree
  3. Neutral
  4. Disagree
  5. Strongly Disagree

Life Satisfaction Items 1-5

In most ways my life is close to my ideal. [1]
The conditions of my life are excellent. [2]
I am satisfied with my life. [3]
So far I have gotten the important things I want in life. [4]
If I could live my life over, I would change almost nothing. [5]

  1. Strongly Disagree
  2. Disagree
  3. Slightly Disagree
  4. Neither Agree nor Disagree
  5. Slightly Agree
  6. Agree
  7. Strongly Agree


8 Appendix B: Subsetting a data frame

In cases where you’re working with a data frame that has variables you don’t intend to use, it may be helpful to create a new data set with a subset of the variables.

Begin by getting the column names and numbers (index):

data.frame(colnames(LifeSat.df))
##    colnames.LifeSat.df.
## 1                     P
## 2                Gender
## 3                   LS1
## 4                   LS2
## 5                   LS3
## 6                   LS4
## 7                   LS5
## 8               PhoneWO
## 9               SleepEn
## 10          SocialMedia
## 11             LSmean.Y
## 12       SocialMediaRev
## 13            mcPhoneWO
## 14            mcSleepEn
## 15          mcSocMedRev
## 16      mcPhoneWOmedian
## 17          mcPhoneWOsd


Subset the data by saving the variables you will use into a new data frame. Using the code below, list the variables you want to keep in c(). List individual columns separated by commas, or list a range of columns using a colon. In this example, one predictor variable, SleepEn, and the outcome variable scores on life satisfaction (LS1 - LS5) are kept.

LifeSatShort.df <- LifeSat.df[c(3:7,9)]  # list variables to keep
head(LifeSatShort.df, 5)
##   LS1 LS2 LS3 LS4 LS5 SleepEn
## 1   6   6   6   5   6       4
## 2   2   1   2   2   2       1
## 3   7   7   7   7   7       4
## 4   7   5   7   6   7      NA
## 5   6   6   6   6   6       4
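Alternatively (a sketch), dplyr’s select() subsets by variable name, which is more robust than column numbers if the data frame’s columns change:

# library(dplyr)
# LifeSatShort.df <- select(LifeSat.df, LS1:LS5, SleepEn)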


9 Appendix C: Rename variables (columns) in the data frame

The following commands use the dplyr package to rename specific variables.

library(dplyr)
LifeSatShort.df <- rename(LifeSatShort.df,      # data frame
                     EnoughSleep = SleepEn)     # new variable name = existing variable name
head(LifeSatShort.df, 5)
##   LS1 LS2 LS3 LS4 LS5 EnoughSleep
## 1   6   6   6   5   6           4
## 2   2   1   2   2   2           1
## 3   7   7   7   7   7           4
## 4   7   5   7   6   7          NA
## 5   6   6   6   6   6           4