The scores in the LifeSatis.csv data file below come from a correlational research study conducted by students in a behavioral research methods course. Participants were introductory psychology students (n = 62) who volunteered to complete an online survey.
Outcome and Predictor Variables: The researchers were interested in developing a model of life satisfaction (outcome variable) based on participants’ sleep, attachment to their phone, and social media use (predictor variables). See Appendix A for the full text of each survey item.
Hypotheses: Higher life satisfaction was expected to be associated with getting enough sleep, being comfortable without a phone, and having less engagement with social media.
After data collection for this study had ended, the participant scores were exported from Qualtrics to a comma-separated text file (.csv). Whether the data are saved on your computer, on a shared drive, or on a server, use the read.csv() function to copy the data and save the values in a new data frame.
In this command, a new data frame named LifeSat.df is created from the data in the LifeSatis.csv file located on a server.
LifeSat.df <- read.csv("https://andrewebrandt.github.io/object/datasets/LifeSatis.csv", header = TRUE)
head(LifeSat.df, 10)
## P Gender LS1 LS2 LS3 LS4 LS5 PhoneWO SleepEn SocialMedia
## 1 1 2 6 6 6 5 6 4 4 2
## 2 2 2 2 1 2 2 2 2 1 3
## 3 3 1 7 7 7 7 7 5 4 1
## 4 4 1 7 5 7 6 7 5 NA 2
## 5 5 2 6 6 6 6 6 4 4 2
## 6 6 1 5 4 6 4 2 5 2 4
## 7 7 1 3 4 2 2 2 3 2 2
## 8 8 1 5 5 5 5 3 2 2 2
## 9 9 1 5 6 5 6 6 2 4 1
## 10 10 2 7 7 7 7 6 4 2 1
Use the same read.csv() function to import .csv data saved on your computer; just be sure to replace the URL with the correct file path on your machine. Alternatively, use the “Import Dataset” option in the Environment panel in RStudio (use “From text (base)” for .csv files).
Use the write.csv() function to save a data frame to your computer as a .csv file; just be sure to change the file path to a location on your machine.
# write.csv(LifeSat.df,"D:/My Drive/RStudio Working Directory/LifeSat.csv", row.names = FALSE)
Look at the first 10 rows of the LifeSat.df data frame above to see that the data are in a simple “2D” format, with one row per participant and one column per variable.
Next, carefully inspect the new data frame in the Environment panel in RStudio. Be sure you can identify all the variables in the data frame.
In some cases, you may want to create a smaller data frame with only a subset of the variables (see Appendix B) or rename one or more variables to make them easier to recognize (see Appendix C).
The LifeSat data set is relatively small and simple compared to many data sets you will encounter as a researcher. Still, a small number of changes are needed to transform the raw data into a tidy, usable form, a process called data wrangling. It is very important that you complete this process before you attempt to explore, visualize, or analyze the data.
Depending on the data set, data wrangling may be simple and fast, or it may present many challenges and require a lot of your time. This tutorial is designed to cover a small number of basic wrangling processes, but there are many more functions in the Tidy Data family that you will likely need to learn as you develop your analysis skills. Tidy Data Cheatsheet
For any psychological scale with multiple items, like the life satisfaction scale (LS1 through LS5), you will need to apply scoring rules to generate a full scale score or subscale scores. For most small scales, you can simply find the mean score of all the items.
Use the dplyr package to mutate()
a new
variable for the mean life satisfaction score (average of LS1
through LS5):
library(dplyr)
LifeSat.df <- LifeSat.df %>%
  mutate(LSmean.Y = rowMeans(select(LifeSat.df, LS1:LS5), na.rm = TRUE))
head(LifeSat.df$LSmean.Y)
## [1] 5.8 1.8 7.0 6.4 6.0 4.2
Note, the na.rm = TRUE argument tells the
rowMeans()
function to remove NA (empty) cells from the
mean calculation on each row of LS scores. In other words, the command
will calculate a life satisfaction mean for each participant even if a
participant skipped some LS items. If you wish to find the mean life
satisfaction score only for those participants who completed all five LS
items, change TRUE to FALSE.
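For example, here is a minimal sketch of the stricter version (LSmean.strict is a hypothetical variable name used only for illustration):
library(dplyr)
# Hypothetical new variable: participants who skipped any LS item get NA instead of a mean
LifeSat.df <- LifeSat.df %>%
  mutate(LSmean.strict = rowMeans(select(LifeSat.df, LS1:LS5), na.rm = FALSE))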
Some psychological scales are designed with a mix of forward and reverse scored items, and one set will need to be reverse scored before being combined with the others. In other cases, a researcher may inadvertently scale an individual item in the opposite direction from the other items; in that case, they may choose to reverse score the item prior to any analysis to reduce the chance of confusion (just be sure to reverse the scale text as well).
Review the full survey items in Appendix A and notice that all but one of the Likert items were scaled in the same direction, with higher scores indicating stronger agreement with the statement (Strongly Agree = 5). In contrast, higher scores on the social media use item indicated stronger disagreement with the statement (Strongly Disagree = 5).
Use mutate()
to create a new variable for reversed
social media scores.
library(dplyr)
LifeSat.df <-                                      # update existing data frame
  mutate(LifeSat.df, SocialMediaRev =              # create a new variable, SocialMediaRev
    case_when(SocialMedia == 1 ~ 5,                # old score ~ new score
              SocialMedia == 2 ~ 4,
              SocialMedia == 3 ~ 3,
              SocialMedia == 4 ~ 2,
              SocialMedia == 5 ~ 1))
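Because the social media item uses a simple 1-to-5 numeric scale, the same reversal can also be done arithmetically; this is a minimal alternative sketch (the case_when() version above makes the score mapping more explicit):
library(dplyr)
# For a 1-5 scale, the reversed score is (max + min) - raw score, i.e., 6 - SocialMedia
LifeSat.df <- mutate(LifeSat.df, SocialMediaRev = 6 - SocialMedia)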
Examine the Gender variable in the “Environment” tab and notice that it was imported as an integer type rather than a categorical type because each level was coded numerically.
Use mutate()
to change the Gender variable to a
categorical type, and replace the numerical values with category
names.
library(dplyr)
LifeSat.df <- LifeSat.df %>%                      # update data set
  mutate(Gender = as.character(Gender)) %>%       # change Gender from integer to character
  mutate(Gender = case_when(                      # change condition labels
    Gender == "1" ~ "Female",
    Gender == "2" ~ "Male",
    Gender == "3" ~ "NonBinary",
    Gender == "4" ~ "Other"))
Now that the data have been wrangled into a more usable form, you can examine and summarize the scores within each variable and assess the relationships between variables.
Create a descriptive summary table for one or more
quantitative variables. This example introduces the
psych package, which contains the
describe()
and describeBy()
functions for
finding descriptive statistics.
library(dplyr) # load dplyr package
library(psych) # load psych package
summary.quant <- LifeSat.df %>%                            # create a new df from an existing df
  select(PhoneWO, SleepEn, SocialMediaRev, LSmean.Y) %>%   # select variables
  describe(skew = FALSE)                                   # apply the describe() function
summary.quant                                              # show the results
## vars n mean sd min max range se
## PhoneWO 1 61 3.18 1.19 1 5 4 0.15
## SleepEn 2 61 2.46 1.06 1 4 3 0.14
## SocialMediaRev 3 61 3.69 1.12 1 5 4 0.14
## LSmean.Y 4 62 4.10 1.50 1 7 6 0.19
Create a descriptive summary table for the outcome variable scores (mean life satisfaction) across levels of a categorical predictor variable (gender).
library(psych)
summary.gender <- describeBy(
  LSmean.Y ~ Gender,              # outcome variable ~ grouping variable
  data = LifeSat.df,              # data frame
  skew = FALSE, range = FALSE,    # omit skew and range info
  mat = TRUE,                     # matrix format
  digits = 2)                     # round values to 2 digits
summary.gender                    # show descriptives
## item group1 vars n mean sd se
## LSmean.Y1 1 Female 1 39 4.28 1.34 0.21
## LSmean.Y2 2 Male 1 18 4.28 1.54 0.36
## LSmean.Y3 3 NonBinary 1 4 2.10 1.36 0.68
## LSmean.Y4 4 Other 1 1 1.80 NA NA
The regression coefficients (betas) calculated for the relationship between an outcome variable (Y) and a categorical predictor variable (X; e.g., gender) reflect mean differences on Y across levels of X. The extent to which these values are meaningful, or appropriate for an inferential test on the slope (beta), therefore depends on the number of scores in each level (e.g., Female, Male, Non-binary, Other).
For this project, we will consider 10 to be the minimum number of scores that a level must have to be included in the analysis (note, this is a very low cutoff, which should be adjusted higher for most research projects). In the descriptive summary table, notice the “non-binary” and “other” samples have only 4 and 1 score, respectively, so unfortunately, these groups of scores are far too small to be considered for further analysis.
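Before dropping any levels, you can double-check the group sizes directly; this minimal sketch uses the count() function in dplyr (it repeats the n values already visible in the summary table above):
library(dplyr)
# Count the number of scores in each Gender level
LifeSat.df %>% count(Gender)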
Use mutate()
to remove specific scores from a variable
(this will turn the score to NA).
library(dplyr)
LifeSat.df <- LifeSat.df %>%
  mutate(Gender = na_if(Gender, "NonBinary")) %>%
  mutate(Gender = na_if(Gender, "Other"))
head(LifeSat.df$Gender)
## [1] "Male" "Male" "Female" "Female" "Male" "Female"
Now, rerun the descriptive summary table to update the values.
library(psych)
summary.gender <- describeBy(
  LSmean.Y ~ Gender,              # outcome variable ~ grouping variable
  data = LifeSat.df,              # data frame
  skew = FALSE, range = FALSE,    # omit skew and range info
  mat = TRUE,                     # matrix format
  digits = 2)                     # round values to 2 digits
summary.gender                    # show descriptives
## item group1 vars n mean sd se
## LSmean.Y1 1 Female 1 39 4.28 1.34 0.21
## LSmean.Y2 2 Male 1 18 4.28 1.54 0.36
Use the pairs.panels()
function in the psych
package to examine the relationships of interest. This function gives a
histogram for each variable as well as a scatter plot and correlation
value for each bivariate relationship. Note, this function is used for
exploratory purposes only.
library(dplyr)
library(psych)
LifeSat.df %>%
  select(PhoneWO, SleepEn, SocialMediaRev, LSmean.Y) %>%
  pairs.panels()
Base R has many built-in plotting functions, but in most cases it will be easier to make a high quality data visualization using a dedicated plotting package, like ggplot2. Start by making sure you have the ggplot2 package installed.
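If ggplot2 is not yet installed, a minimal sketch for installing and loading it:
# install.packages("ggplot2")   # run once to install; commented out so it does not rerun every session
library(ggplot2)                # load the package in each new session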
To avoid readability issues that can occur with overlapping data points, all the plots in this section have slightly transparent points (alpha < 1) with a small amount of horizontal offset (jitter). Remove the jitter if you need each data point to be plotted at its exact x-y coordinate. See this article on Doing Better Data Visualization for more discussion (Advances in Methods and Practices in Psychological Science, 2021).
Create a scatter plot with a best-fit regression line and standard error to visualize the relationship between a quantitative predictor (X) and outcome (Y) variable.
Life Satisfaction X Long Time Without Phone
library(ggplot2) # contains plotting functions
ggplot(LifeSat.df, aes(x = PhoneWO, y = LSmean.Y)) + # data frame, aesthetic
geom_point(position = position_jitter(0.1), na.rm = TRUE, # plot points
size = 3, color = "darkblue", alpha = 0.5) + # size, color, and transparency of points
geom_smooth(method = "lm", formula = "y ~ x", se = TRUE, na.rm = TRUE) + # plot line and se
scale_x_continuous(name = "Long Time Without Phone") + # x axis label
scale_y_continuous(name = "Life Satisfaction") + # y axis label
theme_minimal() + # visual format
theme(text = element_text(size = 16)) # text size
Life Satisfaction X Social Media Use
library(ggplot2)
ggplot(LifeSat.df, aes(x = SocialMediaRev, y = LSmean.Y)) +
geom_point(position = position_jitter(0.1), na.rm = TRUE,
size = 3, color = "darkblue", alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x", se = TRUE, na.rm = TRUE) +
scale_x_continuous(name = "Social Media Use") +
scale_y_continuous(name = "Life Satisfaction") +
theme_minimal() +
theme(text = element_text(size = 16))
Life Satisfaction X Get Enough Sleep
library(ggplot2)
ggplot(LifeSat.df, aes(x = SleepEn, y = LSmean.Y)) +
geom_point(position = position_jitter(0.1), na.rm = TRUE,
size = 3, color = "darkblue", alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x", se = TRUE, na.rm = TRUE) +
scale_x_continuous(name = "Get Enough Sleep") +
scale_y_continuous(name = "Life Satisfaction") +
theme_minimal() +
theme(text = element_text(size = 16))
Add a color element in the aesthetic to create a scatter plot that shows the relationship between a quantitative predictor (X) and outcome (Y) variable across levels of a categorical predictor variable.
library(ggplot2)
ggplot(LifeSat.df, aes(x = SleepEn, y = LSmean.Y, color = Gender)) + # color by category
geom_point(position = position_jitter(0.1), na.rm = TRUE,
size = 3, alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, na.rm = TRUE) +
scale_x_continuous(name = "Get Enough Sleep") +
scale_y_continuous(name = "Life Satisfaction") +
scale_color_manual(values = c("skyblue4", "darkorange3"), # group colors
limits = c("Female", "Male"), # groups names
na.translate = FALSE) + # omit NAs
theme_minimal() +
theme(text = element_text(size = 16))
Create a strip chart to visualize the relationship between a categorical predictor variable (Gender) and quantitative outcome variable (life satisfaction). A strip chart displays the means (diamond) and standard errors (error bars), similar to a classic bar plot, with the added benefit of showing all the raw scores in each category.
library(ggplot2)
ggplot(LifeSat.df, aes(x = Gender, y = LSmean.Y, color = Gender)) + # color by category
geom_jitter(na.rm = TRUE, position = position_jitter(0.1),
size = 3, alpha = 0.7) +
stat_summary(na.rm = TRUE, fun = mean, geom = "point", # add mean
shape = 18, size = 3, color = "black") +
stat_summary(na.rm = TRUE, fun.data = mean_se, geom = "errorbar", # add error bars
width = .1, color = "black") +
scale_x_discrete(name = "Gender", limits = c("Female", "Male")) + # group names
scale_y_continuous(name = "Life Satisfaction") + # y axis label
theme_minimal() +
theme(text = element_text(size = 18), legend.position = "none") +
scale_color_brewer(palette="Dark2") # group colors
The pairs panel and scatter plots above show that going without a phone, getting enough sleep, and social media use were all positively related to life satisfaction (Y). In this section, we will build a regression model for all three predictor variables.
The relationship between life satisfaction (Y) and one predictor variable (X) in the population can be expressed in a linear model:
\(\displaystyle Y = \beta_0 + \beta_{1}X_{1} + \epsilon\)
Where \(\beta_0\) = the Y intercept,
\(\beta_{1}\) = the regression
coefficient, or slope of the best-fit relationship, and \(\epsilon\) = error. In most cases, \(\beta_{1}\) is the main value of interest;
when \(\beta_{1} = 0\) there is no
relationship, and the more \(\beta_{1}\) differs from 0 in the negative
or positive direction, the more the Y scores change with changes in X.
Additional predictor variables can be entered into the model as \(\beta X\) pairs:
\(\displaystyle Y = \beta_0 + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{3} + \epsilon\)
When researchers do not have access to the population parameters, estimates for the regression coefficients and Y-intercept can be derived from sample data to create a sample regression model.
\(\displaystyle \hat{Y} = b_0 + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3}\)
Where \(\hat{Y}\) is the predicted
value of \(Y\), \(b_0\) is the estimated Y intercept, and
\(b_{1}\), \(b_{2}\), and \(b_{3}\) are the estimated regression
coefficients for each predictor variable.
Interpreting a regression coefficient: Let’s imagine we find that \(b_{1}\) = 0.5. This value indicates that for every 1 unit increase in the X score, \(X_{1}\), there is a 0.5 point increase in the predicted Y score, \(\hat{Y}\), assuming the other predictor variables, \(X_{2}\) and \(X_{3}\), remain unchanged (are held constant).
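To make this concrete, here is a minimal sketch with made-up numbers (the coefficients and X values below are hypothetical, not estimates from this data set):
# Hypothetical values for illustration only
b0 <- 1.0; b1 <- 0.5; b2 <- 0.2; b3 <- 0.3
b0 + b1*3 + b2*4 + b3*2   # predicted Y when X1 = 3, X2 = 4, X3 = 2 --> 3.9
b0 + b1*4 + b2*4 + b3*2   # raise X1 by 1 unit, hold X2 and X3 constant --> 4.4 (an increase of b1 = 0.5)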
Null hypothesis significance testing (NHST) is a framework, or set of rules, for using sample data to make an inference about unknown population data.
Inference: Drawing a conclusion or making a logical judgment on the basis of evidence and prior conclusions rather than on the basis of direct observation.
The analysis of multiple regression produces two types of hypothesis test: an omnibus F-test on the full model and a t-test on each individual regression coefficient.
Make a statistical decision
Remember that p is the probability of obtaining a statistic value at least as extreme as the one observed if the null hypothesis is true.
Use the lm()
function to conduct an analysis of multiple
regression on life satisfaction scores (Y) with three predictor
variables, going without a phone, getting enough sleep, and social media
use:
model.ls <- lm(formula = LSmean.Y ~ SleepEn + PhoneWO + SocialMediaRev, data = LifeSat.df)
summary(model.ls)
##
## Call:
## lm(formula = LSmean.Y ~ SleepEn + PhoneWO + SocialMediaRev, data = LifeSat.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3738 -1.0873 0.1369 0.6909 2.7858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2869 0.8634 0.332 0.74096
## SleepEn 0.5136 0.1628 3.155 0.00261 **
## PhoneWO 0.3879 0.1507 2.574 0.01279 *
## SocialMediaRev 0.3595 0.1588 2.263 0.02758 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.221 on 55 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.3287, Adjusted R-squared: 0.2921
## F-statistic: 8.977 on 3 and 55 DF, p-value: 6.187e-05
# show the analysis of regression source table
# library(car)
# Anova(model.ls)
The F-test on the omnibus null hypothesis shows the model significantly predicts life satisfaction scores, \(F(3,55) = 8.977\), \(p = .00006187\). Also shown is the residual standard error = 1.221 (standard error of estimate) and the adjusted R squared = 0.2921 (coefficient of determination adjusted for the number of predictors).
The null hypothesis test for each regression coefficient is also shown. The slope for each predictor, \(b_{phone} = 0.3879\), \(b_{sleep} = 0.5136\), and \(b_{media} = 0.3595\), appears in the “Estimate” column. The regression coefficient divided by its standard error (“Std. Error” column) gives the t-test value (“t value” column). The p-value for the t-test on each coefficient is shown in the “Pr(>|t|)” column. One or more asterisks at the end of a row indicates statistical significance.
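As a quick sanity check, you can reproduce one of the t values by hand from the output above:
# t value for SleepEn: estimate divided by its standard error
0.5136 / 0.1628   # approximately 3.155, matching the "t value" column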
Linearity and homogeneity of variance: Plot the residuals to identify possible deviations from the assumptions about linearity and homogeneity of variance. Use this brief guide on assessing residual plots: Link
list.residual <- resid(model.ls)         # get list of residuals
plot(fitted(model.ls), list.residual)    # produce residual plot
abline(0, 0)                             # add horizontal line at Y = 0
Normality: Create a quantile-quantile (QQ) plot to identify possible deviations from the normality assumption. Use this brief guide on assessing QQ plots: Link
list.stdres <- rstandard(model.ls)   # compute the standardized residuals
qqnorm(list.stdres,                  # plot residuals
       main = "QQ Plot")             # plot title
qqline(list.stdres)                  # add QQ line
Multicollinearity is the degree of intercorrelation between predictor variables in a multiple regression model. Some relatedness between predictors is expected, but high intercorrelation indicates that one or more predictor variables is adding redundant information to the model. When redundant predictors are identified, factors may be removed, combined, or centered as a way to remedy the problem.
Check for redundant predictor variables in the model using variance inflation factors (VIF). A VIF = 1 indicates that a factor is uncorrelated with other factors; higher values indicate greater correlation (VIF > 2 may indicate a problem). Use this brief guide on assessing variance inflation factors: Link
library(car)
vif(model.ls)
## SleepEn PhoneWO SocialMediaRev
## 1.113098 1.171663 1.219343
Here’s an example of how the multiple regression results could be reported in an APA-style report.
An analysis of multiple linear regression showed that going without a phone, getting enough sleep, and social media use predicted a significant amount of variance in life satisfaction, \(F(3,55) = 8.977\), \(p < .001\), \(R_{adj.}^2 = 0.2921\). Participants reported significantly greater life satisfaction as going without a phone increased, \(b = 0.388\), \(SE = 0.151\), \(p = .013\), as sleeping enough increased, \(b = 0.514\), \(SE = 0.163\), \(p = .003\), and as social media use increased, \(b = 0.360\), \(SE = 0.159\), \(p = .028\).
Unlike data collected from experimental research designs, the correlational data collected in this survey project do not allow us to draw causal inferences about the unknown population, however, we study these kinds of relationships because we assume the variables really are causally connected in a specific temporal order. So, we build “explanatory” statistical models with correlational data given our assumptions about the causal connections, not because of how the data were collected. When we test null hypotheses about a model and identify significant relationships, it is important to remember that the results are still correlational and should not be used to draw causal conclusions.
It is possible that the relationship between a predictor (X1) and outcome (Y) variable is affected by one or more moderating variables (X2). For example, comfort without a phone (X2) might moderate the degree to which social media use (X1) predicts life satisfaction (Y).
One or more tests for interaction effects can be added to a multiple regression model to identify moderating variables. Interaction terms can be added using the * or : symbol between two or more predictors: * includes the main effect of each predictor plus their interaction(s), whereas : adds only the interaction term itself. For example,
lm(formula = LSmean.Y ~ SleepEn*PhoneWO*SocialMediaRev, data = LifeSat.df)
lm(formula = LSmean.Y ~ SleepEn:PhoneWO:SocialMediaRev, data = LifeSat.df)
Multicollinearity may be high when raw scores are
entered into a model like this because the interaction terms will tend
to be highly correlated with one or more predictor variables.
Mean centered predictor scores will result in far lower intercorrelations (see Iacobucci et al., 2016). Transforming raw scores to mean centered scores is a simple process, just subtract the variable mean from each raw score, \(X - \bar{X}\).
Use the mutate()
function in the dplyr package
and the scale()
function to create a new mean centered
predictor variable from each of the original predictor variables.
library(dplyr)
LifeSat.df <- LifeSat.df %>%
  mutate(mcPhoneWO = scale(PhoneWO, scale = FALSE)) %>%
  mutate(mcSleepEn = scale(SleepEn, scale = FALSE)) %>%
  mutate(mcSocMedRev = scale(SocialMediaRev, scale = FALSE))
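If you prefer not to use scale() (which returns a one-column matrix rather than a plain numeric vector), the centering can also be written out directly from the formula above. The sketch below uses a hypothetical variable name, mcPhoneWO2, and then confirms that the centered scores average to approximately zero:
library(dplyr)
# Manual mean centering: subtract the variable mean from each raw score
LifeSat.df <- mutate(LifeSat.df, mcPhoneWO2 = PhoneWO - mean(PhoneWO, na.rm = TRUE))  # hypothetical name
mean(LifeSat.df$mcPhoneWO2, na.rm = TRUE)  # should be (approximately) 0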
Build a model that includes one or more interaction terms of interest using the new mean centered predictor variables. Note that each additional test entered into an analysis of multiple regression adds complexity and costs degrees of freedom, so in general, simpler models should be preferred over more complicated models, unless there is meaningful information gained from the additional test(s).
In this example, we’ll use the lm()
function to model
the individual effects on LSmean.Y of mcSleepEn, mcPhoneWO, and
mcSocMedRev, as well one interaction term for mcPhoneWO x
mcSocMedRev.
model.mod <- lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO*mcSocMedRev, data = LifeSat.df)
summary(model.mod)
##
## Call:
## lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO * mcSocMedRev,
## data = LifeSat.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.06043 -0.99322 0.08039 0.63128 2.73957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2190 0.1597 26.420 < 2e-16 ***
## mcSleepEn 0.4622 0.1579 2.927 0.00500 **
## mcPhoneWO 0.3941 0.1448 2.721 0.00874 **
## mcSocMedRev 0.2521 0.1592 1.583 0.11923
## mcPhoneWO:mcSocMedRev 0.2654 0.1123 2.364 0.02171 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.173 on 54 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.3917, Adjusted R-squared: 0.3466
## F-statistic: 8.691 on 4 and 54 DF, p-value: 1.72e-05
Check for redundant predictor variables in the model using variance inflation factors (VIF).
library(car)
vif(model.mod)
## mcSleepEn mcPhoneWO mcSocMedRev
## 1.134577 1.172044 1.327503
## mcPhoneWO:mcSocMedRev
## 1.156021
Compared to the regression model without any interaction terms (model.ls), model.mod shows that social media use is no longer a significant predictor of life satisfaction, but its interaction with going without a phone is significant. When a moderation effect like this is detected, describing the effect requires “breaking down the interaction,” which involves plotting and testing the relationship between social media use (X1) and life satisfaction (Y) at several levels, or categories, of going without a phone (X2).
There are a few ways we could categorize the mcPhoneWO data. In some cases, simply splitting the scores into low and high categories is sufficient, but in other cases, creating three or more categories may provide more insight into the interaction effect.
library(dplyr)
median <- median(LifeSat.df$mcPhoneWO, na.rm = TRUE)   # calculate and save the median value
LifeSat.df <- LifeSat.df %>%
  mutate(mcPhoneWOmedian = case_when(                  # create a new variable, mcPhoneWOmedian
    mcPhoneWO <= median ~ "Low",                       # label participants "Low" if their score is at or below the median
    mcPhoneWO > median ~ "High"))                      # label participants "High" if their score is above the median
Use describeBy
in the psych package to create a
descriptive summary table for life satisfaction scores across the Low
and High categories.
library(psych)
summary.mcPhoneWOmedian <- describeBy(
  LSmean.Y ~ mcPhoneWOmedian,
  data = LifeSat.df,
  skew = FALSE,
  mat = TRUE,
  digits = 2)
summary.mcPhoneWOmedian
## item group1 vars n mean sd min max range se
## LSmean.Y1 1 High 1 28 4.67 1.50 1.5 7.0 5.5 0.28
## LSmean.Y2 2 Low 1 33 3.71 1.29 1.0 5.8 4.8 0.22
Use ggplot
to plot the relationship between social media
use (X1) and life satisfaction (Y) across Low and High levels of being
without a phone (X2).
library(ggplot2)
ggplot(LifeSat.df, aes(x = mcSocMedRev, y = LSmean.Y, color = mcPhoneWOmedian)) +
geom_point(position = position_jitter(0.1), na.rm = TRUE,
size = 3, alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, na.rm = TRUE) +
scale_x_continuous(name = "Social Media (mean centered)") +
scale_y_continuous(name = "Life Satisfaction") +
scale_color_manual(name = "Phone W/O \n(mean centered)",
values = c("skyblue4", "darkorange3"),
limits = c("Low", "High"), # select levels, and set legend order
na.translate = FALSE) +
theme_minimal() +
theme(text = element_text(size = 16))
library(dplyr)
sd <- sd(LifeSat.df$mcPhoneWO, na.rm = TRUE)                 # calculate and save the sd value
LifeSat.df <- LifeSat.df %>%
  mutate(mcPhoneWOsd = case_when(                            # create a new variable
    (mcPhoneWO <= sd*-.5) ~ "Low",                           # "Low" if less/equal to -.5 sd
    (mcPhoneWO > sd*-.5) & (mcPhoneWO < sd*.5) ~ "Moderate", # "Moderate" if between -.5 and +.5 sd
    (mcPhoneWO >= sd*.5) ~ "High"))                          # "High" if greater/equal to +.5 sd
Use describeBy
in the psych package to create a
descriptive summary table for life satisfaction scores across the Low,
Moderate, and High categories.
library(psych)
summary.mcPhoneWOsd <- describeBy(
  LSmean.Y ~ mcPhoneWOsd,
  data = LifeSat.df,
  skew = FALSE,
  mat = TRUE,
  digits = 2)
summary.mcPhoneWOsd
## item group1 vars n mean sd min max range se
## LSmean.Y1 1 High 1 28 4.67 1.50 1.5 7.0 5.5 0.28
## LSmean.Y2 2 Low 1 20 3.67 1.37 1.0 5.8 4.8 0.31
## LSmean.Y3 3 Moderate 1 13 3.77 1.21 1.8 5.6 3.8 0.34
Use ggplot
to plot the relationship between social media
use (X1) and life satisfaction (Y) across the three SD-based levels of being
without a phone (X2).
library(ggplot2)
ggplot(LifeSat.df, aes(x = mcSocMedRev, y = LSmean.Y, color = mcPhoneWOsd)) +
geom_point(position = position_jitter(0.1), na.rm = TRUE,
size = 3, alpha = 0.5) +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, na.rm = TRUE) +
scale_x_continuous(name = "Social Media (mean centered)") +
scale_y_continuous(name = "Life Satisfaction") +
scale_color_brewer(name = "Phone W/O \n(mean centered)",
palette="Dark2",
limits = c("Low", "Moderate", "High"), # select levels, and set legend order
na.translate = FALSE) +
theme_minimal() +
theme(text = element_text(size = 16))
Now that we can see how the relationship between social media use and life satisfaction changes across levels of being without a phone, we can test whether each slope differs significantly from 0. From the descriptive summary tables, you can see that sample sizes can get quite low (n = 13) the more scores are split up, so we will use the median split data in the analysis of simple effects.
model.mod.slopes <- lm(formula = LSmean.Y ~ mcPhoneWOmedian:mcSocMedRev, data = LifeSat.df)
summary(model.mod.slopes)
##
## Call:
## lm(formula = LSmean.Y ~ mcPhoneWOmedian:mcSocMedRev, data = LifeSat.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2126 -0.9558 -0.0126 1.1874 2.2573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.23450 0.18395 23.020 < 2e-16 ***
## mcPhoneWOmedianHigh:mcSocMedRev 0.64658 0.21498 3.008 0.00391 **
## mcPhoneWOmedianLow:mcSocMedRev -0.07031 0.25873 -0.272 0.78679
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.388 on 57 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.1375, Adjusted R-squared: 0.1072
## F-statistic: 4.543 on 2 and 57 DF, p-value: 0.01477
Here’s an example of how these multiple regression results could be reported in an APA-style report.
An analysis of
linear regression showed that mean centered scores on going without a
phone, getting enough sleep, social media use, and an interaction term
between going without a phone and social media use predicted a
significant amount of variance in life satisfaction, \(F(4,54) = 8.691\), \(p < .001\), \(R_{adj.}^2 = 0.3466\). Participants
reported significantly greater life satisfaction as sleeping enough
increased, \(b = 0.462\), \(SE = 0.158\), \(p
= .005\). An analysis of linear regression and interaction plot
on life satisfaction by social media use across participants below (Low)
and above (High) the median score for going without a phone showed that
agreement with the statement “Social media is part of your everyday
activity” predicts greater life satisfaction, but only among those
participants who also reported higher agreement with the statement “I
can go a long time without looking at my phone,” \(b = 0.647\), \(SE
= 0.215\), \(p = .004\). This
finding suggests that being active on social media may be part of a
healthy lifestyle for those people who are less susceptible to feeling
dependent on their phone.
A note about causal language: It’s common to see researchers use terms like “effect” and “impact” when describing a relationship among correlational variables, e.g., “The impact of social media use on life satisfaction depends on how a person feels about being away from their phone.” The use of these terms is not meant to suggest that a causal connection can actually be drawn, rather, these terms come from a common vocabulary that researchers use to describe relationships among variables that are believed to be causally connected.
It is possible that a predictor variable (X1) has a direct influence on an outcome variable (Y), and/or an indirect influence through a mediating variable (X2). For example, daily social media use (X1) may influence life satisfaction (Y) directly (path c), but it might also influence sleeping enough (X2) (path a), which in turn influences life satisfaction (path b).
A note about mediation: The following example provides a very simple guide for applying a mediation analysis to survey data, but there is much about this topic that is not discussed here because it goes well beyond what’s covered in a typical undergraduate course on behavioral statistics or research methods. If you’re looking for a more technical discussion about conducting mediation analyses in R, this tutorial on the lavaan package is a good place to start.
A note about exploring data: The following mediation example is suitable in situations where past research or sound reasoning has led you to predict a specific mediation effect. If it is used for post-hoc exploratory analyses, be sure to inform your reader.
A note about scores: In the last section, we said that moderation models benefit from using mean centered scores, rather than raw scores, because it reduces potential problems that arise from multicollinearity. A model of moderated mediation would benefit from mean centered scores for the same reason. However, the problem does not apply to a simple mediation model, so we will use raw scores in the following example.
Mediation effects can be identified and tested using a two-part model.
1. Save the two regression equations that make up the mediation model in a new variable.
<- "
mediation.model LSmean.Y ~ SocialMediaRev + SleepEn
SleepEn ~ SocialMediaRev
"
2. Use the sem()
function in the lavaan
package to test the model.
library(lavaan)
mediation.fit <- sem(mediation.model, data = LifeSat.df)
summary(mediation.fit)
## lavaan 0.6-12 ended normally after 1 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 5
##
## Used Total
## Number of observations 60 62
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## LSmean.Y ~
## SocialMediaRev 0.245 0.151 1.628 0.104
## SleepEn 0.638 0.162 3.935 0.000
## SleepEn ~
## SocialMediaRev 0.243 0.116 2.100 0.036
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .LSmean.Y 1.586 0.290 5.477 0.000
## .SleepEn 1.005 0.183 5.477 0.000
3. Map the paths (relationships) in the model.
library(semPlot)
semPaths(mediation.fit, "par",
sizeMan = 15, sizeInt = 15, sizeLat = 15,
edge.label.cex=1.5,
fade=FALSE)
Here’s an example of how these mediation (path analysis) results could be reported in an APA-style report.
We used a path analysis to test the direct effects of social media use on life satisfaction as well as its indirect effects through a mediating variable, getting enough sleep. The results showed that social media use did not have a direct influence on life satisfaction, however, higher social media use was a significant predictor of getting enough sleep, \(b = 0.243\), \(SE = 0.116\), \(p = .036\), and getting enough sleep was a significant predictor of higher life satisfaction, \(b = 0.638\), \(SE = 0.162\), \(p < .001\).
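If you also want a direct estimate of the indirect (a × b) effect, lavaan lets you label the regression paths and define new parameters with the := operator. The sketch below is an optional extension of the model above; the labels a, b, and c and the object names are arbitrary choices, not part of the original example:
library(lavaan)
mediation.model.ab <- "
  LSmean.Y ~ c*SocialMediaRev + b*SleepEn   # direct effect (c) and path b
  SleepEn ~ a*SocialMediaRev                # path a
  indirect := a*b                           # indirect effect of social media through sleep
  total := c + (a*b)                        # total effect
"
mediation.fit.ab <- sem(mediation.model.ab, data = LifeSat.df)
summary(mediation.fit.ab)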
lm() function
As we have seen in the sections above, the summary output of a linear regression analysis (lm) includes inferential statistical tests (t and F) and descriptive statistics, such as adjusted R-squared and residual standard error.
Roughly speaking, adjusted R-squared (\(R_{adj}^2\)) indicates how well your model explains (fits) the observed data when adjusted for the number of predictors in the model, whereas residual standard error (RSE) indicates how well the model predicts the outcome variable when adjusted for the number of predictors in the model. When comparing two or more models of an outcome variable, it is not uncommon to see model selection based on the highest \(R_{adj}^2\) or lowest RSE; however, these values tend to favor models with too many predictor variables (bias) because they under-penalize each predictor added to the model.
The Bayesian Information Criterion (BIC) is a model selection tool that uses the obtained data to estimate how well the model would perform on a new data set. It applies a larger penalty for each predictor added to a model, so it gives a less biased estimate of how well the model would fit unseen data.
1. Create multiple models of your outcome variable and calculate the BIC: The first two models below replicate two of our previous linear models in life satisfaction, one without an interaction between the predictors and one with a two-way interaction. The third model is new, and has a three-way interaction between predictors.
With each model, the glance()
function in the
broom package is used to calculate the model’s BIC.
Model 1
model.1 <- lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO + mcSocMedRev, data = LifeSat.df)
summary(model.1)
##
## Call:
## lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO + mcSocMedRev,
## data = LifeSat.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3738 -1.0873 0.1369 0.6909 2.7858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1094 0.1591 25.835 < 2e-16 ***
## mcSleepEn 0.5136 0.1628 3.155 0.00261 **
## mcPhoneWO 0.3879 0.1507 2.574 0.01279 *
## mcSocMedRev 0.3595 0.1588 2.263 0.02758 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.221 on 55 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.3287, Adjusted R-squared: 0.2921
## F-statistic: 8.977 on 3 and 55 DF, p-value: 6.187e-05
library(broom)
glance(model.1)
## # A tibble: 1 × 12
## r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.329 0.292 1.22 8.98 6.19e-5 3 -93.4 197. 207. 82.0 55
## # … with 1 more variable: nobs <int>, and abbreviated variable names
## # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
## # ℹ Use `colnames()` to see all variable names
Model 2
model.2 <- lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO*mcSocMedRev, data = LifeSat.df)
summary(model.2)
##
## Call:
## lm(formula = LSmean.Y ~ mcSleepEn + mcPhoneWO * mcSocMedRev,
## data = LifeSat.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.06043 -0.99322 0.08039 0.63128 2.73957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2190 0.1597 26.420 < 2e-16 ***
## mcSleepEn 0.4622 0.1579 2.927 0.00500 **
## mcPhoneWO 0.3941 0.1448 2.721 0.00874 **
## mcSocMedRev 0.2521 0.1592 1.583 0.11923
## mcPhoneWO:mcSocMedRev 0.2654 0.1123 2.364 0.02171 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.173 on 54 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.3917, Adjusted R-squared: 0.3466
## F-statistic: 8.691 on 4 and 54 DF, p-value: 1.72e-05
library(broom)
glance(model.2)
## # A tibble: 1 × 12
## r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.392 0.347 1.17 8.69 1.72e-5 4 -90.5 193. 206. 74.4 54
## # … with 1 more variable: nobs <int>, and abbreviated variable names
## # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
## # ℹ Use `colnames()` to see all variable names
Model 3
model.3 <- lm(formula = LSmean.Y ~ mcSleepEn*mcPhoneWO*mcSocMedRev, data = LifeSat.df)
summary(model.3)
##
## Call:
## lm(formula = LSmean.Y ~ mcSleepEn * mcPhoneWO * mcSocMedRev,
## data = LifeSat.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0545 -0.9749 0.1459 0.7049 2.7455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.208014 0.174425 24.125 <2e-16 ***
## mcSleepEn 0.462880 0.173856 2.662 0.0104 *
## mcPhoneWO 0.375234 0.164898 2.276 0.0271 *
## mcSocMedRev 0.262791 0.171552 1.532 0.1317
## mcSleepEn:mcPhoneWO 0.002235 0.165974 0.013 0.9893
## mcSleepEn:mcSocMedRev 0.054906 0.174936 0.314 0.7549
## mcPhoneWO:mcSocMedRev 0.292061 0.147520 1.980 0.0531 .
## mcSleepEn:mcPhoneWO:mcSocMedRev 0.021219 0.136057 0.156 0.8767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.206 on 51 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.3935, Adjusted R-squared: 0.3102
## F-statistic: 4.726 on 7 and 51 DF, p-value: 0.0003805
library(broom)
glance(model.3)
## # A tibble: 1 × 12
## r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.393 0.310 1.21 4.73 3.81e-4 7 -90.5 199. 218. 74.1 51
## # … with 1 more variable: nobs <int>, and abbreviated variable names
## # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
## # ℹ Use `colnames()` to see all variable names
2. Evaluate the models: Give preference to simpler models and lower BIC scores.
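To line the three BIC values up in one table, you can also pass all three models to base R’s BIC() function (a minimal sketch; lower BIC indicates the preferred model):
BIC(model.1, model.2, model.3)  # compare BIC across the three models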
sem() function in the lavaan package
1. Create multiple models of your outcome variable and calculate the BIC: The first model below is identical to our previous mediation model. The predictor and mediator have been switched in the second mediation model.
With each model, the fitMeasures()
function in the
lavaan package is used to calculate the model’s BIC.
Model 1
.1 <- "
medMod LSmean.Y ~ SocialMediaRev + SleepEn
SleepEn ~ SocialMediaRev
"
library(lavaan)
.1 <- sem(medMod.1, data = LifeSat.df)
mediationsummary(mediation.1)
## lavaan 0.6-12 ended normally after 1 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 5
##
## Used Total
## Number of observations 60 62
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## LSmean.Y ~
## SocialMediaRev 0.245 0.151 1.628 0.104
## SleepEn 0.638 0.162 3.935 0.000
## SleepEn ~
## SocialMediaRev 0.243 0.116 2.100 0.036
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .LSmean.Y 1.586 0.290 5.477 0.000
## .SleepEn 1.005 0.183 5.477 0.000
fitMeasures(mediation.1, "bic") # calculate BIC
## bic
## 389.007
Model 2
.2 <- "
medMod LSmean.Y ~ SocialMediaRev + SleepEn
SocialMediaRev ~ SleepEn
"
library(lavaan)
.2 <- sem(medMod.2, data = LifeSat.df)
mediationsummary(mediation.2)
## lavaan 0.6-12 ended normally after 1 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 5
##
## Used Total
## Number of observations 60 62
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## LSmean.Y ~
## SocialMediaRev 0.245 0.151 1.628 0.104
## SleepEn 0.638 0.162 3.935 0.000
## SocialMediaRev ~
## SleepEn 0.282 0.134 2.100 0.036
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .LSmean.Y 1.586 0.290 5.477 0.000
## .SocialMediaRev 1.164 0.213 5.477 0.000
fitMeasures(mediation.2, "bic") # calculate BIC
## bic
## 397.826
2. Evaluate the models: Give preference to simpler models and lower BIC scores.
Gender affiliation
I feel I get enough sleep
I can go a long time without looking at my phone
Social media is part of your everyday activity [note the section on reverse scoring]
Life Satisfaction Items 1-5
In most ways my life is close to my ideal. [1]
The conditions of my life are excellent. [2]
I am satisfied with my life. [3]
So far I have gotten the important things I want in life. [4]
If I could live my life over, I would change almost nothing. [5]
In cases where you’re working with a data frame that has variables you don’t intend to use, it may be helpful to create a new data set with a subset of the variables.
Begin by getting the column names and numbers (index):
data.frame(colnames(LifeSat.df))
## colnames.LifeSat.df.
## 1 P
## 2 Gender
## 3 LS1
## 4 LS2
## 5 LS3
## 6 LS4
## 7 LS5
## 8 PhoneWO
## 9 SleepEn
## 10 SocialMedia
## 11 LSmean.Y
## 12 SocialMediaRev
## 13 mcPhoneWO
## 14 mcSleepEn
## 15 mcSocMedRev
## 16 mcPhoneWOmedian
## 17 mcPhoneWOsd
Subset the data by saving the variables you will use into a new data frame. Using the code below, list the variables you want to keep in c(). List individual columns separated by a comma, or list a range of columns using a colon. In this example, one predictor variable, SleepEn, and the five outcome variable items on life satisfaction (LS1 through LS5) are kept.
LifeSatShort.df <- LifeSat.df[c(3:7, 9)]   # list variables to keep
head(LifeSatShort.df, 5)
## LS1 LS2 LS3 LS4 LS5 SleepEn
## 1 6 6 6 5 6 4
## 2 2 1 2 2 2 1
## 3 7 7 7 7 7 4
## 4 7 5 7 6 7 NA
## 5 6 6 6 6 6 4
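Alternatively, the same subset can be created by variable name with the select() function in the dplyr package (a minimal sketch):
library(dplyr)
# Keep the five LS items and SleepEn by name instead of column number
LifeSatShort.df <- select(LifeSat.df, LS1:LS5, SleepEn)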
The following commands use the dplyr package to rename specific variables.
library(dplyr)
LifeSatShort.df <- rename(LifeSatShort.df,       # data frame
                          EnoughSleep = SleepEn) # new variable name = existing variable name
head(LifeSatShort.df, 5)
## LS1 LS2 LS3 LS4 LS5 EnoughSleep
## 1 6 6 6 5 6 4
## 2 2 1 2 2 2 1
## 3 7 7 7 7 7 4
## 4 7 5 7 6 7 NA
## 5 6 6 6 6 6 4