1 Correlational research design

Participants: Students (86 female, 42 male) enrolled in an introductory psychology course completed an online survey about their attitudes towards climate change.

Predictor Variables: Participants completed survey items on gender identity (Gender), political affiliation (Political), and perceived knowledge of global warming (KnowClim). Participants also completed the Climate Change Denial Scale with three survey items on 1) perceived climate change harm (ClimHarm), 2) beliefs about the cause of global warming (CauseClim), and 3) views about the scientific consensus on global warming (SciClim) (see Appendix A).

Outcome Variable: Participants completed one survey item about climate change worry (WorryClim.Y):
I have worrisome thoughts about global climate change (1 = Never, 5 = Almost Always).

2 Data Wrangling

2.1 Load and view the data

climAtt.df  <- read.csv("https://andrewebrandt.github.io/object/datasets/climAtt.csv", header = TRUE)
head(climAtt.df, 1)

##   P Gender Political VoluntHr Alcohol CRT Empathy Extrovert ClimHarm CauseClim
## 1 1   Male  No party        8       3   1       3         7        5         1
##   SciClim KnowClim WorryClim.Y
## 1       1        4           4

2.2 Subset the variables of interest into a new data frame

Begin by getting the column names and numbers (index):

data.frame(colnames(climAtt.df))

##    colnames.climAtt.df.
## 1                     P
## 2                Gender
## 3             Political
## 4              VoluntHr
## 5               Alcohol
## 6                   CRT
## 7               Empathy
## 8             Extrovert
## 9              ClimHarm
## 10            CauseClim
## 11              SciClim
## 12             KnowClim
## 13          WorryClim.Y

Subset the data by keeping only those variables you will use. In this case, keep six predictor variables, Gender, Political, ClimHarm, CauseClim, SciClim, KnowClim, and the outcome variable, WorryClim.Y:

climAtt.df <- climAtt.df[c(2:3,9:13)]  # list variables to keep
head(climAtt.df, 1)

##   Gender Political ClimHarm CauseClim SciClim KnowClim WorryClim.Y
## 1   Male  No party        5         1       1        4           4

2.3 Rename variables (data frame columns)

Let’s modify the variable names ClimHarm, CauseClim, and SciClim to indicate they are three items that belong to the same climate denial scale (see below).

library(dplyr)
climAtt.df <- rename(climAtt.df,                   # data frame
                     ClimHarm.1 = ClimHarm,     # new variable name = existing variable Name
                     CauseClim.2 = CauseClim, 
                     SciClim.3 = SciClim)
head(climAtt.df, 1)

##   Gender Political ClimHarm.1 CauseClim.2 SciClim.3 KnowClim WorryClim.Y
## 1   Male  No party          5           1         1        4           4

2.4 Calculate one score for “Climate Change Denial”

To find the Climate Change Denial Scale (CCDS) scores, re-score each item using the provided weights (in parentheses after each scale item) and sum the weighted score for all three items.

Item #3 ClimHarm.1: Mutate the original responses (1 to 5) on ClimHarm.1 to the stated weights (0, 1, 2).

Please indicate your belief about how much harm climate change is likely to cause: People all over the world

1=No harm (2)
2=Slight harm (2)
3=Moderate harm (1)
4=Considerable harm (0)
5=Great harm (0)

library(dplyr)
climAtt.df <-                                        # update existing data frame
  mutate(climAtt.df, Den.ClimHarm.1 =                # create a new variable, Den.ClimHarm.1
                     case_when(ClimHarm.1 == 1 ~ 2, 
                               ClimHarm.1 == 2 ~ 2,
                               ClimHarm.1 == 3 ~ 1,
                               ClimHarm.1 == 4 ~ 0,
                               ClimHarm.1 == 5 ~ 0))

Item #4 CauseClim.2: Mutate the original responses (1 or 2) on CauseClim.2 to the stated weights (0, 1, 2).

From what you have heard or read, do you believe increases in the Earth’s temperature over the last century are due more to:

1=The effects of pollution from human activities (0)
2=Natural changes in the environment that are not due to human activities (2)

library(dplyr)
climAtt.df <-                                         # update existing data frame
  mutate(climAtt.df, Den.CauseClim.2 =                # create a new variable, Den.CauseClim.2
                     case_when(CauseClim.2 == 1 ~ 0, 
                               CauseClim.2 == 2 ~ 2))

Item #5 SciClim.3: Mutate the original responses (1 to 3) on SciClim.3 to the stated weights (0, 1, 2). Just your impression, which one of the following statements do you think is most accurate?

1=Most scientists believe that global warming is occurring (0)
2=Most scientists believe that global warming is not occurring (2)
3=Most scientists are unsure about whether global warming is occurring or not (1)

library(dplyr)
climAtt.df <-                                        # update existing data frame
  mutate(climAtt.df, Den.SciClim.3 =                 # create a new variable, Den.SciClim.3
                     case_when(SciClim.3 == 1 ~ 0, 
                               SciClim.3 == 2 ~ 2,
                               SciClim.3 == 3 ~ 1))

Finally, create a new variable for the final CCDS score (0 to 6), which is simply the sum of the three weighted items created above.

library(dplyr)
climAtt.df <- 
 mutate(climAtt.df, CCDS = rowSums(across(Den.ClimHarm.1:Den.SciClim.3)))

2.5 Evaluate the data

Categorical variables: Create a descriptive summary table for climate worry grouped by one predictor:

library(psych)
summary.gender <- describeBy(
  climAtt.df$WorryClim.Y,         # outcome variable
  climAtt.df$Gender,              # grouping variable
  skew= FALSE, range = FALSE,     # omit skew and range info
  mat = TRUE,                     # matrix format
  digits = 2)                     # round values to 2 digits
summary.gender                    # show descriptives

##     item group1 vars  n mean   sd   se
## X11    1 Female    1 86 3.33 0.95 0.10
## X12    2   Male    1 42 2.74 1.06 0.16

summary.polit <- describeBy(
  climAtt.df$WorryClim.Y,         # outcome variable
  climAtt.df$Political,           # grouping variable
  skew= FALSE, range = FALSE,     # omit skew and range info
  mat = TRUE,                     # matrix format
  digits = 2)                     # round values to 2 digits
summary.polit                     # show descriptives

##     item      group1 vars  n mean   sd   se
## X11    1    Democrat    1 57 3.46 0.87 0.11
## X12    2 Independent    1 16 3.38 1.02 0.26
## X13    3 Libertarian    1  3 3.00 1.00 0.58
## X14    4    No party    1 30 2.97 1.00 0.18
## X15    5 Other party    1  2 3.50 0.71 0.50
## X16    6  Republican    1 20 2.25 1.02 0.23

Delete scores: The results show the Political variable contains two levels (Libertarian and Other party) with too few observations (n) to make meaningful comparisons to the other levels. You can remove these scores so they are not used in plots or analyses.

library(dplyr)
climAtt.df <- climAtt.df %>% 
  mutate(Political = na_if(Political, "Other party")) %>% 
  mutate(Political = na_if(Political, "Libertarian"))

summary.polit <- describeBy(
  WorryClim.Y ~ Political,        # Y ~ X
  data = climAtt.df,              # data = data frame
  skew= FALSE, range = FALSE,     # omit skew and range info
  mat = TRUE,                     # matrix format
  digits = 2)                     # round values to 2 digits
summary.polit                     # show descriptives

##              item      group1 vars  n mean   sd   se
## WorryClim.Y1    1    Democrat    1 57 3.46 0.87 0.11
## WorryClim.Y2    2 Independent    1 16 3.38 1.02 0.26
## WorryClim.Y3    3    No party    1 30 2.97 1.00 0.18
## WorryClim.Y4    4  Republican    1 20 2.25 1.02 0.23

Clustered categorical variables: Create a descriptive summary table for two predictors:

library(dplyr)
summary.polit.gen <- describeBy(
  WorryClim.Y ~ Political + Gender,  # Y ~ X1 + X2
  data = climAtt.df,                 # data = data frame
  skew= FALSE, range = FALSE,        # omit skew and range info
  mat = TRUE,                        # matrix format
  digits = 2)                        # round values to 2 digits
summary.polit.gen                    # show descriptives

##              item      group1 group2 vars  n mean   sd   se
## WorryClim.Y1    1    Democrat Female    1 38 3.58 0.86 0.14
## WorryClim.Y2    2 Independent Female    1 14 3.36 1.08 0.29
## WorryClim.Y3    3    No party Female    1 22 3.14 0.94 0.20
## WorryClim.Y4    4  Republican Female    1  8 2.62 0.92 0.32
## WorryClim.Y5    5    Democrat   Male    1 19 3.21 0.85 0.20
## WorryClim.Y6    6 Independent   Male    1  2 3.50 0.71 0.50
## WorryClim.Y7    7    No party   Male    1  8 2.50 1.07 0.38
## WorryClim.Y8    8  Republican   Male    1 12 2.00 1.04 0.30

Continuous variables: Descriptive summary table

library(psych)
summary.cont <- climAtt.df %>%
    select(CCDS, KnowClim, WorryClim.Y) %>%
    psych::describe(skew= FALSE, range = FALSE) %>%
    as_tibble(rownames = "Variables")
summary.cont

## # A tibble: 3 × 6
##   Variables    vars     n  mean    sd     se
##   <chr>       <int> <dbl> <dbl> <dbl>  <dbl>
## 1 CCDS            1   127 0.661 1.32  0.117 
## 2 KnowClim        2   127 2.72  0.870 0.0772
## 3 WorryClim.Y     3   128 3.13  1.02  0.0904

Correlation matrix: Continuous predictor variables and outcome variable

library(dplyr)
cor(select(climAtt.df, CCDS, KnowClim, WorryClim.Y) , use = "complete.obs")

##                   CCDS   KnowClim WorryClim.Y
## CCDS         1.0000000 -0.3454072  -0.5123817
## KnowClim    -0.3454072  1.0000000   0.5394312
## WorryClim.Y -0.5123817  0.5394312   1.0000000

3 Visualize the relationship between variables

3.1 Strip or bar charts for categorical predictor variables

Use a strip chart to plot worry scores by a single predictor, e.g., political affiliation. A symbol (e.g., black line) can be added to indicate the mean score of each group.

library(ggplot2)
ggplot(climAtt.df, aes(x = Political, y = WorryClim.Y, color = Political)) + 
  geom_jitter(position = position_jitter(0.05), shape = 16, size = 2, alpha = 0.5) + 
  stat_summary(fun = mean, geom = "point", shape = 95, size = 8, color = "black") +
  scale_x_discrete(name = "Political Affiliation",
    limits = c("Democrat", "Independent", "Republican", "No party")) +
  scale_y_continuous(name = "Mean Worry about Climate Change") +
  ggtitle("Worry about Climate Change by Political Affiliation") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_color_brewer(palette="Dark2")

Use a cluster bar chart to plot worry scores by two predictors, e.g., gender and political affiliation:

library(ggplot2)
ggplot(summary.polit.gen, aes(x = group1, y = mean, fill = group2)) +
  geom_col( 
    position = position_dodge(0.6),
    width = 0.5,
    color = "black") +
  geom_errorbar(aes(ymin = mean-se, ymax = mean+se),
    position = position_dodge(0.6),
    color = "black", 
    width = .1) + 
  scale_x_discrete(name = "Political Affiliation", limits = c("Democrat", "Independent", "Republican", "No party")) +
  scale_y_continuous(name = "Mean Worry about Climate Change") +
  ggtitle("Worry about Climate Change by Gender and Political Affiliation") +
  theme_minimal() +
  theme(legend.position = "top") + 
  scale_fill_manual(values = c(hsv(0.6, .6, 1), hsv(0.4, .6, 1)))

3.2 Scatterplots for continuous predictor variables

Create a scatterplot with a best fit line for each combination of predictor (X) and outcome (Y) variable.

library(ggplot2)
library(scales)
ggplot(climAtt.df, aes(x = CCDS, y = WorryClim.Y)) +          
  geom_point(position = position_jitter(0.05), size = 1, alpha = 0.5) +    
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE) +        
  scale_x_continuous(name = "Climate Change Denial", breaks = pretty_breaks(7)) +   
  scale_y_continuous(name = "Worry about Climate Change") +         
  ggtitle("Climate Worry by Climate Change Denial") + 
  theme_minimal() +
  theme(text = element_text(size = 12))

ggplot(climAtt.df, aes(x = KnowClim, y = WorryClim.Y)) +        
  geom_point(position = position_jitter(0.05), size = 1, alpha = 0.5) + 
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE) +
  scale_x_continuous(name = "Knowledge about Climate Change",
                     breaks  = c(1,2,3,4,5), 
                     labels = c("Not at all", "Slightly", "Moderately", "Considerably", "Completely")) +                scale_y_continuous(name = "Worry about Climate Change") +     
  ggtitle("Climate Worry by Climate Change Knowledge") + 
  theme_minimal() +
  theme(text = element_text(size = 12))

Add a “color” element a ggplot aesthetic (aes) to plot the X - Y relationship at each level of a categorical predictor (X2).

library(ggplot2)
ggplot(climAtt.df, aes(x = KnowClim, y = WorryClim.Y, color = Gender)) + # color = groups   
  geom_point(position = position_jitter(0.05), size = 1, alpha = 0.5) +   
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE) +    
  scale_x_continuous(name = "Knowledge about Climate Change",
                     breaks  = c(1,2,3,4,5), 
                     labels = c("Not at all", "Slightly", "Moderately", "Considerably", "Completely")) + 
  scale_y_continuous(name = "Worry about Climate Change") +    
  ggtitle("Climate Worry by Climate Change Knowledge") + 
  theme_minimal() +
  theme(text = element_text(size = 12))  +
  scale_color_brewer(palette="Dark2")        # many "color pallets" available for ggplot2

For more scatterplot options, see this tutorial by STHDA.

4 Multiple regression model

The population model for climate worry (Y) and three predictors (X), gender, climate denial, and climate knowledge is:

\(Y = \beta_0 + \beta_{gender}X_{gender} + \beta_{denial}X_{denial} + \beta_{knowledge}X_{knowledge} + \epsilon\)

The sample data on these factors is used to obtain the parameter estimates \(b_0\), \(b_{gender}\), \(b_{denial}\), and \(b_{knowledge}\):

\(\hat{Y} = b_0 + b_{gender}X_{gender} + b_{denial}X_{denial} + b_{knowledge}X_{knowledge}\)

Where \(\hat{Y}\) is the predicted value of \(Y\), \(b_0\) is the intercept, and each regression coefficient (\(b\)) is the slope of the relationship (rate of change in \(\hat{Y}\) with changes in \(X\)) when holding all other \(X\) values constant.

For example, given a 1 unit shift in gender (\(X_{gender}\), from female to male), \(b_{gender}\) is the change in predicted climate worry (\(\hat{Y}\)), assuming climate denial (\(X_{denial}\)) and climate knowledge (\(X_{knowledge}\)) remain unchanged.

4.1 Statsitical significance tests

The analysis of multiple regression produces two types of inferential test.

The F-statistic, \(F = MS_{Regression} / MS_{Residual}\), tests an omnibus null hypothesis stating that all predictors (X) and the outcome (Y) are not related in the population, that is, all regression coefficients are equal to zero: \(H_0: \beta_{gender} = \beta_{denial} = \beta_{knowledge} = 0\)

The t-statistic, \(t = b_1 / SE_{b_1}\), tests a null hypothesis stating that one predictor (X) and the outcome (Y) are not related in the population, that is, the regression coefficient is equal to zero: \(H_0: \beta_{gender} = 0\), \(H_0: \beta_{denial} = 0\), and \(H_0: \beta_{knowledge} = 0\)

Make a statistical decision to reject a null hypothesis when \(\text{p-value} \le .05\). Remember that p is the probability that the statistic value would be obtained if the null hypothesis is true.

4.2 Mupltiple regression analysis

Use the lm() function to conduct an analysis of multiple regression:

# ListOfResults <- lm(formula = Outcome ~ Predictor1 + Predictor2 + Predictor3, data = data frame)
model.worry <- lm(formula = WorryClim.Y ~ Gender + CCDS + KnowClim, data = climAtt.df)
summary(model.worry)

## 
## Call:
## lm(formula = WorryClim.Y ~ Gender + CCDS + KnowClim, data = climAtt.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1719 -0.5644 -0.0644  0.4098  1.5895 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.01287    0.25259   7.969 9.23e-13 ***
## GenderMale  -0.46977    0.15159  -3.099  0.00241 ** 
## CCDS        -0.23593    0.05788  -4.076 8.15e-05 ***
## KnowClim     0.52576    0.08474   6.205 7.65e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7674 on 123 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4543, Adjusted R-squared:  0.441 
## F-statistic: 34.13 on 3 and 123 DF,  p-value: 4.025e-16

# note, the above model does not test for interactions between predictors
# connect predictors in the model with * to test individual factors and interactions
  # model.worry <- lm(formula = WorryClim.Y ~ Gender + CCDS*KnowClim, data = climAtt.df)
# connect predictors in the model with : to test interactions only
  # model.worry <- lm(formula = WorryClim.Y ~ Gender + CCDS:KnowClim, data = climAtt.df)

# show the complete source table for the analysis of regression (F-test)
# library(car)
# Anova(model.worry)

The regression coefficients \(b_{gender} = -0.47\), \(b_{denial} = -0.24\), and \(b_{knowledge} = 0.53\) are shown in the “Estimate” column. The regression coefficient divided by its standard error (“Std. Error” column) gives the t-test value (“t value” column). The p-value for the t-test on each coefficient is shown in the “Pr(>t)” column.

Also shown is the residual standard error = 0.77 (standard error of estimate), adjusted R squared = 0.44 (coefficient of determination adjusted for the number of predictors), and the F-test on the omnibus null hypothesis that all regression coefficients equal to 0, \(F(3,123) = 34.13\), \(p = 4.025^{-16}\).

4.3 Model assumptions and considerations

Plot the residuals to identify possible deviations from the assumptions about linearity and homogeneity of variance. Use this brief guide on assessing residual plots: Link

res.worry <- resid(model.worry)       # get list of residuals
plot(fitted(model.worry), res.worry)  # produce residual plot
abline(0,0)                           # add horizontal line at Y = 0

Create a quantile-quantile (QQ) plot to identify possible deviations from the normality assumption. Use this brief guide on assessing QQ plots: Link

model.residuals = rstandard(model.worry) # compute the standardized residuals
qqnorm(model.residuals,                  # plot residuals
       main = "QQ Polt")                 # plot title
qqline(model.residuals)                  # add QQ line

Multicollinearity is the degree of intercorrelation between predictor variables in a multiple regression model. Some relatedness between predictors is expected, but high intercorrelation indicates that one or more predictor variables is adding redundant information to the model. When redundant predictors are identified, factors may be removed, combined, or centered as a way to remedy the problem.

Check for redundant predictor variables in the model using variance inflation factors (VIF). A VIF = 1 indicates that a factor is uncorrelated with other factors, higher values indicate greater correlation (VIF > 2 may indicate a problem). Use this brief guide on assessing variance inflation factors: Link

library(car)
vif(model.worry)

##   Gender     CCDS KnowClim 
## 1.097070 1.242607 1.162743

5 Reporting the results

In an APA style paper, the results could be reported as follows:

An analysis of multiple linear regression showed that gender, climate denial, and climate knowledge predicted a significant amount of variance in climate worry, \(F(3,123) = 34.13\), \(p < .001\), \(R_{adj.}^2 = 0.441\). Male participants reported significantly less climate worry than female participants, \(b = -0.47\), \(SE = 0.15\), \(p = .002\). Participants reported significantly greater climate worry as knowledge about climate change increased, \(b = 0.53\), \(SE = 0.08\), \(p < .001\), and as climate denial decreased, \(b = -0.24\), \(SE = 0.06\), \(p < .001\).

6 Appendix A: Full survey quesitons

Self-identified gender

Female
Male
Non-binary

When it comes to political parties in the United States, how would you describe yourself?

Democrat
Independent
Libertarian
Republican
Other party
No party

Please indicate your belief about how much harm climate change is likely to cause: People all over the world

1=No harm
2=Slight harm
3=Moderate harm
4=Considerable harm
5=Great harm

From what you have heard or read, do you believe increases in the Earth’s temperature over the last century are due more to:

1=The effects of pollution from human activities
2=Natural changes in the environment that are not due to human activities

Just your impression, which one of the following statements do you think is most accurate?

1=Most scientists believe that global warming is occurring
2=Most scientists believe that global warming is not occurring
3=Most scientists are unsure about whether global warming is occurring or not

Compared to other people, how knowledgeable are you about the issue of global warming?

1=Not at all
2=Slightly
3=Moderately
4=Considerably
5=Completely

I have worrisome thoughts about global climate change.

1=Never
2=Rarely
3=Sometimes
4=Often
5=Almost Always

Multiple Regression in R

With categorical and continuous predictor variables

Andrew Brandt, PhD

Experimental Psychologist

User Experience Researcher