Correlation and regression

In-class example

Here’s the code we’ll be using in class. Download it and store it with the rest of your materials for this course. If simply clicking doesn’t trigger download, you should right-click and select “save link as…”

Drawing lines (geom_smooth)

Why draw trend lines? Trend lines give us a good, educated guess as to what the value of a Y variable is given some value of X. We can draw a trend line (or line of best fit) using geom_smooth, as below. Notice method = "lm".

# libraries
library(tidyverse)

# mtcars
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() +
  geom_smooth(method = "lm")

ggcorrplot

# libraries
library(tidyverse)
library(socviz)
library(fivethirtyeight)
library(gapminder)
library(nycflights13)
library(ggcorrplot)
library(juanr)
library(palmerpenguins)

Look at the correlations:

# switch out gapminder with a dataset you want below
therm %>%
  # correlation only works with numeric columns; keep only those
  select(where(is.numeric)) %>%
  # the cor() function doesn't take NA; drop them all
  drop_na() %>%
  # get the correlation
  cor() %>%
  # plot the correlation
  ggcorrplot(lab = TRUE)

Draw the line

# draw a line: alter the intercept and slope in geom_abline()
# to draw the line
  ggplot() + geom_abline(intercept = 1, slope = 2, size = 1) +
  # change the limits on the x and y-axis
  scale_x_continuous(limits = c(-10, 10)) +
  scale_y_continuous(limits = c(-10, 10)) +
  # add a vertical and horizontal line at 0
  geom_hline(yintercept = 0, lty = 2) +
  geom_vline(xintercept = 0, lty = 2) +
  theme_bw()

Convince yourself

Make the scatterplot:

# convince yourself about the line of best fit: run this code below
# set the seed
set.seed(1990)

# make the fake data
df = tibble(x = rnorm(50, mean = 10),
            y = 3 + 2*x + rnorm(50))

# line of best fit?
model = lm(y ~ x, df)
true = tibble(true_intercept = coef(model)[1],
              true_slope = coef(model)[2])

ggplot(data = df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm")

IR Econ

The plot

ir_1959 = trade %>%
  filter(year == 2008)


ggplot(ir_1959, aes(x = imports, y = exports,
                    label = country)) + geom_point() + geom_smooth(method = "lm") +
  geom_text()

Estimate a model, and interpret:

library(broom)

igo_pop = lm(exports ~ pop, data = trade)

tidy(igo_pop)
## # A tibble: 2 × 5
##   term            estimate     std.error statistic  p.value
##   <chr>              <dbl>         <dbl>     <dbl>    <dbl>
## 1 (Intercept) 13322.       1023.              13.0 1.81e-38
## 2 pop             0.000393    0.00000920      42.7 0

Penguins regression:

penguins_model = lm(body_mass_g ~ species,
                    data = penguins)

tidy(penguins_model)
## # A tibble: 3 × 5
##   term             estimate std.error statistic   p.value
##   <chr>               <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)        3701.       37.6    98.4   2.49e-251
## 2 speciesChinstrap     32.4      67.5     0.480 6.31e-  1
## 3 speciesGentoo      1375.       56.1    24.5   5.42e- 77

Interpretation:

  • Chinstrap penguins weigh 32 more grams, on average, than Adelie penguins.
  • Gentoo penguins weigh 1,375 more grams, on average, than Adelie penguins.
  • Adelie penguins weigh, on average, 3,700 grams.

another one:

lm(tvhours ~ race, data = gss_cat) %>%
  tidy()
## # A tibble: 3 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  2.76       0.0792    34.9   3.90e-253
## 2 raceBlack    1.42       0.100     14.1   5.90e- 45
## 3 raceWhite    0.00894    0.0838     0.107 9.15e-  1

Interpretation:

  • Black respondents watch 1.42 more hours of tv, on average, than respondents who identify as “Other”.
  • White respondents watch .009 more hours of tv, on average, than respondents who identify as “Other”.
  • Respondents who identify as “Other” watch, on average, 2.76 hours of TV.