Causality

In-class example

Here’s the code we’ll be using in class. Download it and store it with the rest of your materials for this course. If simply clicking doesn’t trigger download, you should right-click and select “save link as…”

class: make up a relationship

We use rnorm to simulate data. Three arguments: number of draws, mean, standard deviation:

rnorm(n = 5, mean = 10, sd = 2)
[1]  8.581553 15.015173 10.039090 10.262454  8.402416

We made up this data:

fake_election = tibble(party_share = rnorm(n = 500, mean = 50, sd = 5),
                       funding = rnorm(n = 500, mean = 20000, sd = 4000) + 2000 * party_share)
fake_election
# A tibble: 500 × 2
   party_share funding
         <dbl>   <dbl>
 1        48.1 122947.
 2        49.2 118628.
 3        47.1 114853.
 4        53.2 132646.
 5        49.8 120458.
 6        53.6 126049.
 7        45.5 115563.
 8        45.7 120630.
 9        45.9 118284.
10        53.1 125523.
# ℹ 490 more rows

We can plot it:

ggplot(fake_election, aes(x = party_share, y = funding)) + geom_point() + geom_smooth(method = "lm")

Notice we made the causal effect equal 2000 dollars per percent of the vote won. We can estimate this and get pretty close using OLS:

mod = lm(funding ~ party_share, data = fake_election)
tidy(mod)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   19261.    1747.       11.0 1.93e- 25
2 party_share    2017.      34.9      57.9 3.55e-223

It’s close but not perfect because there is “noise” in our data. These numbers are randomly generated!

class: confounds

Here we want to make it so a third variable, “the south”, confounds the relationship between the number of waffle houses in a state and the divorce rate:

fake = tibble(south = sample(c(0, 1), size = 50, replace = TRUE),
              waffle = rnorm(n = 50, mean = 20, sd = 4) + 10 * south,
              divorce = rnorm(n = 50, mean = 20, sd = 2) + 8 * south)
fake
# A tibble: 50 × 3
   south waffle divorce
   <dbl>  <dbl>   <dbl>
 1     1   31.7    30.2
 2     1   24.1    24.7
 3     1   26.2    30.3
 4     0   18.4    19.0
 5     0   23.3    21.6
 6     1   33.2    26.1
 7     1   30.7    28.1
 8     0   11.7    19.9
 9     0   22.1    15.5
10     0   19.0    21.3
# ℹ 40 more rows

We can plot:

ggplot(fake, (aes(x = waffle, y = divorce))) + geom_point() + geom_smooth(method = "lm")

We can model to retrieve the confounded estimate:

lm(divorce ~ waffle, data = fake) |> broom::tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic     p.value
  <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)   12.5      2.07        6.03 0.000000226
2 waffle         0.443    0.0823      5.38 0.00000218