Here’s the code we’ll be using in class. Download it and store it with the rest of your materials for this course. If simply clicking doesn’t trigger a download, right-click and select “save link as…”
We use rnorm to simulate data. It takes three arguments: the number of draws, the mean, and the standard deviation:
rnorm(n = 5, mean = 10, sd = 2)
[1] 12.597806 9.037005 8.582153 10.629605 9.079209
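To see what the mean and sd arguments control, one option is to draw a much larger sample and compare its summary statistics with the values requested. This is only a sketch: the seed and the sample size of 10,000 are arbitrary choices, not part of the class code.

```r
# Arbitrary seed (an assumption), used only so reruns give the same draws
set.seed(42)

# Draw many values from a normal distribution with mean 10 and sd 2
draws = rnorm(n = 10000, mean = 10, sd = 2)

# With this many draws, the sample statistics should land near the inputs
mean(draws)
sd(draws)
```

With only 5 draws, as in the example above, the sample mean can sit noticeably far from 10; the match tightens as n grows.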
We made up this data:
fake_election = tibble(party_share = rnorm(n = 500, mean = 50, sd = 5),
                       funding = rnorm(n = 500, mean = 20000, sd = 4000) + 2000 * party_share)
fake_election
# A tibble: 500 × 2
   party_share funding
         <dbl>   <dbl>
 1        53.0 123590.
 2        57.0 136940.
 3        52.3 123661.
 4        43.4 104649.
 5        46.2 117465.
 6        46.1 115563.
 7        46.7 117569.
 8        41.2 106682.
 9        48.9 113562.
10        52.4 125225.
# ℹ 490 more rows
We can plot it:
ggplot(fake_election, aes(x = party_share, y = funding)) + geom_point() + geom_smooth(method = "lm")
Notice that we set the causal effect to 2,000 dollars per percentage point of vote share. We can estimate this effect with OLS and get pretty close:
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   19344.    1815.       10.7 5.05e- 24
2 party_share    2016.      36.1      55.8 1.73e-216
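The code that produced this table is not shown here. A table with these columns is what tidy() from the broom package returns for a fitted lm, so one plausible sketch is the following. It assumes the tibble and broom packages are installed, and it uses an arbitrary seed, so the numbers will not match the table above exactly.

```r
library(tibble)
library(broom)

# Arbitrary seed (an assumption); the original data were drawn without a known seed
set.seed(1)

# Same data-generating process as above: funding rises 2000 per point of party_share
fake_election = tibble(party_share = rnorm(n = 500, mean = 50, sd = 5),
                       funding = rnorm(n = 500, mean = 20000, sd = 4000) + 2000 * party_share)

# Fit OLS and return the coefficients as a tidy table
fit = lm(funding ~ party_share, data = fake_election)
tidy(fit)
```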
It’s close but not perfect because there is “noise” in our data. These numbers are randomly generated!
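To get a feel for how much the noise moves the estimate, one option is to repeat the whole simulation many times and look at the spread of the estimated slopes. The helper simulate_slope() below is hypothetical (it is not part of the class code), and the seed and the 200 repetitions are arbitrary choices.

```r
# Hypothetical helper: simulate one fake election dataset and return the OLS slope
simulate_slope = function(n = 500) {
  party_share = rnorm(n, mean = 50, sd = 5)
  funding = rnorm(n, mean = 20000, sd = 4000) + 2000 * party_share
  coef(lm(funding ~ party_share))[["party_share"]]
}

# Arbitrary seed (an assumption), used only so reruns give the same results
set.seed(2)

# Repeat the simulation 200 times and collect the slope from each run
slopes = replicate(200, simulate_slope())

# The slopes should scatter around the true value of 2000
summary(slopes)
```

Each run gives a slightly different estimate, but the estimates cluster around the effect we built into the data.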
Here we want a third variable, “the South”, to confound the relationship between the number of Waffle Houses in a state and its divorce rate:
fake = tibble(south = sample(c(0, 1), size = 50, replace = TRUE),
              waffle = rnorm(n = 50, mean = 20, sd = 4) + 10 * south,
              divorce = rnorm(n = 50, mean = 20, sd = 2) + 8 * south)
fake
# A tibble: 50 × 3
   south waffle divorce
   <dbl>  <dbl>   <dbl>
 1     0   27.3    18.1
 2     1   27.2    27.6
 3     1   32.8    30.9
 4     0   19.3    16.5
 5     1   25.3    28.6
 6     0   18.6    21.2
 7     0   17.2    17.3
 8     0   21.1    21.0
 9     1   31.7    28.1
10     0   15.0    20.9
# ℹ 40 more rows
We can plot:
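The plotting code is not shown here. A sketch that follows the same pattern as the earlier election plot might look like this. Coloring the points by south is an addition for illustration, not something the original plot is known to do, and the seed is arbitrary.

```r
library(tibble)
library(ggplot2)

# Arbitrary seed (an assumption), used only so reruns give the same data
set.seed(3)

# Same data-generating process as above: south raises both waffle and divorce
fake = tibble(south = sample(c(0, 1), size = 50, replace = TRUE),
              waffle = rnorm(n = 50, mean = 20, sd = 4) + 10 * south,
              divorce = rnorm(n = 50, mean = 20, sd = 2) + 8 * south)

# Scatterplot with one overall fitted line; point color marks southern states
p = ggplot(fake, aes(x = waffle, y = divorce)) +
  geom_point(aes(color = factor(south))) +
  geom_smooth(method = "lm")
p
```

The fitted line slopes upward even though waffle never appears in the formula for divorce; the two clusters of points, one per value of south, are what create the trend.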
We can model to retrieve the confounded estimate:
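The model code is not shown here. One way to sketch it, assuming the tibble and broom packages are installed, is to regress divorce on waffle alone, which gives the confounded estimate. The second model, which adds south as a control, is included only for comparison and is an addition to what the text describes. The seed is arbitrary.

```r
library(tibble)
library(broom)

# Arbitrary seed (an assumption), used only so reruns give the same data
set.seed(3)

# Same data-generating process as above; waffle has no direct effect on divorce
fake = tibble(south = sample(c(0, 1), size = 50, replace = TRUE),
              waffle = rnorm(n = 50, mean = 20, sd = 4) + 10 * south,
              divorce = rnorm(n = 50, mean = 20, sd = 2) + 8 * south)

# Confounded estimate: south is left out, so waffle picks up its effect
naive = lm(divorce ~ waffle, data = fake)
tidy(naive)

# For comparison: controlling for south should pull the waffle estimate toward zero
adjusted = lm(divorce ~ waffle + south, data = fake)
tidy(adjusted)
```

By construction the true effect of waffle on divorce is zero, so any clearly positive waffle coefficient in the first model comes from the confounder rather than from Waffle Houses.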