rnorm(n = 5, mean = 10, sd = 2)
[1] 8.581553 15.015173 10.039090 10.262454 8.402416
Here’s the code we’ll be using in class. Download it and store it with the rest of your materials for this course. If simply clicking doesn’t trigger download, you should right-click and select “save link as…”
We use rnorm
to simulate data. Three arguments: number of draws, mean, standard deviation:
We made up this data:
fake_election = tibble(party_share = rnorm(n = 500, mean = 50, sd = 5),
funding = rnorm(n = 500, mean = 20000, sd = 4000) + 2000 * party_share)
fake_election
# A tibble: 500 × 2
party_share funding
<dbl> <dbl>
1 48.1 122947.
2 49.2 118628.
3 47.1 114853.
4 53.2 132646.
5 49.8 120458.
6 53.6 126049.
7 45.5 115563.
8 45.7 120630.
9 45.9 118284.
10 53.1 125523.
# ℹ 490 more rows
We can plot it:
ggplot(fake_election, aes(x = party_share, y = funding)) + geom_point() + geom_smooth(method = "lm")
Notice we made the causal effect equal 2000 dollars per percent of the vote won. We can estimate this and get pretty close using OLS:
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 19261. 1747. 11.0 1.93e- 25
2 party_share 2017. 34.9 57.9 3.55e-223
It’s close but not perfect because there is “noise” in our data. These numbers are randomly generated!
Here we want to make it so a third variable, “the south”, confounds the relationship between the number of waffle houses in a state and the divorce rate:
fake = tibble(south = sample(c(0, 1), size = 50, replace = TRUE),
waffle = rnorm(n = 50, mean = 20, sd = 4) + 10 * south,
divorce = rnorm(n = 50, mean = 20, sd = 2) + 8 * south)
fake
# A tibble: 50 × 3
south waffle divorce
<dbl> <dbl> <dbl>
1 1 31.7 30.2
2 1 24.1 24.7
3 1 26.2 30.3
4 0 18.4 19.0
5 0 23.3 21.6
6 1 33.2 26.1
7 1 30.7 28.1
8 0 11.7 19.9
9 0 22.1 15.5
10 0 19.0 21.3
# ℹ 40 more rows
We can plot:
We can model to retrieve the confounded estimate: