Natural Experiments
In-class example
Here’s the code we’ll be using in class. Download it and store it with the rest of your materials for this course. If simply clicking doesn’t trigger download, you should right-click and select “save link as…”.
Closing backdoors with controls
library(ggdag)
library(tidyverse)
library(broom)
# set seed
set.seed(1990)
Remember that one of the lessons from our week on DAGs is that in order to estimate the effect of X on Y, sometimes we need to control for some variables and avoid controlling for others.
Dealing with forks
Here’s an example dealing with forks. The classic fork looks like this:
dag = dagify(Y ~ Z,
X ~ Z,
exposure = "X",
outcome = "Y")
ggdag(dag) + theme_dag_blank()
Notice that we need to control for Z:
ggdag_adjustment_set(dag)
Let’s simulate some fake data from this DAG. Notice that X does not cause Y:
# fake data
fake = tibble(person = 1:200,
z = rnorm(200),
x = rnorm(200) + 2*z,
y = rnorm(200) - 3*z)
And yet when we look at X and Y there’s a relationship!
ggplot(fake, aes(x = x, y = y)) +
geom_point() + geom_smooth(method = "lm")
We can see this relationship in regression too:
# naive regression
lm(y ~ x, data = fake) %>% tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0655 0.125 0.523 6.01e- 1
## 2 x -1.20 0.0524 -23.0 1.07e-57
But notice that when we correcty control for Z, the estimate on X goes to 0:
# with controls
lm(y ~ x + z, data = fake) %>% tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0515 0.0741 0.694 4.88e- 1
## 2 x -0.0245 0.0689 -0.356 7.22e- 1
## 3 z -2.99 0.156 -19.1 7.53e-47
Dealing with pipes
Here’s an example dealing with pipes The classic pipe looks like this:
dag = dagify(Y ~ Z,
Z ~ X,
exposure = "X",
outcome = "Y")
ggdag(dag) + theme_dag_blank()
Notice that we shouldn’t control for Z:
ggdag_adjustment_set(dag)
Let’s simulate some fake data from this DAG. Notice that X -> Y -> Z:
# fake data
fake = tibble(person = 1:200,
x = rnorm(200),
z = rnorm(200) + 2*x,
y = rnorm(200) - 3*z)
When we look at X and Y, there’s a relationship:
ggplot(fake, aes(x = x, y = y)) +
geom_point() + geom_smooth(method = "lm")
We can see this relationship in regression too:
lm(y ~ x, data = fake) %>% tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.302 0.217 1.39 1.65e- 1
## 2 x -6.17 0.235 -26.3 1.81e-66
But notice that when we incorrectly control for Z, the estimate on X goes to 0:
# with controls
lm(y ~ x + z, data = fake) %>% tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.00961 0.0698 0.138 8.91e- 1
## 2 x -0.0667 0.164 -0.406 6.85e- 1
## 3 z -2.99 0.0718 -41.7 1.03e-99
Dealing with colliders
Here’s an example dealing with colliders The classic collider looks like this:
dag = dagify(Z ~ Y + X,
exposure = "X",
outcome = "Y")
ggdag(dag) + theme_dag_blank()
Notice that we shouldn’t control for Z:
ggdag_adjustment_set(dag)
Let’s simulate some fake data from this DAG. Notice that X does not cause Y:
# fake data
fake = tibble(person = 1:200,
x = rnorm(200),
y = rnorm(200),
z = rnorm(200) + 2*x + -3*y)
When we look at X and Y alone, there’s no relationship:
ggplot(fake, aes(x = x, y = y)) +
geom_point() + geom_smooth(method = "lm")
We can see this relationship in regression too:
# naive regression
lm(y ~ x, data = fake) %>% tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.0602 0.0765 -0.788 0.432
## 2 x -0.0991 0.0738 -1.34 0.181
But notice that when we incorrectly control for Z, the estimate on X goes away from 0:
# with controls
lm(y ~ x + z, data = fake) %>% tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.00726 0.0235 0.309 7.58e- 1
## 2 x 0.577 0.0274 21.1 2.62e- 52
## 3 z -0.305 0.00698 -43.7 2.68e-103