POL51
October 31, 2024
Why causality? And what is it?
Fake data
The problem with causality
We don’t know what’s going to happen in the future
Or in places/cases where we don’t have data
Even in cases where we have data – what’s our best guess?
We can use models to make decisions informed by patterns in the data
World Bank: what would happen to Jamaica if their GDP went up by 10k?
How to program, visualize data, model relationships, etc.
Look at all the functions you “learned”:
group_by, tally, summarise, filter, mutate, %>%, distinct, lm, augment, tidy, ggplot, facet_wrap, …
There are thousands more!
Use models to estimate the relationship between X and Y:
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 54.0 0.315 171. 0
2 gdpPercap 0.000765 0.0000258 29.7 3.57e-156
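A minimal sketch of the kind of code that produces a table like this, assuming the gapminder data (the source of gdpPercap) and the broom package:

```r
library(gapminder)  # assumed data source: has gdpPercap and lifeExp
library(broom)      # for tidy()

# fit a linear model of life expectancy on GDP per capita
gdp_model = lm(lifeExp ~ gdpPercap, data = gapminder)

# tidy() turns the fitted model into a table of estimates like the one above
tidy(gdp_model)
```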
But is this relationship causal? Would an increase in GDP cause an increase in life expectancy, on average?
“How do we know if X causes Y?”
Are our estimates causal? Academics fight about this all day!
This question is at the center of causal inference
We will learn why it is so difficult to establish causality with data
We will also learn potential solutions
International relations: do peace-keepers reduce conflict in countries emerging from civil war, or are they ineffective?
Comparative politics: do elections reduce or increase corruption?
American politics: do voter ID laws hurt general turnout?
Many of the interesting questions people want to answer with data are causal
Some are not:
Instagram might want to know: “Is there a person in this photo?”
But it does not care about what factors cause the photo to contain a person
Depends on the question; most why questions are causal
One of our comparative advantages
Not just academic; companies, governments, NGOs also need to answer “why” questions
Does this policy work (or not)? Did it do what was intended? How effective or counterproductive was it?
In this class, we say X causes Y if…
An intervention that changes the value of X produces a probabilistic change in Y
Intervention = X is being changed or altered
Probabilistic = Y should change, on average, but need not in every instance
How do the two parts of our definition fit here?
Aspirin causes a reduction in fever symptoms
Intervention = someone takes aspirin, we administer aspirin, we sneak it into someone’s food, etc.
Probabilistic = Taking aspirin doesn’t work 100% of the time; but in general, on average, more often than not, taking aspirin \(\rightarrow\) less fever
What about this example? Democratic institutions reduce the incidence of interstate war
We’ve seen this before:
Variable | Meaning | Examples |
---|---|---|
Y | The thing that is affected by a treatment | Employment, turnout, violence |
X | The thing changing an outcome | Min. wage laws, voter ID laws, peacekeepers |
A heavier car has to work harder to get from A to B
A bigger house is more desirable
States that pass these laws are different from states that don’t pass these laws
To make matters worse, correlation is commonplace in nature:
We can’t directly observe a change in X causing a change in Y
This is true even in experiments, where we directly manipulate stuff
All we can see are correlations between X and Y
Some of those correlations are causal; some are not; how can we tell?
People are obsessed with “correlation does not mean causation”
But sometimes it does! That’s the tricky part
rnorm()
The rnorm() function draws random numbers from a normal distribution
Generate 10 random numbers, most of which are +/- 2 away from 10:
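A call along these lines (a sketch, not necessarily the exact code used in class) produces draws like the ones below:

```r
# 10 draws from a normal distribution centered at 10, with standard deviation 2
rnorm(n = 10, mean = 10, sd = 2)
```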
[1] 11.906104 10.517202 8.150035 10.851565 12.973255 10.051876 8.300816
[8] 10.742381 8.096446 7.913750
Generate 5 random numbers, most of which are +/- 10 away from 50:
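For example, a sketch of the call:

```r
# 5 draws centered at 50, with standard deviation 10
rnorm(n = 5, mean = 50, sd = 10)
```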
Say we wanted to simulate data to show that doing well in elections (treatment) causes increases in campaign fundraising (outcome)
That is, that funders reward winners
First step is generating fake elections results (treatment)
Let’s say we have 500 fake elections that are pretty close to 50 (a tie), plus or minus 5 points
We can use tibble() and rnorm() to simulate variables
elections (treatment) = 500 elections, vote share is 50% of the vote on average, +/- 5%
Next we want to make up campaign fundraising (outcome)
Let’s say parties raise about 20k on average, plus or minus 4k
library(tidyverse)  # loads tibble(), the pipe, ggplot(), etc.
fake_election = tibble(party_share = rnorm(n = 500, mean = 50, sd = 5),
                       funding = rnorm(n = 500, mean = 20000, sd = 4000))
fake_election
# A tibble: 500 × 2
party_share funding
<dbl> <dbl>
1 46.3 10644.
2 44.9 24916.
3 51.8 20045.
4 50.8 19338.
5 45.3 24607.
6 53.9 15414.
7 42.8 24798.
8 68.3 15992.
9 45.1 18680.
10 39.8 16282.
# ℹ 490 more rows
We want to make it so that getting more votes increases campaign donations (winners attract funding)
Say for every percent of the vote a party gets, they get 2,000 (USD) more in donations
this is the causal effect of vote share (treatment) on campaign donations (outcome) \(\rightarrow\)
2,000 (USD) per percent of the vote a party gets in an election
We can do this in R like so:
fake_election = tibble(party_share = rnorm(n = 500, mean = 50, sd = 5),
                       funding = rnorm(n = 500, mean = 20000, sd = 4000) + 2000 * party_share)
fake_election
# A tibble: 500 × 2
party_share funding
<dbl> <dbl>
1 47.6 115978.
2 50.0 119467.
3 48.7 121534.
4 45.1 111531.
5 54.8 131535.
6 52.2 127072.
7 45.4 106122.
8 54.8 131403.
9 55.9 134592.
10 44.4 114907.
# ℹ 490 more rows
As an equation, this looks like: \(Funding_{i} = 2000 \times Party\_share_{i} + \epsilon_{i}\), where \(\epsilon_{i}\) is the baseline fundraising noise (about 20k on average, plus or minus 4k)
Is our fake data convincing? We can plot it to see:
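One way to make that plot, as a sketch using ggplot() with the fake_election tibble from above (the axis labels are my own):

```r
# scatterplot of vote share against funding, with a linear trend line
ggplot(fake_election, aes(x = party_share, y = funding)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Party vote share (%)", y = "Campaign funding (USD)")
```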
We know the effect of vote share on campaign donations: it’s 2k per percent of the vote
Can we use a model to get that estimate back?
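A sketch of that check, using lm() and tidy() from the broom package: the slope on party_share should come back close to 2,000.

```r
# regress funding on vote share and inspect the estimates
lm(funding ~ party_share, data = fake_election) %>%
  tidy()
```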
Using the steps I just went through above, make up some data that pre-confirms some pattern about the world you wish were true:
Change the treatment and outcome variables in the code to ones of your choosing
Alter the parameters in rnorm() so the values make sense for your variables (e.g., what is a reasonable distribution for age?)
Make a scatterplot with a trend line – use labs() to help us make sense of the plot axes!
Post your idea in Slack, winner (creativity + accuracy) will get small extra credit
Does the spread of democracy reduce international conflict?
Theory: war is costly and the costs are borne by citizens; countries where citizens have more input \(\rightarrow\) less conflict
X here is whether the country is a democracy (versus autocracy)
Y is the number of wars the country is involved in
Ideal, imaginary approach: take a country, look at Y when democracy = 1, and then when democracy = 0
Do this for all countries, take the average
This magical world where we can compare the number of wars when a country is a democracy versus when it is not is called the world of potential outcomes
country | democracy | war |
---|---|---|
Canada | 0 | 3 |
Canada | 1 | 2 |
China | 0 | 3 |
China | 1 | 2 |
USA | 0 | 2 |
USA | 1 | 2 |
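As a sketch of the logic (using the hypothetical numbers from the table above): if we really could observe both potential outcomes for every country, the effect of democracy would just be the difference in average wars across the two worlds.

```r
# hypothetical potential-outcomes data: both worlds observed for every country
potential_outcomes = tribble(
  ~country, ~democracy, ~war,
  "Canada", 0, 3,
  "Canada", 1, 2,
  "China",  0, 3,
  "China",  1, 2,
  "USA",    0, 2,
  "USA",    1, 2
)

# average number of wars when democracy = 1 versus democracy = 0
potential_outcomes %>%
  group_by(democracy) %>%
  summarise(avg_war = mean(war))
```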
In reality, we can only observe democracy at one value for each country
The US is a democracy, we can observe wars when democracy = 1, but not when democracy = 0
China is not a democracy, we can observe wars when democracy = 0, but not when democracy = 1
We only observe the world in the second table below (the observed data), not the first (the full set of potential outcomes)
country | democracy | war |
---|---|---|
Canada | 0 | 3 |
Canada | 1 | 2 |
China | 0 | 3 |
China | 1 | 2 |
USA | 0 | 2 |
USA | 1 | 2 |
country | democracy | war |
---|---|---|
Canada | 0 | NA |
Canada | 1 | 2 |
China | 0 | 3 |
China | 1 | NA |
USA | 0 | NA |
USA | 1 | 2 |
We have missing data on “what would have happened” had the US been an autocracy
“what would have happened” \(\rightarrow\) the counterfactual
Our goal in causal inference is to make as good a guess as possible as to what Y would have been had democracy = 0 instead of 1 (and vice versa)
Why not just compare the number of wars for countries where democracy = 0 (autocracies) versus the countries where democracy = 1 (democracies)?
If democracies fight less, then democracy reduces conflict
Implicitly, this means we are saying the countries that are autocracies are good counterfactuals for the countries that are democracies (and vice versa)
For instance, that China is a good counterfactual for the US
But China and the US are different in so many ways! They are not good counterfactuals of one another
We will see exactly why this is a problem in the following weeks
In an experiment, we randomly expose participants to some treatment, while others are exposed to nothing (or a placebo)
Person | Shown an ad? | Democrats thermometer |
---|---|---|
1 | Yes | 105.64 |
2 | No | 4.92 |
3 | Yes | 33.22 |
4 | No | 33.88 |
5 | No | 87.75 |
We then compare the outcome of those who did and didn’t get the treatment
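To connect this to the fake-data approach above, here is a sketch of a simulated experiment (the variable names and the 10-point effect of the ad are made up for illustration): randomly assign the ad, build in an effect, and compare group averages.

```r
# simulate 500 people; randomly decide who is shown the ad
fake_experiment = tibble(
  shown_ad = sample(c("Yes", "No"), size = 500, replace = TRUE),
  # baseline thermometer scores, plus a made-up 10-point boost for those shown the ad
  therm = rnorm(n = 500, mean = 50, sd = 20) + if_else(shown_ad == "Yes", 10, 0)
)

# compare the average outcome across the two groups
fake_experiment %>%
  group_by(shown_ad) %>%
  summarise(avg_therm = mean(therm))
```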
Experiments have the same “problem” as the democracy and war example: we can’t observe the same person seeing and not seeing the ad
Person | Shown an ad? | Democrats thermometer |
---|---|---|
1 | Yes | 105.64 |
2 | No | 4.92 |
But since the experimental ad was randomly assigned, the people that did and didn’t see the ad are good counterfactuals of one another
It is, by definition, just as likely that a person who received the treatment could have instead received the control/placebo (note how this differs from democracy \(\rightarrow\) war)
This is why experiments are known as the gold standard of research
We have control over treatment, and we randomize it
With observational data that already exists in the world, we don’t have control over treatment, and we can’t randomize it
Experiments are great, but not feasible or ethical in most cases
So we’ll have to take other measures to try to overcome these problems with observational data
With a neighbor, think through the counterfactual scenarios in these examples. What is the implicit counterfactual? What would a good counterfactual look like?
A study on whether being a victim of a crime makes someone more supportive of authoritarian leaders
A study on whether post-Dobbs abortion restrictions reduced abortions