Uncertainty

POL51

Juan Tellez

UC Davis

November 21, 2024

Plan for today

Why are we uncertain?

Sampling

Good and bad samples

Where things stand

So far: worrying about causality

How can we know that the effect of X on Y is not being confounded by something else?

Last bit: how confident are we in our estimates given…

that our estimates are based on samples?

Uncertainty in the wild

The “bounds” in geom_smooth tell us something about how confident we should be in the line:

Uncertainty

Polling error, margins of error, and uncertainty bounds all help to quantify how uncertain we feel about an estimate

Vague sense that we are uncertain about what we are estimating

But why are we uncertain? And how can uncertainty be quantified?

Why are we uncertain?

  • We don’t have all the data we care about

  • We have a sample, such as a survey, or a poll, of a population

  • Problem: each sample will look different, and give us a different answer to the question we are trying to answer

Sample locations in Guatemala

What’s going on here? terminology

Term | Meaning | Example
Population | All of the instances of the thing we care about | American adults
Population parameter | The thing about the population we want to know | Average number of kids among American adults
Sample | A subset of the population | A survey
Sample estimate | Our estimate of the population parameter | Average number of kids in survey

Boring example: kids

How many children does the average American adult have? (Population parameter)

Let’s pretend there were only 2,867 people living in the USA, and they were all perfectly sampled in gss_sm

A few rows from gss_sm
year | age | childs | degree | race | sex
2016 | 38 | 1 | High School | Black | Female
2016 | 47 | 3 | Graduate | White | Male
2016 | 23 | 0 | High School | White | Male
2016 | 57 | 3 | High School | White | Male
2016 | 35 | 2 | Bachelor | White | Female

Boring example: kids

How many children does the average American adult have?

We can get the exact answer, since there are only 2,867 Americans, and they’re all in our data:

gss_sm %>% 
  summarise(avg_kids = mean(childs, na.rm = TRUE))
avg_kids
1.85

The true average number of children in the US is 1.85 (Population parameter)

Sampling

Now imagine that instead of having data on every American, we only have a sample of 10 Americans

Why do we have a sample? Because interviewing every American is prohibitively costly

Same way a poll works: a sample to estimate American public opinion

Sampling

One sample of 10 people using rep_sample_n() from moderndive:

gss_sm %>% rep_sample_n(size = 10, reps = 1)
# A tibble: 10 × 33
# Groups:   replicate [1]
   replicate  year    id ballot       age childs sibs  degree race  sex   region
       <int> <dbl> <dbl> <labelled> <dbl>  <dbl> <lab> <fct>  <fct> <fct> <fct> 
 1         1  2016   422 1             31      2 4     Lt Hi… Black Fema… South…
 2         1  2016  2101 3             65      2 3     High … White Fema… E. No…
 3         1  2016  1166 3             31      3 6     High … White Male  South…
 4         1  2016  2408 1             67      2 0     High … White Male  Pacif…
 5         1  2016  1148 2             52      0 2     High … Other Male  South…
 6         1  2016  1988 1             84      2 1     Bache… White Male  South…
 7         1  2016   755 2             31      2 7     Lt Hi… Other Fema… South…
 8         1  2016  2361 3             62      3 2     Gradu… White Fema… Mount…
 9         1  2016  1515 2             23      0 0     High … White Fema… New E…
10         1  2016   923 1             66      4 1     Bache… White Male  Mount…
# ℹ 22 more variables: income16 <fct>, relig <fct>, marital <fct>, padeg <fct>,
#   madeg <fct>, partyid <fct>, polviews <fct>, happy <fct>, partners <fct>,
#   grass <fct>, zodiac <fct>, pres12 <labelled>, wtssall <dbl>,
#   income_rc <fct>, agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>,
#   religion <fct>, bigregion <fct>, partners_rc <fct>, obama <dbl>

Note

size = size of the sample; reps = number of samples

Sample estimate

We can then calculate the average number of kids among that sample of 10 people

gss_sm %>% rep_sample_n(size = 10, reps = 1) %>% 
  summarise(avg_kids = mean(childs, na.rm = TRUE))
replicate  avg_kids
1          2.3

This is our sample estimate of the population parameter

Notice that it does not equal the true population parameter (1.85)

The trouble with samples

Problem: each sample will give you a different estimate. Instead of taking 1 sample of size 10, let’s take 1,000 samples of size 10:

kids_10 = gss_sm %>% rep_sample_n(size = 10, reps = 1000) %>% 
  summarise(avg_kids = mean(childs, na.rm = TRUE))
kids_10
replicate  avg_kids
1          1.6
2          1.8
3          1.3
4          2.8
5          1.9
6          2.3
7          1.3
8          1.7
9          0.5
10         1.5
# ℹ 990 more rows

The trouble with samples

Across 1,000 samples of 10 people each, the estimated average number of kids can vary between 0.4 and 3.5!! Remember, the true average is 1.85
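
To put a number on that spread, here is a minimal sketch in base R. It does not use the real gss_sm data: pop_kids is a hypothetical, simulated stand-in population (Poisson counts with mean 1.85), and replicate() plays the role of rep_sample_n(). The exact numbers will therefore differ from the ones above.

```r
set.seed(51)  # make the random draws reproducible

# Hypothetical stand-in population: 2,867 "people" with a count of kids each.
# This is simulated data, NOT the real gss_sm.
pop_kids <- rpois(2867, lambda = 1.85)
true_mean <- mean(pop_kids)

# Take 1,000 samples of size 10 and keep the mean of each one
sample_means <- replicate(1000, mean(sample(pop_kids, size = 10)))

# Smallest and largest estimate across the 1,000 samples
range(sample_means)

# Share of estimates that miss the true mean by more than half a kid
mean(abs(sample_means - true_mean) > 0.5)
```

The last line is one simple way to summarise "how often are we way off?"; the 0.5 cutoff is an arbitrary choice for illustration.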

🚨 Your turn: Views on abortion 🚨

Pretend the gss_abortion dataset from stevedata captures how every American feels about abortion:

  1. Find the average level of support for one of the abortion questions.

  2. Now, take 1,000 samples each of size 10 and calculate the average for each sample. How much do your estimates vary from sample to sample? What’s the min/max?

  3. Plot your sample estimates as a geom_histogram or geom_density.

Why are we uncertain?

We only ever have a sample (10 random Americans), but we’re interested in something bigger: a population (the whole of gss_sm)

Polls often ask a couple thousand people (if that!), and try to infer something bigger (how all Americans feel about the President)

The problem: Each sample is going to give us different results!

Especially worrisome: some estimates will be way off, totally by chance

A problem for regression

This is also a problem for regression, since every regression estimate is based on a sample

What’s the relationship between sex and vote choice among American voters? (pretend gss_sm = whole US)

library(broom)  # tidy() comes from the broom package
mod1 = lm(obama ~ sex, data = gss_sm)
tidy(mod1)
term         estimate  std.error  statistic  p.value
(Intercept)  0.576     0.0177     32.6       5.84e-182
sexFemale    0.0871    0.0234     3.72       0.000206

So females were 8.7 percentage points more likely to vote for Obama than males

Regression estimates vary too

Fit a model to samples and you will get a different answer depending on who happens to be in the sample

replicate term estimate
1 sexFemale -0.13
2 sexFemale 0.06
3 sexFemale -0.62
4 sexFemale 0.07
5 sexFemale -0.07
6 sexFemale 0.18
7 sexFemale 0.02
8 sexFemale 0.05

Wrong effect estimates

Many of the effects we estimate from samples are even negative! That is the opposite sign of the population parameter (0.087)
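
A rough sketch of how estimates like these can be generated, in base R. Everything here is an assumption for illustration: pop is a simulated stand-in population (not gss_sm), the vote probabilities 0.66 and 0.58 are made up, fit_one_sample() is a hypothetical helper, and the sample size of 50 is arbitrary.

```r
set.seed(51)

# Hypothetical stand-in population (simulated, NOT the real gss_sm):
# women vote for Obama with probability 0.66, men with probability 0.58
n_pop <- 2867
pop <- data.frame(
  sex = factor(sample(c("Male", "Female"), n_pop, replace = TRUE),
               levels = c("Male", "Female"))
)
pop$obama <- rbinom(n_pop, size = 1,
                    prob = ifelse(pop$sex == "Female", 0.66, 0.58))

# The "population parameter": the coefficient when we use everyone
pop_coef <- coef(lm(obama ~ sex, data = pop))["sexFemale"]

# Fit the same model to one random sample of n people, return the coefficient
fit_one_sample <- function(n) {
  s <- pop[sample(nrow(pop), n), ]
  coef(lm(obama ~ sex, data = s))["sexFemale"]
}

# Repeat for 1,000 samples of 50 people each
estimates <- replicate(1000, fit_one_sample(50))

pop_coef
range(estimates)
mean(estimates < 0)  # share of samples where the estimate has the wrong sign
```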

The solution

  • So how do we know if our sample estimate is close to the population parameter?

  • Turns out that if a sample is random, representative, and large…

  • …then the LAW OF LARGE NUMBERS tells us that…

  • the sample estimate will be pretty close to the population parameter

Law of large numbers

With a small sample, estimates can vary a lot:

As the sample size (N) increases, estimates begin to converge:

They become more concentrated around the population average…

And eventually it becomes very unlikely the sample estimate is way off
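
The same pattern can be checked numerically. The sketch below is base R on a simulated stand-in population (pop_kids is hypothetical, not gss_sm); spread_at() is a helper defined only for this illustration, and the four sample sizes are arbitrary.

```r
set.seed(51)

# Hypothetical stand-in population (simulated, NOT the real gss_sm)
pop_kids <- rpois(2867, lambda = 1.85)

# Standard deviation of 1,000 sample means at a given sample size
spread_at <- function(n) {
  sd(replicate(1000, mean(sample(pop_kids, size = n))))
}

sizes <- c(10, 50, 100, 500)
spreads <- sapply(sizes, spread_at)

# The spread of the estimates shrinks as N grows
data.frame(sample_size = sizes, sd_of_estimates = round(spreads, 3))
```

A smaller standard deviation across samples means the estimates are bunched more tightly around the population average.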

This works for regression estimates, too

Regression estimates also become more precise as sample size increases:

What’s going on?

The larger our sample, the less likely it is that our estimate (average number of kids, the effect of sex on vote choice, etc.) is way off

This is because as sample size increases, sample estimates tend to converge on the population parameter

Next time: we’ll see how to quantify uncertainty based on this tendency

Intuition: the more data we have, the less uncertain we should feel

But this only works if we have a good sample

Good and bad samples

There are good and bad samples in the world

Good sample: representative of the population and unbiased

Bad sample: the opposite of a good sample

What does this mean?

Good samples

When sampling goes wrong

Imagine that in our quest to find out how many kids the average American has, we do telephone surveys

Younger people are less likely to have a landline than older people, so few young people make it onto our survey

What happens to our estimate?

When sampling goes wrong

We can simulate this by again pretending gss_sm is the whole of the US

Let’s imagine the extreme scenario where no one under 25 makes it into the survey:

gss_sm %>% 
  filter(age >= 25) %>%
  rep_sample_n(size = 10, reps = 1000) %>% 
  summarise(avg_kids = mean(childs, na.rm = TRUE))

I’ve filtered out all people in gss_sm under 25 so they cannot be sampled

When sampling goes wrong

As sample size increases, variability of estimates will still decrease

But estimates will be biased, regardless of sample size
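
A minimal sketch of that point in base R. The population is simulated (not gss_sm), and the assumption that older people have more kids is built in by hand so that excluding the under-25s matters; avg_estimate() is a hypothetical helper for this illustration.

```r
set.seed(51)

# Hypothetical stand-in population (simulated, NOT the real gss_sm):
# older people tend to have more kids
n_pop <- 2867
age <- sample(18:89, n_pop, replace = TRUE)
kids <- rpois(n_pop, lambda = pmin(3, (age - 18) / 15))
true_mean <- mean(kids)

# The biased sampling frame: nobody under 25 can be sampled
frame_kids <- kids[age >= 25]

# Average of 1,000 sample means at a given sample size, from the biased frame
avg_estimate <- function(n) {
  mean(replicate(1000, mean(sample(frame_kids, size = n))))
}

sizes <- c(10, 100, 1000)
biased <- sapply(sizes, avg_estimate)

data.frame(sample_size = sizes,
           avg_estimate = round(biased, 2),
           true_mean = round(true_mean, 2))
```

In this setup the average estimate stays above the true mean at every sample size: more data narrows the spread but does not move the centre back to the right place.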

What’s going on?

The sample is not representative of the population (the young people are missing)

This biases our estimate of the population parameter

Randomness is key: everyone needs a similar chance of ending up in the sample

When young people don’t have landlines, not everyone has a similar chance of ending up in the sample

A big problem!
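
Unequal chances do not have to be as extreme as excluding a group entirely. The sketch below, in base R, gives younger people a lower (but nonzero) chance of selection through the prob argument of sample(). The population is simulated (not gss_sm), and the reach weights (people under 35 are one fifth as likely to be reached) are an invented assumption.

```r
set.seed(51)

# Hypothetical stand-in population (simulated, NOT the real gss_sm)
n_pop <- 2867
age <- sample(18:89, n_pop, replace = TRUE)
kids <- rpois(n_pop, lambda = pmin(3, (age - 18) / 15))

# Assumed sampling weights: under-35s are one fifth as likely to be reached
reach <- ifelse(age < 35, 0.2, 1)

# One survey of 500 people, with equal vs. unequal chances of selection
random_survey   <- sample(n_pop, size = 500)
landline_survey <- sample(n_pop, size = 500, prob = reach)

c(population = mean(kids),
  random     = mean(kids[random_survey]),
  landline   = mean(kids[landline_survey]))
```

The landline-style survey skews older, so in this simulated setup it overstates the average number of kids even though it interviews just as many people as the random survey.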

🚨 Your turn: bias the polls 🚨

Using the gss_abortion data again, imagine you are an evil pollster:

  1. Think about who you would have to exclude from the data to create estimates that benefit either the pro-choice or the pro-life side of the abortion debate.

  2. Show that even as the sample size increases and estimate variability decreases, we still get biased results.

Key takeaways

  • We worry that each sample will give us a different answer, and some answers will be very wrong

  • The tendency for sample estimates to approach the population parameter as sample size increases (the law of large numbers) saves us

  • But it all depends on whether we have a random, representative (good) sample; no amount of data in the world will correct for sampling bias