Uncertainty

In-class example

Here’s the code we’ll be using in class. Download it and store it with the rest of your materials for this course. If clicking the link doesn’t trigger a download, right-click it and select “save link as…”

Show that sample estimates vary

Load the data:

# libraries
library(tidyverse)
library(stevedata)
library(moderndive)

# data
gss_abortion
# A tibble: 64,814 × 18
      id  year   age race  sex    hispaniccat  educ partyid       relactiv abany
   <dbl> <dbl> <dbl> <chr> <chr>        <dbl> <dbl> <chr>            <dbl> <dbl>
 1     1  1972    23 White Female          NA    16 Ind,Near Dem        NA    NA
 2     2  1972    70 White Male            NA    10 Not Str Demo…       NA    NA
 3     3  1972    48 White Female          NA    12 Independent         NA    NA
 4     4  1972    27 White Female          NA    17 Not Str Demo…       NA    NA
 5     5  1972    61 White Female          NA    12 Strong Democ…       NA    NA
 6     6  1972    26 White Male            NA    14 Ind,Near Dem        NA    NA
 7     7  1972    28 White Male            NA    13 Ind,Near Dem        NA    NA
 8     8  1972    27 White Male            NA    16 Ind,Near Dem        NA    NA
 9     9  1972    21 Black Female          NA    12 Strong Democ…       NA    NA
10    10  1972    30 Black Female          NA    12 Strong Democ…       NA    NA
# ℹ 64,804 more rows
# ℹ 8 more variables: abdefect <dbl>, abnomore <dbl>, abhlth <dbl>,
#   abpoor <dbl>, abrape <dbl>, absingle <dbl>, pid <dbl>, hispanic <dbl>

What’s the true population parameter? Let’s pretend the dataset is the full population:

# "true" population value of abnomore (treating the dataset as the population)
gss_abortion |> 
  summarise(avg_abnomore = mean(abnomore, na.rm = TRUE))
# A tibble: 1 × 1
  avg_abnomore
         <dbl>
1        0.446

Show that estimates vary across samples:

# take many small samples from the population
samples = gss_abortion |> 
  rep_sample_n(size = 10, reps = 1000) |> 
  summarise(avg_abnomore = mean(abnomore, na.rm = TRUE))

ggplot(samples, aes(x = avg_abnomore)) + 
  geom_histogram()
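
To make the resampling step less of a black box, here is a minimal base-R sketch of the same idea. It uses a simulated yes/no population rather than the GSS data, and the helper name `sample_means` and the sample size of 500 are assumptions chosen for illustration, not part of the class code.

```r
# Hypothetical population (simulated, NOT the GSS data):
# 10,000 yes/no answers with a true proportion near 0.45
set.seed(1)
population <- rbinom(10000, size = 1, prob = 0.45)

# Draw `reps` samples of `size` people and return the mean of each sample
sample_means <- function(pop, size, reps) {
  replicate(reps, mean(sample(pop, size = size)))
}

small <- sample_means(population, size = 10, reps = 1000)
large <- sample_means(population, size = 500, reps = 1000)

# Both sets of estimates center on the population mean,
# but the estimates from larger samples are much less spread out
mean(population)
c(sd_small = sd(small), sd_large = sd(large))
```

Every sample gives a slightly different answer, which is the pattern the histogram above displays; the spread shrinks as each sample gets larger.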

Bias the polls

What’s the true population parameter?

gss_abortion |> 
  summarize(abany = mean(abany, na.rm = TRUE))
# A tibble: 1 × 1
  abany
  <dbl>
1 0.414

Bias the polls:

bias_sample = gss_abortion |> 
  # only poll respondents with 5 or fewer years of education
  filter(educ <= 5) |> 
  rep_sample_n(size = 100, reps = 5000) |> 
  summarize(abany = mean(abany, na.rm = TRUE))


ggplot(bias_sample, aes(x = abany)) + 
  geom_histogram() +
  geom_vline(xintercept = .414, linewidth = 2, color = "red")
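
One way to put a number on what the plot shows is to compute the bias directly: the average of the sample estimates minus the true parameter. The sketch below does this in base R on a simulated population. The group sizes and support rates are made-up values for illustration, not estimates from the GSS.

```r
set.seed(2)

# Hypothetical population (simulated): support is lower in the low-education group
n <- 20000
low_educ <- rbinom(n, 1, 0.1) == 1
support  <- rbinom(n, 1, ifelse(low_educ, 0.2, 0.45))

truth <- mean(support)

# 1000 polls of 100 people: drawn from everyone vs. only the low-education group
fair   <- replicate(1000, mean(sample(support, 100)))
biased <- replicate(1000, mean(sample(support[low_educ], 100)))

# Bias = average estimate minus the true parameter
c(bias_fair = mean(fair) - truth, bias_biased = mean(biased) - truth)
```

The fair polls still vary from sample to sample, but they are centered on the truth. The biased polls are centered somewhere else, and taking more of them does not fix that.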

Bootstrapping

Load the data:

# death penalty survey data
issues = read_csv("https://www.dropbox.com/s/x5xncajqsz0q09l/voter-files-issues.csv?dl=1")

Average support for the death penalty in Kansas:

just_kansas = issues |> 
  filter(state == "Kansas")
  
just_kansas |> summarise(avg_death = mean(deathpen_2016, na.rm = TRUE))
# A tibble: 1 × 1
  avg_death
      <dbl>
1     0.571

Bootstrap estimates:

# resample the Kansas rows with replacement, keeping the original sample size
boot_kansas = just_kansas |> 
  rep_sample_n(size = nrow(just_kansas), replace = TRUE,
               reps = 1000) |> 
  summarise(avg_death = mean(deathpen_2016, na.rm = TRUE))
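
For intuition about what the resampling step does, here is a rough base-R equivalent on a simulated sample. The 50 simulated answers and the helper name `boot_once` are assumptions for illustration only; they are not the Kansas data.

```r
set.seed(3)

# Hypothetical sample (simulated): 50 yes/no answers standing in for one state's respondents
x <- rbinom(50, 1, 0.57)

# One bootstrap replicate: resample the same number of observations
# WITH replacement, then recompute the estimate
boot_once <- function(v) mean(sample(v, size = length(v), replace = TRUE))

boot_means <- replicate(1000, boot_once(x))

# The bootstrap distribution centers near the original sample estimate
c(sample_mean = mean(x), boot_center = mean(boot_means))
```

Sampling with replacement is what makes each replicate differ: some observations show up more than once and others not at all, even though every replicate has the same size as the original sample.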

Plot the estimates:

ggplot(boot_kansas, aes(x = avg_death)) + 
  geom_histogram()

Calculate the standard error:

# standard error
boot_kansas |> 
  summarise(standard_error = sd(avg_death))
# A tibble: 1 × 1
  standard_error
           <dbl>
1         0.0702
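
As a sanity check on the idea that the standard error is just the standard deviation of the bootstrap estimates, the sketch below compares it with the textbook formula for a proportion, sqrt(p(1 - p) / n). It runs on a simulated sample of 50 yes/no answers, which is an assumption for illustration and not the Kansas data.

```r
set.seed(4)

# Hypothetical sample (simulated): 50 yes/no answers
x <- rbinom(50, 1, 0.57)
p_hat <- mean(x)

# Bootstrap standard error: standard deviation of the resampled means
boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))
se_boot <- sd(boot_means)

# Textbook standard error of a proportion: sqrt(p * (1 - p) / n)
se_formula <- sqrt(p_hat * (1 - p_hat) / length(x))

c(se_boot = se_boot, se_formula = se_formula)
```

The two numbers should come out close to each other. The advantage of the bootstrap is that the same recipe works for estimates that have no simple formula.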

Calculate the 95% CI:

# 95% CI
boot_kansas |> 
  summarise(low = quantile(avg_death, probs = .025),
            high = quantile(avg_death, probs = .975))
# A tibble: 1 × 2
    low  high
  <dbl> <dbl>
1 0.437 0.714
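
For comparison, a common shortcut is the normal-approximation interval: the estimate plus or minus about 1.96 standard errors. The sketch below plugs in the two numbers reported above (0.571 and 0.0702). It is an alternative approximation, not the percentile method used in the code above.

```r
# Values reported above
estimate <- 0.571   # average support in the Kansas sample
se       <- 0.0702  # bootstrap standard error

# Normal-approximation 95% interval: estimate +/- qnorm(0.975) standard errors
ci_normal <- c(low  = estimate - qnorm(0.975) * se,
               high = estimate + qnorm(0.975) * se)
round(ci_normal, 3)
# low = 0.433, high = 0.709
```

This lands close to the percentile interval of 0.437 to 0.714, which is what we would expect when the bootstrap distribution is roughly bell-shaped.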