Relationships

In-class example

Here’s the code we’ll be using in class. Download it and store it with the rest of your materials for this course. If clicking doesn’t trigger a download, right-click and select “save link as…”

Summarize

Let’s load the libraries.

# libraries
library(tidyverse)
library(nycflights13)
library(fivethirtyeight)
library(juanr)

Class: trade

IGOS in 1960:

trade |> 
  filter(year == 1960) |> 
  summarise(med_igos = median(sum_igos))
# A tibble: 1 × 1
  med_igos
     <dbl>
1       33

IGOS in 2010:

trade |> 
  filter(year == 2010) |> 
  summarise(med_igos = median(sum_igos))
# A tibble: 1 × 1
  med_igos
     <dbl>
1       66

At least one sea border:

trade |> 
  filter(year == 2010, sea_borders >= 1) |> 
  summarise(mean_exports = mean(exports, na.rm = TRUE))
# A tibble: 1 × 1
  mean_exports
         <dbl>
1      105389.

No sea border:

trade |> 
  filter(year == 2010, sea_borders == 0) |> 
  summarise(mean_exports = mean(exports, na.rm = TRUE))

Most exports:

trade |> 
  filter(year == 2012) |> 
  filter(exports == max(exports, na.rm = TRUE))
# A tibble: 1 × 10
  country  year  imports  exports     gdp         pop land_borders sea_borders
  <chr>   <dbl>    <dbl>    <dbl>   <dbl>       <dbl>        <dbl>       <dbl>
1 China    2012 2331123. 2494240. 1.79e13 1310926531.           14           4
# ℹ 2 more variables: min_cap_dist <dbl>, sum_igos <dbl>

Class: feeling thermometer

Attitudes towards the police, comparing Democrats and Republicans:

therm |> 
  filter(party_id %in% c("Democrat", "Republican")) |> 
  group_by(party_id) |> 
  summarise(ft_police = mean(ft_police, na.rm = TRUE))
# A tibble: 2 × 2
  party_id   ft_police
  <fct>          <dbl>
1 Democrat        67.8
2 Republican      87.6

NYC Flights examples

Say we wanted to know how late the average flight in our dataset departs, and what the longest departure delay for any flight has been.

## on average, how late are flights in departing?
flights %>%
  summarise(avg_late = mean(dep_delay, na.rm = TRUE),
            most_late = max(dep_delay, na.rm = TRUE))
# A tibble: 1 × 2
  avg_late most_late
     <dbl>     <dbl>
1     12.6      1301

Note the na.rm = TRUE above, and try removing it to see what happens. The problem is that there are missing values (NA) in our data, and R can’t take the average of a set of numbers when some of them are missing. na.rm = TRUE tells R to drop those missing values and use only the observed ones.
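To see this behaviour in isolation, here is a minimal sketch on a small made-up vector. The name delays and its values are illustrative assumptions, not taken from flights.

```r
# toy vector with one missing value (made-up numbers, not from flights)
delays <- c(10, NA, 20)

# without na.rm, the missing value propagates and the result is NA
mean(delays)
# [1] NA

# with na.rm = TRUE, the NA is dropped before averaging: (10 + 20) / 2
mean(delays, na.rm = TRUE)
# [1] 15

# the same argument works for max() and most other summary functions
max(delays, na.rm = TRUE)
# [1] 20
```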

Summarize + group_by()

Say we wanted to know how average departure delays vary across airlines. Conceptually, this means taking the average of departure delays for each airline in the dataset separately. We can do this by combining group_by() and summarise().

# what if we wanted to know these statistics
## for each carrier in our dataset?
carrier_late <- flights %>%
  group_by(carrier) %>%
  summarise(avg_late = mean(dep_delay, na.rm = TRUE),
            most_late = max(dep_delay, na.rm = TRUE))


# make a plot
ggplot(carrier_late, aes(y = reorder(carrier, avg_late), x = avg_late)) +
  geom_col()
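To check by hand what group_by() and summarise() do together, it can help to run the same pipeline on a table small enough to read at a glance. The sketch below uses a hypothetical five-row table, toy_flights, with made-up delays. Only the column names mirror flights.

```r
library(dplyr)

# hypothetical toy data: two carriers, made-up delays in minutes
toy_flights <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(5, 15, NA, 30, 0)
)

# one row per carrier: statistics are computed within each group
toy_late <- toy_flights %>%
  group_by(carrier) %>%
  summarise(avg_late = mean(dep_delay, na.rm = TRUE),
            most_late = max(dep_delay, na.rm = TRUE))

toy_late
# AA: avg_late = 10, most_late = 15
# UA: avg_late = 15, most_late = 30 (the NA is dropped)
```

The result has one row per carrier rather than one row overall, which is the only difference from calling summarise() without group_by().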