# libraries
library(tidyverse)
library(nycflights13)
library(fivethirtyeight)
library(juanr)
Relationships
In-class example
Here’s the code we’ll be using in class. Download it and store it with the rest of your materials for this course. If simply clicking doesn’t trigger download, you should right-click and select “save link as…”
Summarize
Let’s load the libraries.
Class: trade
IGOS in 1960:
trade |>
  filter(year == 1960) |>
  summarise(med_igos = median(sum_igos))
# A tibble: 1 × 1
  med_igos
     <dbl>
1       33
IGOS in 2010:
trade |>
  filter(year == 2010) |>
  summarise(med_igos = median(sum_igos))
# A tibble: 1 × 1
  med_igos
     <dbl>
1       66
At least one sea border:
trade |>
  filter(year == 2010, sea_borders >= 1) |>
  summarise(mean_exports = mean(exports, na.rm = TRUE))
# A tibble: 1 × 1
  mean_exports
         <dbl>
1      105389.
No sea border:
trade |>
  filter(year == 2010, sea_borders == 0) |>
  summarise(mean_exports = mean(exports, na.rm = TRUE))
Most exports:
trade |>
  filter(year == 2012) |>
  filter(exports == max(exports, na.rm = TRUE))
# A tibble: 1 × 10
  country  year  imports  exports     gdp         pop land_borders sea_borders
  <chr>   <dbl>    <dbl>    <dbl>   <dbl>       <dbl>        <dbl>       <dbl>
1 China    2012 2331123. 2494240. 1.79e13 1310926531.           14           4
# ℹ 2 more variables: min_cap_dist <dbl>, sum_igos <dbl>
Class: feeling thermometer
Attitudes towards the police, comparing Democrats and Republicans:
therm |>
  filter(party_id %in% c("Democrat", "Republican")) |>
  group_by(party_id) |>
  summarise(ft_police = mean(ft_police, na.rm = TRUE))
# A tibble: 2 × 2
  party_id   ft_police
  <fct>          <dbl>
1 Democrat        67.8
2 Republican      87.6
NYC Flights examples
Say we wanted to know how late the average flight in our dataset departs, and what the longest departure delay on record is.
## on average, how late are flights in departing?
flights %>%
  summarise(avg_late = mean(dep_delay, na.rm = TRUE),
            most_late = max(dep_delay, na.rm = TRUE))
# A tibble: 1 × 2
  avg_late most_late
     <dbl>     <dbl>
1     12.6      1301
Note the na.rm = TRUE above, and see what happens if you remove it. The problem is that there are missing values (NA) in our data, and R can’t take the average of a set of numbers when some of them are missing. na.rm = TRUE tells R to ignore those missing values and use only the complete observations.
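To see this behaviour in isolation, here is a minimal sketch in base R. The `delays` vector is made-up illustration data, not values taken from `flights`.

```r
# hypothetical departure delays in minutes, with one missing value
delays <- c(5, NA, 20, -3)

# without na.rm, a single NA makes the whole result NA
with_na <- mean(delays)
with_na
# NA

# with na.rm = TRUE, R drops the NA and averages the remaining three values
without_na <- mean(delays, na.rm = TRUE)
without_na
# (5 + 20 - 3) / 3 = 7.333333
```

The same applies to max(), sum(), median(), and most other summary functions: by default, one missing value is enough to make the result missing.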
Summarize + group_by()
Say we wanted to know how average departure delays vary across airlines. Conceptually, this means taking the average of departure delays for each airline in the dataset separately. We can do this by combining group_by() and summarise().
# what if we wanted to know these statistics
## for each airline (carrier) in our dataset?
carrier_late = flights %>%
  group_by(carrier) %>%
  summarise(avg_late = mean(dep_delay, na.rm = TRUE),
            most_late = max(dep_delay, na.rm = TRUE))
# make a plot
ggplot(carrier_late, aes(y = reorder(carrier, avg_late), x = avg_late)) +
  geom_col()
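If it is hard to picture what group_by() and summarise() do together, it can help to run the same pattern on a table small enough to check by hand. The sketch below uses a made-up five-row tibble called `toy` (hypothetical data, not from `flights`) and assumes only that dplyr is installed.

```r
library(dplyr)

# hypothetical toy data: three flights for carrier "AA", two for "UA"
toy <- tibble(
  carrier   = c("AA", "AA", "AA", "UA", "UA"),
  dep_delay = c(10, NA, 20, 0, 6)
)

# one row per carrier; each summary is computed within that carrier's rows only
toy_late <- toy %>%
  group_by(carrier) %>%
  summarise(avg_late = mean(dep_delay, na.rm = TRUE),
            most_late = max(dep_delay, na.rm = TRUE))

toy_late
# AA: avg_late = (10 + 20) / 2 = 15, most_late = 20
# UA: avg_late = (0 + 6) / 2  = 3,  most_late = 6
```

The result has one row per group, which is why `carrier_late` above ends up with one row per airline.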