Summarizing Data
In-class example
Here’s the code we’ll be using in class. Download it and store it with the rest of your materials for this course. If simply clicking doesn’t trigger download, you should right-click and select “save link as…”
Summarize
Let’s load the libraries.
# libraries
library(tidyverse)
library(nycflights13)
library(fivethirtyeight)
Say we want to take the average of a variable in our dataset. summarize()
can help us do that.
Say we wanted to know how late in departure is the average flight in our dataset and what’s the latest a flight has ever been?
## on average, how late are flights in departing?
flights %>%
summarise(avg_late = mean(dep_delay, na.rm = TRUE),
most_late = max(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 2
## avg_late most_late
## <dbl> <dbl>
## 1 12.6 1301
Not the na.rm = TRUE
above and what happens if you remove it. The problem is there are missing values (NA
) in our data, and R can’t take the average of a bunch of numbers where some are missing. na.rm = TRUE
tells R to ignore those missing numbers and use only the complete observations.
Summarize + group_by()
Say we wanted to know how average departure delays vary across airlines. Conceptually, this means taking the average of departure delays for each airline in the dataset separately. We can do this by combining group_by()
and summarise()
.
# what if we wanted to know these statistics
## for each month in our dataset?
carrier_late = flights %>%
group_by(carrier) %>%
summarise(avg_late = mean(dep_delay, na.rm = TRUE),
most_late = max(dep_delay, na.rm = TRUE))
# make a plot
ggplot(carrier_late, aes(x = carrier, y = avg_late)) +
geom_col() +
coord_flip()
The Bob Ross example
Happy tree?
bob_ross %>%
summarise(prop_tree = mean(tree, na.rm = TRUE))
## # A tibble: 1 × 1
## prop_tree
## <dbl>
## 1 0.896
Clouds over time?
bob_clouds = bob_ross %>%
group_by(season) %>%
summarise(prop_clouds = mean(clouds, na.rm = TRUE))
ggplot(bob_clouds, aes(x = season, y = prop_clouds)) + geom_line()
snowy mountain?
bob_ross %>%
filter(mountain == 1) %>%
summarise(snowiness = mean(snowy_mountain, na.rm = TRUE))
## # A tibble: 1 × 1
## snowiness
## <dbl>
## 1 0.681
bob_ross %>%
group_by(mountain) %>%
summarise(snowiness = mean(snowy_mountain, na.rm = TRUE))
## # A tibble: 2 × 2
## mountain snowiness
## <int> <dbl>
## 1 0 0
## 2 1 0.681
Steve ross?
bob_ross %>%
group_by(steve_ross) %>%
summarise(lake_chance = mean(lake, na.rm = TRUE))
## # A tibble: 2 × 2
## steve_ross lake_chance
## <int> <dbl>
## 1 0 0.339
## 2 1 0.909
The flying etiquette example
Middle arm rest?
middle_arm_rests = flying %>%
group_by(two_arm_rests) %>%
tally() %>%
mutate(percent = n/sum(n))
ggplot(middle_arm_rests, aes(x = percent, y = two_arm_rests)) +
geom_col()
Unruly children?
nasty_kids = flying %>%
group_by(children_under_18, unruly_child) %>%
tally() %>%
mutate(p_unruly = n/sum(n))
ggplot(nasty_kids, aes(x = unruly_child, y = p_unruly, fill = children_under_18)) + geom_col(position = "dodge")