Data Wrangling

In-class example

Here’s the code we’ll be using in class. Download it and store it with the rest of your materials for this course. If simply clicking doesn’t trigger download, you should right-click and select “save link as…”

wrangle-day1.R

Filtering

Often, we have a big dataset that covers lots of stuff (say, all flights coming out of NYC in 2013) but we’re only interested in a subset of those things (say, flights that arrived late over that time period). The filter() function is a way to subset operations that match some rule or set of rules (e.g., rule = “flights that arrived late”). We define these rules using logical operators.

Examples

Let’s load the libraries.

# libraries
library(tidyverse)
library(nycflights13)

Remember you can look at the data like this.

# look at the data
View(flights) # open data in viewer
?flights # read data documentation

Let’s look at flights from February.

# look at fights, but only from February
flights %>%
  filter(month == 2)
## # A tibble: 24,951 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     2     1      456            500        -4      652            648
##  2  2013     2     1      520            525        -5      816            820
##  3  2013     2     1      527            530        -3      837            829
##  4  2013     2     1      532            540        -8     1007           1017
##  5  2013     2     1      540            540         0      859            850
##  6  2013     2     1      552            600        -8      714            715
##  7  2013     2     1      552            600        -8      919            910
##  8  2013     2     1      552            600        -8      655            709
##  9  2013     2     1      553            600        -7      833            815
## 10  2013     2     1      553            600        -7      821            825
## # … with 24,941 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Let’s look at flights on Valentine’s Day.

# now let's look at flights on Valentine's Day
flights %>%
  filter(month == 2) %>%
  filter(day == 14)
## # A tibble: 956 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     2    14        7           2352        15      448            437
##  2  2013     2    14       59           2339        80      205            106
##  3  2013     2    14      454            500        -6      641            648
##  4  2013     2    14      510            515        -5      750            814
##  5  2013     2    14      531            530         1      828            831
##  6  2013     2    14      541            540         1      850            850
##  7  2013     2    14      542            545        -3     1014           1023
##  8  2013     2    14      551            600        -9      831            906
##  9  2013     2    14      552            600        -8      657            708
## 10  2013     2    14      553            600        -7      902            856
## # … with 946 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Let’s try the OR logical operator by looking at flights going to ATL or SFO.

# try one using text and the OR symbol
# look at fights going to ATL or SFO
flights %>%
  filter(dest == "ATL" | dest == "SFO")
## # A tibble: 30,546 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      554            600        -6      812            837
##  2  2013     1     1      558            600        -2      923            937
##  3  2013     1     1      600            600         0      837            825
##  4  2013     1     1      606            610        -4      837            845
##  5  2013     1     1      611            600        11      945            931
##  6  2013     1     1      615            615         0      833            842
##  7  2013     1     1      655            700        -5     1037           1045
##  8  2013     1     1      658            700        -2      944            939
##  9  2013     1     1      729            730        -1     1049           1115
## 10  2013     1     1      734            737        -3     1047           1113
## # … with 30,536 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Let’s look at flights between noon and 5pm.

# try one using greater than or less than
# look at flights departing between 12pm and 5pm
flights %>%
  filter(dep_time >= 1200) %>%
  filter(dep_time <= 1700)
## # A tibble: 99,136 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     1200           1200         0     1408           1356
##  2  2013     1     1     1202           1207        -5     1318           1314
##  3  2013     1     1     1202           1159         3     1645           1653
##  4  2013     1     1     1203           1205        -2     1501           1437
##  5  2013     1     1     1203           1200         3     1519           1545
##  6  2013     1     1     1204           1200         4     1500           1448
##  7  2013     1     1     1205           1200         5     1503           1505
##  8  2013     1     1     1206           1209        -3     1325           1328
##  9  2013     1     1     1208           1158        10     1540           1502
## 10  2013     1     1     1211           1215        -4     1423           1413
## # … with 99,126 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Let’s look at how many flights arrived late on christmas day.

## how many flights arrived LATE, on christmas day?
late_xmas = flights %>%
  filter(arr_time > sched_arr_time) %>%
  filter(month == 12, day == 25)

Leaders

library(juanr)
leader
## # A tibble: 17,686 × 16
##    country gwcode leader gender  year yr_office   age edu     mil_service combat
##    <chr>    <dbl> <chr>  <chr>  <dbl>     <dbl> <dbl> <fct>         <dbl>  <dbl>
##  1 USA          2 Grant  M       1869         1    47 Univer…           1      1
##  2 USA          2 Grant  M       1870         2    48 Univer…           1      1
##  3 USA          2 Grant  M       1871         3    49 Univer…           1      1
##  4 USA          2 Grant  M       1872         4    50 Univer…           1      1
##  5 USA          2 Grant  M       1873         5    51 Univer…           1      1
##  6 USA          2 Grant  M       1874         6    52 Univer…           1      1
##  7 USA          2 Grant  M       1875         7    53 Univer…           1      1
##  8 USA          2 Grant  M       1876         8    54 Univer…           1      1
##  9 USA          2 Grant  M       1877         9    55 Univer…           1      1
## 10 USA          2 Hayes  M       1877         1    55 Gradua…           1      1
## # … with 17,676 more rows, and 6 more variables: rebel <dbl>, yrs_exp <dbl>,
## #   phys_health <dbl>, mental_health <dbl>, will_force <dbl>,
## #   will_force_sd <dbl>
  1. A Vietnamese Emperor who, in his first year in office, was 11 years old. Famously depraved.
leader %>%
  # first year in office
  filter(yr_office == 1) %>%
  # age at that point
  filter(age == 11) %>%
  # vietnamese
  filter(country == "VNM")
## # A tibble: 1 × 16
##   country gwcode leader   gender  year yr_office   age edu    mil_service combat
##   <chr>    <dbl> <chr>    <chr>  <dbl>     <dbl> <dbl> <fct>        <dbl>  <dbl>
## 1 VNM        815 Thanh T… M       1889         1    11 Secon…           0      0
## # … with 6 more variables: rebel <dbl>, yrs_exp <dbl>, phys_health <dbl>,
## #   mental_health <dbl>, will_force <dbl>, will_force_sd <dbl>
  1. Leaders with graduate degrees who in 2015 reached their 16th year in power.
leader %>%
  filter(edu == "Graduate", yr_office == 16, year == 2015)
## # A tibble: 2 × 16
##   country gwcode leader    gender  year yr_office   age edu   mil_service combat
##   <chr>    <dbl> <chr>     <chr>  <dbl>     <dbl> <dbl> <fct>       <dbl>  <dbl>
## 1 RUS        365 Putin     M       2015        16    63 Grad…           0      0
## 2 SYR        652 Bashar a… M       2015        16    50 Grad…           1      0
## # … with 6 more variables: rebel <dbl>, yrs_exp <dbl>, phys_health <dbl>,
## #   mental_health <dbl>, will_force <dbl>, will_force_sd <dbl>
  1. The number of world leaders in the post-2000 period who have known physical or mental health issues.
leader %>%
  filter((year > 2000) & (phys_health == 1 | mental_health == 1))
## # A tibble: 103 × 16
##    country gwcode leader  gender  year yr_office   age edu    mil_service combat
##    <chr>    <dbl> <chr>   <chr>  <dbl>     <dbl> <dbl> <fct>        <dbl>  <dbl>
##  1 CAN         20 Chreti… M       2001         9    62 Unive…           0      0
##  2 CAN         20 Chreti… M       2002        10    63 Unive…           0      0
##  3 CAN         20 Chreti… M       2003        11    64 Unive…           0      0
##  4 SLV        349 Drnovs… M       2001         2    51 Gradu…           0      0
##  5 SLV        349 Drnovs… M       2002         3    52 Gradu…           0      0
##  6 BLR        370 Lukash… M       2001         8    47 Gradu…           1      0
##  7 BLR        370 Lukash… M       2002         9    48 Gradu…           1      0
##  8 BLR        370 Lukash… M       2003        10    49 Gradu…           1      0
##  9 BLR        370 Lukash… M       2004        11    50 Gradu…           1      0
## 10 BLR        370 Lukash… M       2005        12    51 Gradu…           1      0
## # … with 93 more rows, and 6 more variables: rebel <dbl>, yrs_exp <dbl>,
## #   phys_health <dbl>, mental_health <dbl>, will_force <dbl>,
## #   will_force_sd <dbl>

Mutating

Sometimes we want to create new variables. For example, we might want to combine or alter existing variables in our dataset. The mutate() function is one way of doing this.

Let’s convert arrival delay from minutes to hours.

## convert arrival_delay to hours
new_flights = flights %>%
  mutate(arr_delay_hrs = arr_delay/60)

If you look in the dataset you will see a new variable called arr_delay_hrs.

Let’s convert distance traveled from miles to thousands of miles.

## convert distance to thousands of miles
new_flights2 = flights %>%
  mutate(dist_miles = distance/1000)

Creating categorical variables

Sometimes we want to create more complicated variables. Here’s where case_when comes into play.

Let’s create a variable that tells us what season a flight took off in.

## create a new variable called season
## that tells me if flight departed
## in summer, winter, fall, or spring
new_flights = flights %>%
  mutate(seasons = case_when(month == 6 ~ "Summer",
                             month == 7 ~ "Summer",
                             month == 8 ~ "Summer",
                             month == 9 ~ "Fall",
                             month == 10 ~ "Fall",
                             month == 11 ~ "Fall",
                             month == 12 ~ "Winter",
                             month == 1 ~ "Winter",
                             month == 2 ~ "Winter",
                             month == 3 ~ "Spring",
                             month == 4 ~ "Spring",
                             month == 5 ~ "Spring"))

We can then plot the distribution of arrival delays by season, below.

# plot histogram of arrival delay
# separate it by season
ggplot(new_flights, aes(x = arr_delay, fill = seasons)) +
  geom_histogram() +
  facet_wrap(vars(seasons))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9430 rows containing non-finite values (stat_bin).

Let’s say we wanted to categorize flights by how late they are. See an example, below.

new_flights = flights %>%
  mutate(time_flight = case_when(arr_delay >= 120 ~ "very late",
                                 arr_delay > 0 & arr_delay < 120 ~ "a little late",
                                 arr_delay == 0 ~ "on time",
                                 arr_delay < 0 & arr_delay > -120 ~ "a little early",
                                 arr_delay <=-120 ~ "very early"))