Data wrangling I

POL51

Juan F. Tellez

University of California, Davis

December 5, 2023

Plan for today

Wrangling and pipes

Subsetting data

The (tricky!) programming objects

The new starting point

Before, I wrangled data and you plotted the finished product

First step of all your code was ggplot()

Now, you will wrangle the data

First step is now the data object

What is data-wrangling?

…the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics… Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data. – Wikipedia

Most of your time working with data will be spent wrangling it into a usable form for analysis

Pipes: connecting data to functions

directors_profit = movies |> 
  # only look at horror movies
  filter(genre1 == "Horror" | genre2 == "Horror" | genre3 == "Horror") |> 
  # calculate profit
  mutate(profit = gross - budget)


You’ve seen these before…

What are pipes?

  • Pipes link data to functions

  • They look like this %>%, or |>

  • Definitely use keyboard shortcuts

    • OSX: Cmd + Shift + M
    • Windows: Ctrl + Shift + M

Why pipes?

With pipes: 😍

penguins %>% 
  filter(species == "Adelie") %>% 
  mutate(body_mass_kg = body_mass_g / 1000) %>% 
  select(body_mass_kg)

Without pipes: 🤢

select(mutate(filter(penguins, species == "Adelie"), body_mass_kg = body_mass_g / 1000), body_mass_kg)


Both produce the same output, but pipes make code more legible

Making sense of pipes: “and then…”

directors_profit = movies %>% 
  # only look at horror movies
  filter(genre1 == "Horror" | genre2 == "Horror" | genre3 == "Horror") %>% 
  # calculate profit
  mutate(profit = gross - budget)


You can read the pipe as if it said “and then”…

  1. Take the data object movies, AND THEN

  2. filter so genre1, genre2, or genre3 equal HORROR, AND THEN

  3. mutate so that…

Subsetting data and logical operators

Our first wrangling function: filter()

filter() subsets data objects based on rules

baby_subset <- babynames %>%
  filter(name == "Angel")

Subset babynames to only babies named Angel

Why filter?

Why filter?

Lots of real-world applications: finding flights, addresses, IDs, etc.

Sometimes we want to focus on a specific subset of data: the South, Latin America, etc.

Useful to deal with common problems: outliers, missing data, strange responses

The Earned Income Tax Credit (EITC)

Third largest welfare program in the US

Only people who meet certain criteria receive it

Effects of program and its design are hotly debated

Identifying beneficiaries

Imagine you are the IRS, and have data on all 360+ million Americans:

Sex Race Age Income Marital Children
Female White 69 24686 Not married 2
Female Hispanic 55 73867 Not married 0
Female White 83 63949 Not married 4
Female White 29 12396 Not married 1
Female White 23 15868 Not married 3
Male Hispanic 77 23084 Not married 2
Male Hispanic 77 65031 Not married 2

How could use use these variables to identify what benefits they should receive?

Identifying beneficiaries

Say we wanted to identify people in the flat part of the blue line

Income Marital Children
24686.32 Not married 2
73866.74 Not married 0
63949.10 Not married 4
12395.74 Not married 1
15867.56 Not married 3
23084.29 Not married 2
65031.14 Not married 2

Using filter()

To use filter(), we need to tell R which observations we want to include (or exclude) using rules


gap_africa = gapminder |> 
  filter(region == "Africa")

Rule: return all observations from gapminder where the region variable is equal to “Africa”

Making the rules: logical operators

  • Rules filter data based on whether variables meet certain criteria

  • Rules rely on logical operators:

    • Equal to, not equal to, less than, more than, included in, etc.

    • Observations that meet the rule are returned; those that are not are dropped

The logical operators

Operator meaning
== equal to
!= not equal to
> greater than
< less than
>= greater than or equal to
<= less than or equal to
& AND (both conditions true)
| OR (either condition is true)
%in% IN (in the set of)

Using filter()

Say we have some data on 🍎

name color pounds sweet
Fuji red 2 TRUE
Gala green 4 TRUE
Macintosh green 8 FALSE
Granny Smith red 3 FALSE

Apples

apples
# A tibble: 4 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Fuji         red        2 TRUE 
2 Gala         green      4 TRUE 
3 Macintosh    green      8 FALSE
4 Granny Smith red        3 FALSE

Note

The output reports how many rows and columns our dataset has (4 rows x 4 columns)

Green apples

apples |> 
  filter(color == "green")
# A tibble: 2 × 4
  name      color pounds sweet
  <chr>     <chr>  <dbl> <lgl>
1 Gala      green      4 TRUE 
2 Macintosh green      8 FALSE

Notice words are in quotations!

Notice that the number of rows has decreased: 2 x 4

Green and unsweet apples

apples |> 
  filter(color == "green") |> 
  filter(sweet == FALSE)
# A tibble: 1 × 4
  name      color pounds sweet
  <chr>     <chr>  <dbl> <lgl>
1 Macintosh green      8 FALSE

Notice TRUE/FALSE are all-caps!

Apples that aren’t green

apples |> 
  filter(color != "green")
# A tibble: 2 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Fuji         red        2 TRUE 
2 Granny Smith red        3 FALSE

The ! symbol negates: not equal to

At least 4 pounds but less than 6

apples |> 
  filter(pounds >= 4, pounds < 6)
# A tibble: 1 × 4
  name  color pounds sweet
  <chr> <chr>  <dbl> <lgl>
1 Gala  green      4 TRUE 

Notice: at least implies greater than or equal to

I could also split this up over multiple filter calls

apples |> 
  filter(pounds >= 4) |> 
  filter(pounds < 6)
# A tibble: 1 × 4
  name  color pounds sweet
  <chr> <chr>  <dbl> <lgl>
1 Gala  green      4 TRUE 

Combinations: The OR (|) operator

“Observations where either this is true OR that is true”

Apples that are red OR green

apples |> 
  filter(color == "red" | color == "green")
# A tibble: 4 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Fuji         red        2 TRUE 
2 Gala         green      4 TRUE 
3 Macintosh    green      8 FALSE
4 Granny Smith red        3 FALSE

Note

The | should be above your Return/Enter key

Combinations: the AND operator (&)

The & operator can be used to combine rules

Returns observations where both rules are true

“Apples that are red AND sweet or green AND sour”:

apples |> 
  filter(color == "red" & sweet == TRUE | color == "green" & sweet == FALSE)

%in%

The %in% operator is super powerful

It returns observations that belong to a set

Say I wanted to look at just South American countries, normally:

gapminder |> 
  filter(country == "Argentina" | country == "Brazil" | country == "Chile")

Brutally repetitive

%in%

Make a list of countries and return observations that match any of them

keep = c("Argentina", "Brazil", "Chile")

gapminder |> filter(country %in% keep)

Note

To make a “list” of items (a vector), use c()

Your turn: 👑 World leaders 👑

Using the leader dataset, identify:

  1. A Vietnamese Emperor who, in his first year in office, was 11 years old. Famously depraved.

  2. Leaders with graduate degrees who in 2015 reached their 16th year in power.

  3. A leader who held office for more than 20 years, participated in a rebellion, and has a willingness to use force score above 1.7.

10:00

Note

You can use ?leader to see the codebook. The acronym for Vietnam is “VNM”

Objects

The last step: creating objects

Step 1-2: the data, the pipe, the wrangling functions

apples |> 
  filter(sweet == FALSE)


Step 3: store the subsetted data as a new object for later use

green_apples = apples |>
  filter(sweet == FALSE)

Objects

In programming, objects can be used to store all sorts of stuff for later use

data, functions, values

We create objects using = or <-

Like this:

new_object = stuff |> filter(year == 1999)

Or like this:

new_object <- stuff |> filter(year == 1999)

Naming objects

There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton

Recommend: keep it short, easy to type, informative, and use _ to separate words

# Good
gap_africa = gapminder |> 
  filter(continent == "Africa")

# Bad
Countries_In_Africa_That_I_Want_To_Look_Up = gapminder |> 
  filter(continent == "Africa")

I use the excellent Tidyverse syntax guide in my work

Failure to object

Without objects, your work washes away, like tears in the rain

Here, we store our data wrangling

green_apples = apples |>
  filter(sweet == FALSE)

green_apples
# A tibble: 2 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Macintosh    green      8 FALSE
2 Granny Smith red        3 FALSE

Here we didn’t store

apples |>
  filter(sweet == FALSE)
# A tibble: 2 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Macintosh    green      8 FALSE
2 Granny Smith red        3 FALSE
apples
# A tibble: 4 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Fuji         red        2 TRUE 
2 Gala         green      4 TRUE 
3 Macintosh    green      8 FALSE
4 Granny Smith red        3 FALSE

Notice the original apples remains unchanged!

The formula

Wrangle the data until you’re satisfied with the output:

apples |> 
  filter(sweet == FALSE)

Store the output as a new object:

sour_apples = apples |> 
  filter(sweet == FALSE)

Use the new object (e.g., plotting):

ggplot(sour_apples, aes(x = name, y = pounds)) + geom_col()

Challenge: 🗳️ The (unusual) American voter 🗳️

There’s a Twitter bot that randomly tweet profiles of real voters from the Cooperative Election Study:

Challenge: 🗳️ The (unusual) American voter 🗳️

A small sample of Americans
state sex age educ race pid7 ideo5 religion votechoice hispanic know_governor conceal prochoice cleanair wall mandmin aca minwage newsint
Arkansas Male 59 Some college White Strong Democrat Liberal Agnostic Joe Biden (Democrat) No Republican Oppose Support Support Oppose Support Oppose Favor Most of the time
New Jersey Female 41 High school graduate White Not very strong Republican Conservative Atheist Donald J. Trump (Republican) No Democrat Oppose Support Support Support Support Oppose Favor Only now and then
Missouri Female 52 High school graduate White Not very strong Republican Moderate Protestant NA No Republican Support Oppose Support Support Support Support Favor Hardly at all
Connecticut Male 49 4-year White Not very strong Republican Conservative Roman Catholic Donald J. Trump (Republican) No Democrat Oppose Oppose Oppose Oppose Support Support Oppose Most of the time
Illinois Female 73 Some college Black Strong Democrat Moderate Protestant Joe Biden (Democrat) No Democrat Oppose Support Support Support Support Oppose Favor Most of the time

🗳️ The (unusual) American voter 🗳️

Using bot:

  1. Identify the most unusual subgroup of voters you can think of

  2. Constraint: need at least five voters in your subgroup

  3. Store your unusual subgroup as an object

Note

Remember you can use ?bot to look at the codebook

10:00

Live code objects example