Data wrangling I

POL51

Juan Tellez

jftellez@ucdavis.edu

UC Davis

October 3, 2024

Get ready

Boot up Posit Cloud
Download the script for today’s examples
- Schedule ➡️ Example ➡️ Today
Upload the script to Posit Cloud

Plan for today

Wrangling and pipes

Subsetting data

The (tricky!) programming objects

The new starting point

Before, I wrangled data and you plotted the finished product

First step of all your code was ggplot()

Now, you will wrangle the data

First step is now the data object

What is data-wrangling?

…the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics… Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data. – Wikipedia

Most of your time working with data will be spent wrangling it into a usable form for analysis

Pipes: connecting data to functions

change_africa = gapminder |>
    # only Africa
filter(continent == "Africa") |>
    # select down to the columns we need
select(country, year, lifeExp) |>
    # keep first and last year only
filter(year == min(year) | year == max(year)) |>
    # pivot wider so years on columns
pivot_wider(id_cols = country, names_from = year, values_from = lifeExp) |>
    # change from 52 to 07
mutate(difference = `2007` - `1952`) |>
    arrange(desc(difference))

You’ve seen these before…

What are pipes?

Pipes link data to functions
They look like this %>%, or |>
Definitely use keyboard shortcuts
- OSX: Cmd + Shift + M
- Windows: Ctrl + Shift + M

Why pipes?

With pipes: 😍

penguins %>%
    filter(species == "Adelie") %>%
    mutate(body_mass_kg = body_mass_g/1000) %>%
    select(body_mass_kg)

Without pipes: 🤢

select(mutate(filter(penguins, species == "Adelie"), body_mass_kg = body_mass_g/1000),
    body_mass_kg)

Both produce the same output, but pipes make code more legible

Making sense of pipes: “and then…”

change_africa = gapminder |>
    # only Africa
filter(continent == "Africa") |>
    # select down to the columns we need
select(country, year, lifeExp) |>
    # keep first and last year only
filter(year == min(year) | year == max(year)) |>
    # pivot wider so years on columns
pivot_wider(id_cols = country, names_from = year, values_from = lifeExp) |>
    # change from 52 to 07
mutate(difference = `2007` - `1952`) |>
    arrange(desc(difference))

You can read the pipe as if it said “and then”…

Take the data object gapminder, AND THEN
filter so continent is “Africa”, AND THEN
select so that…

Subsetting data and logical operators

Our first wrangling function: `filter()`

filter() subsets data objects based on rules

baby_subset <- babynames %>%
    filter(name == "Angel")

Subset babynames to only babies named Angel

Why filter?

Lots of real-world applications: finding flights, addresses, IDs, etc.

Sometimes we want to focus on a specific subset of data: the South, Latin America, etc.

Useful to deal with common problems: outliers, missing data, strange responses

The Earned Income Tax Credit (EITC)

Third largest welfare program in the US

Only people who meet certain criteria receive it

Effects of program and its design are hotly debated

Identifying beneficiaries

Imagine you are the IRS, and have data on all 360+ million Americans:

Sex	Race	Age	Income	Marital	Children
Female	Black	49	16509	Not married	6
Male	Black	59	19626	Not married	4
Female	White	79	18997	Not married	4
Male	Hispanic	23	9726	Married	0
Male	Hispanic	23	19517	Not married	10
Male	White	24	121214	Not married	4
Male	Hispanic	35	17973	Not married	1

How could use use these variables to identify what benefits they should receive?

Identifying beneficiaries

Say we wanted to identify people in the flat part of the blue line

Income	Marital	Children
16509.23	Not married	6
19625.74	Not married	4
18996.85	Not married	4
9725.74	Married	0
19517.36	Not married	10
121214.23	Not married	4
17973.36	Not married	1

Using `filter()`

To use filter(), we need to tell R which observations we want to include (or exclude) using rules

gap_africa = gapminder |>
    filter(region == "Africa")

Rule: return all observations from gapminder where the region variable is equal to “Africa”

Making the rules: logical operators

Rules filter data based on whether variables meet certain criteria
Rules rely on logical operators:
- Equal to, not equal to, less than, more than, included in, etc.
- Observations that meet the rule are returned; those that are not are dropped

The logical operators

Logical operators.
Operator	meaning
==	equal to
!=	not equal to
>	greater than
<	less than
>=	greater than or equal to
<=	less than or equal to
&	AND (both conditions true)
\|	OR (either condition is true)

Using `filter()`

Say we have some data on 🍎

Data on apples.
name	color	pounds	sweet
Fuji	red	2	TRUE
Gala	green	4	TRUE
Macintosh	green	8	FALSE
Granny Smith	red	3	FALSE

Apples

apples

# A tibble: 4 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Fuji         red        2 TRUE 
2 Gala         green      4 TRUE 
3 Macintosh    green      8 FALSE
4 Granny Smith red        3 FALSE

Note

The output reports how many rows and columns our dataset has (4 rows x 4 columns)

Green apples

apples |>
    filter(color == "green")

# A tibble: 2 × 4
  name      color pounds sweet
  <chr>     <chr>  <dbl> <lgl>
1 Gala      green      4 TRUE 
2 Macintosh green      8 FALSE

Notice words are in quotations!

Notice that the number of rows has decreased: 2 x 4

Green and unsweet apples

apples |>
    filter(color == "green") |>
    filter(sweet == FALSE)

# A tibble: 1 × 4
  name      color pounds sweet
  <chr>     <chr>  <dbl> <lgl>
1 Macintosh green      8 FALSE

Notice TRUE/FALSE are all-caps!

Apples that aren’t green

apples |>
    filter(color != "green")

# A tibble: 2 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Fuji         red        2 TRUE 
2 Granny Smith red        3 FALSE

The ! symbol negates: not equal to

At least 4 pounds but less than 6

apples |>
    filter(pounds >= 4, pounds < 6)

# A tibble: 1 × 4
  name  color pounds sweet
  <chr> <chr>  <dbl> <lgl>
1 Gala  green      4 TRUE

Notice: at least implies greater than or equal to

I could also split this up over multiple filter calls

apples |>
    filter(pounds >= 4) |>
    filter(pounds < 6)

# A tibble: 1 × 4
  name  color pounds sweet
  <chr> <chr>  <dbl> <lgl>
1 Gala  green      4 TRUE

Combinations: The OR (|) operator

“Observations where either this is true OR that is true”

Apples that are red OR green

apples |>
    filter(color == "red" | color == "green")

# A tibble: 4 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Fuji         red        2 TRUE 
2 Gala         green      4 TRUE 
3 Macintosh    green      8 FALSE
4 Granny Smith red        3 FALSE

Note

The | should be above your Return/Enter key

Combinations: the AND operator (&)

The & operator can be used to combine rules

Returns observations where both rules are true

“Apples that are red AND sweet or green AND sour”:

apples |>
    filter(color == "red" & sweet == TRUE | color == "green" & sweet == FALSE)

What can I filter for? the `distinct()` function

To subset your data, you need to know what values your variables can take; what are the continents in gapminder?

distinct() returns the unique values of a variable

gapminder |>
    distinct(continent)

# A tibble: 5 × 1
  continent
  <fct>    
1 Asia     
2 Europe   
3 Africa   
4 Americas 
5 Oceania

I can then use this information to filter my data:

gapminder |>
    filter(continent == "Americas")

Your turn: 👑 World leaders 👑

Using the leader dataset, identify:

A Vietnamese Emperor who, in his first year in office, was 11 years old.
Leaders with graduate degrees who in 2015 reached their 16th year in power.
A leader who held office for more than 20 years, participated in a rebellion, and has a willingness to use force score above 1.7.

10:00

Note

You can use ?leader to see the codebook. The acronym for Vietnam is “VNM”

Objects

The last step: creating objects

Step 1-2: the data, the pipe, the wrangling functions

apples |>
    filter(sweet == FALSE)

Step 3: store the subsetted data as a new object for later use

green_apples = apples |>
    filter(sweet == FALSE)

Objects

In programming, objects can be used to store all sorts of stuff for later use

data, functions, values

We create objects using = or <-

Like this:

new_object = stuff |>
    filter(year == 1999)

Or like this:

new_object <- stuff |>
    filter(year == 1999)

Naming objects

There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton

Recommend: keep it short, easy to type, informative, and use _ to separate words

# Good
gap_africa = gapminder |>
    filter(continent == "Africa")

# Bad
Countries_In_Africa_That_I_Want_To_Look_Up = gapminder |>
    filter(continent == "Africa")

I use the excellent Tidyverse syntax guide in my work

Failure to object

Without objects, your work washes away, like tears in the rain

Here, we store our data wrangling

green_apples = apples |>
    filter(sweet == FALSE)

green_apples

# A tibble: 2 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Macintosh    green      8 FALSE
2 Granny Smith red        3 FALSE

Here we didn’t store

apples |>
    filter(sweet == FALSE)

# A tibble: 2 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Macintosh    green      8 FALSE
2 Granny Smith red        3 FALSE

apples

# A tibble: 4 × 4
  name         color pounds sweet
  <chr>        <chr>  <dbl> <lgl>
1 Fuji         red        2 TRUE 
2 Gala         green      4 TRUE 
3 Macintosh    green      8 FALSE
4 Granny Smith red        3 FALSE

Notice the original apples remains unchanged!

The formula

Wrangle the data until you’re satisfied with the output:

apples |>
    filter(sweet == FALSE)

Store the output as a new object:

sour_apples = apples |>
    filter(sweet == FALSE)

Use the new object (e.g., plotting):

ggplot(sour_apples, aes(x = name, y = pounds)) + geom_col()

Challenge: 🗳️ The (unusual) American voter 🗳️

There’s a Twitter bot that randomly tweet profiles of real voters from the Cooperative Election Study:

Challenge: 🗳️ The (unusual) American voter 🗳️

A small sample of Americans
state	sex	age	educ	race	pid7	ideo5	religion	votechoice	hispanic	know_governor	conceal	prochoice	cleanair	wall	mandmin	aca	minwage	newsint
Washington	Male	57	2-year	White	Independent	Liberal	Agnostic	Joe Biden (Democrat)	No	Democrat	Oppose	Support	Support	Oppose	Support	Oppose	Favor	Most of the time
New Jersey	Female	70	High school graduate	White	Not very strong Republican	Moderate	Jewish	Donald J. Trump (Republican)	No	Democrat	Oppose	Support	Oppose	Support	Support	Oppose	Favor	Some of the time
South Carolina	Male	63	4-year	White	Independent	Moderate	Protestant	Donald J. Trump (Republican)	No	Republican	Support	Oppose	Oppose	Support	Support	Support	Oppose	Most of the time
Florida	Male	65	High school graduate	White	Not very strong Democrat	Moderate	Protestant	Joe Biden (Democrat)	No	Republican	Support	Support	Oppose	Oppose	Support	Oppose	Oppose	Most of the time
Ohio	Male	26	High school graduate	White	Lean Republican	Moderate	Protestant	NA	No	Not sure	Oppose	Oppose	Support	Support	Support	Oppose	Favor	Most of the time

🗳️ The (unusual) American voter 🗳️

Using bot:

Identify the most unusual subgroup of voters you can think of
Constraint: need at least five voters in your subgroup
Store your unusual subgroup as an object

Note

Remember you can use ?bot to look at the codebook

10:00

Data wrangling I

Get ready

Plan for today

The new starting point

What is data-wrangling?

Pipes: connecting data to functions

What are pipes?

Why pipes?

Making sense of pipes: “and then…”

Subsetting data and logical operators

Our first wrangling function: filter()

Why filter?

Why filter?

The Earned Income Tax Credit (EITC)

Identifying beneficiaries

Identifying beneficiaries

Using filter()

Making the rules: logical operators

The logical operators

Using filter()

Apples

Green apples

Green and unsweet apples

Apples that aren’t green

At least 4 pounds but less than 6

Combinations: The OR (|) operator

Combinations: the AND operator (&)

What can I filter for? the distinct() function

Your turn: 👑 World leaders 👑

Objects

The last step: creating objects

Objects

Naming objects

Failure to object

The formula

Challenge: 🗳️ The (unusual) American voter 🗳️

Challenge: 🗳️ The (unusual) American voter 🗳️

🗳️ The (unusual) American voter 🗳️

Our first wrangling function: `filter()`

Using `filter()`

Using `filter()`

What can I filter for? the `distinct()` function