Data visualization I

POL51

Juan F. Tellez

University of California, Davis

December 5, 2023

Plan for today

  • A quick tour of RStudio
  • Why visualize data?
  • The grammar of graphics

A quick tour of RStudio

Tortured metaphor 1: R as a car

Tortured metaphor 2: RStudio as a kitchen

Tortured metaphor 3: RStudio as a phone

Packages are where most of our functions and data live

Installing packages

Check out my guide

Or type this into the console and hit return/enter (note the quotation marks!):

install.packages("Name of the package")
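
As a concrete instance of the pattern above, here is a minimal sketch that installs and then loads one package. The choice of ggplot2 is an assumption based on the functions used later in these slides; swap in whichever package you actually need.

```r
# Install ggplot2 only if it is not already on this computer.
# (ggplot2 is an assumed example; the slide does not name a specific package.)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")  # the quotation marks are required here
}

# Load the package for the current session.
# Quotation marks are optional inside library().
library(ggplot2)
```

Installing happens once per computer; loading with library() has to be repeated in every new R session.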

Why visualize data?

W.E.B. Du Bois

(1868 - 1963)

  • American sociologist

  • historian

  • civil rights advocate

  • Data visualization specialist?

These are hand-drawn

Why visualize data?

  • For better or worse, data carries weight

  • Visualizing data is an effective way to convince, argue, tell stories (and mislead)

  • Graphs, maps, diagrams and other visuals are everywhere

Dataviz to inform

Dataviz to mislead

Inform? or mislead?

Making graphs in R

The Gapminder dataset

country      continent  year  lifeExp  pop       gdpPercap
Afghanistan  Asia       1952  29       8425333   779
Afghanistan  Asia       1957  30       9240934   821
Afghanistan  Asia       1962  32       10267083  853
Afghanistan  Asia       1967  34       11537966  836
Afghanistan  Asia       1972  36       13079460  740

Rows are observations

country      continent  year  lifeExp  pop       gdpPercap
Afghanistan  Asia       1952  29       8425333   779
Afghanistan  Asia       1957  30       9240934   821
Afghanistan  Asia       1962  32       10267083  853
Afghanistan  Asia       1967  34       11537966  836
Afghanistan  Asia       1972  36       13079460  740

In a dataset, rows are observations

The data we observe for Afghanistan in the year 1952

Rows are observations

id  age  degree       race   sex
1   47   Bachelor     White  Male
2   61   High School  White  Male
3   72   Bachelor     White  Male
4   43   High School  White  Female
5   55   Graduate     White  Female

In survey data, an observation is typically a person who took the survey (a respondent)

Columns are variables

country      continent  year  lifeExp  pop       gdpPercap
Afghanistan  Asia       1952  29       8425333   779
Afghanistan  Asia       1957  30       9240934   821
Afghanistan  Asia       1962  32       10267083  853
Afghanistan  Asia       1967  34       11537966  836
Afghanistan  Asia       1972  36       13079460  740

In a dataset, columns are variables

Life expectancy and GDP per capita are some of the variables in our data
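
To make the rows-versus-columns idea concrete, the sketch below rebuilds the five rows shown on the slide as a small data frame and pulls out one observation and one variable. The object name gap_sample is hypothetical and used only for illustration.

```r
# Rebuild the five rows shown on the slide.
# gap_sample is a made-up name for this illustration.
gap_sample <- data.frame(
  country   = rep("Afghanistan", 5),
  continent = rep("Asia", 5),
  year      = c(1952, 1957, 1962, 1967, 1972),
  lifeExp   = c(29, 30, 32, 34, 36),
  pop       = c(8425333, 9240934, 10267083, 11537966, 13079460),
  gdpPercap = c(779, 821, 853, 836, 740)
)

nrow(gap_sample)     # how many observations (rows)
names(gap_sample)    # the variables (columns)
gap_sample[1, ]      # one observation: Afghanistan in 1952
gap_sample$lifeExp   # one variable: life expectancy in every row
```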

The final graph

The grammar of graphics

Graphs have an internal logic, or grammar, that connects data to visuals

Data = variables in a dataset

Aesthetic = visual property of a graph (position, shape, color, etc.)

Geometry = representation of an aesthetic (point, line, text, etc.)

Mapping data to aesthetics

Data             Aesthetic          Geometry
GDP per capita   Position (x-axis)  Point
Life expectancy  Position (y-axis)  Point
Continent        Color              Point
Population       Size               Point

  1. Take the data,

  2. map it onto an aesthetic,

  3. and visualize it with a geometry

In R

Data       aes()  geom_
gdpPercap  x      geom_point()
lifeExp    y      geom_point()
continent  color  geom_point()
pop        size   geom_point()

Use the variable names exactly as they appear in the data, and type the aesthetic and function names exactly as R spells them (R is case-sensitive)

ggplot(): our first function 😢

ggplot()

ggplot: specify the data

ggplot(data = gap_07)
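
The slides never show where gap_07 comes from. A plausible sketch, assuming it is the 2007 slice of the gapminder data frame from the gapminder package (both the package and the filtering step are assumptions, not something the slides confirm):

```r
# Assumption: the Gapminder data comes from the gapminder package.
library(gapminder)

# Assumption: gap_07 keeps only the rows observed in 2007,
# which matches the "2007" in the plot title used later on.
gap_07 <- subset(gapminder, year == 2007)

head(gap_07)  # peek at the first few rows
```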

Use aes() to map variables to aesthetics

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp))

Add geometries and layers using +

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp)) + geom_point()

Mapping population to size in aes()

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop)) + 
  geom_point()

Mapping continent to color in aes()

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) + 
  geom_point()

Other layers: add the missing titles with labs()

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) + 
  geom_point() +
  labs(x = "GDP per capita", y = "Life expectancy",
       title = "Global wealth and health in 2007", size = "Population",
       color = "")

Notice that text is placed within quotation marks!

Other layers: add a theme

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) + 
  geom_point() +
  labs(x = "GDP per capita", y = "Life expectancy",
       title = "Global wealth and health in 2007") +
  theme_bw()

There are many more themes; here are a few
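
A short sketch of swapping themes: store the plot once, then add a different built-in ggplot2 theme each time. The object name p, and the definition of gap_07 as the 2007 rows of the gapminder package data, are assumptions made so the example runs on its own.

```r
library(ggplot2)
library(gapminder)  # assumed source of the data

# Assumption: gap_07 is the 2007 slice of the Gapminder data.
gap_07 <- subset(gapminder, year == 2007)

# Store the plot in an object (p is a made-up name) so it can be reused.
p <- ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp,
                               size = pop, color = continent)) +
  geom_point()

# Each line draws the same plot with a different built-in theme.
p + theme_minimal()
p + theme_classic()
p + theme_dark()
```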

The final formula

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, 
                          size = pop, color = continent)) +
  geom_point() +
  labs(x = "GDP per capita", y = "Life expectancy",
       title = "Global wealth and health in 2007") +
  theme_bw()

  1. Tell ggplot() the data we want to plot

  2. Map all variables onto aesthetics within aes()

  3. Add layers like geom_point() and theme_bw() using +

What’s that country way out on the bottom right?

🚨 Your turn: try labelling the points 🚨

  1. Add labels to each point by mapping country names onto the label aesthetic within aes()

  2. Add a geom_text() layer to your plot to draw the names

The basic plot

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop, 
                          color = continent)) + 
  geom_point()

Map country names to label aesthetic

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop, 
                          color = continent, label = country)) + 
  geom_point()

Plot the labels

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop, 
                          color = continent, label = country)) + 
  geom_point() + 
  geom_text() 
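
Because size and color are mapped inside ggplot(), geom_text() inherits them too, so the labels are also sized by population and colored by continent. One possible variant, not the version from the slides, keeps size and color on the points only and lets geom_text() skip labels that would overlap. The name p_labels and the definition of gap_07 are assumptions for illustration.

```r
library(ggplot2)
library(gapminder)  # assumed source of the data

# Assumption: gap_07 is the 2007 slice of the Gapminder data.
gap_07 <- subset(gapminder, year == 2007)

# x, y and label are shared by both layers;
# size and color are mapped only inside geom_point().
p_labels <- ggplot(data = gap_07,
                   aes(x = gdpPercap, y = lifeExp, label = country)) +
  geom_point(aes(size = pop, color = continent)) +
  geom_text(check_overlap = TRUE)  # drop labels that would overlap earlier ones

p_labels
```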

What did we do?

Data       Aesthetic  Geometry
gdpPercap  x          geom_point()
lifeExp    y          geom_point()
continent  color      geom_point()
pop        size       geom_point()
country    label      geom_text()

Take your data, map it onto an aesthetic, represent with a geometry

🇺🇸 The presidents 🇺🇸

Sample of presidential elections

year  winner             win_party  ec_pct  popular_pct  two_term
1824  John Quincy Adams  D.-R.      0.32    0.31         FALSE
1828  Andrew Jackson     Dem.       0.68    0.56         TRUE
1832  Andrew Jackson     Dem.       0.77    0.55         TRUE
1836  Martin Van Buren   Dem.       0.58    0.51         FALSE

🚨 Your turn 🚨

  1. Make a plot of presidential election results using the elections_historic dataset

  2. Put the % of the popular vote on the x-axis (popular_pct) and the % of the electoral college vote on the y-axis (ec_pct)

  3. Map the winner’s party (win_party) to the color aesthetic and whether or not the president served two terms (two_term) to shape, then add labels to each point (use winner_label)

US Presidents

What did we do?

Data          Aesthetic  Geometry
popular_pct   x          geom_point()
ec_pct        y          geom_point()
win_party     color      geom_point()
two_term      shape      geom_point()
winner_label  label      geom_text()

Take your data, map it onto an aesthetic, represent with a geometry
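
One way the mapping table above could translate into code is sketched below. It assumes elections_historic is the dataset shipped with the socviz package; the slides do not say where the data comes from, and pres_plot is a made-up object name.

```r
library(ggplot2)
library(socviz)  # assumption: elections_historic is loaded from this package

# Map each variable to the aesthetic listed in the table,
# then draw the points and the labels as separate layers.
pres_plot <- ggplot(data = elections_historic,
                    aes(x = popular_pct, y = ec_pct,
                        color = win_party, shape = two_term,
                        label = winner_label)) +
  geom_point() +
  geom_text()

pres_plot
```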