Data visualization I

POL51

Juan Tellez

UC Davis

September 30, 2024

Plan for today

  • Why visualize data?

  • Telling The Truth™️ with data

  • The grammar of graphics (our first graph)

Why visualize data?

W.E.B. Du Bois

(1868 - 1963)

  • American sociologist

  • historian

  • civil rights advocate

  • Data visualization specialist?

These are hand-drawn

Why visualize data?

  • Data carries weight in our society

  • Visualizing data is an effective way to convey information, convince, argue

  • Visualization can be used to tell The Truth™️ (or not)

How can we tell The Truth™️ with data?

Not telling The Truth™️

What’s not true here?

Selectively presenting data

Selectively presenting data is one way of not telling The Truth™️

Summarized data can hide important details

Averages (left) are useful, but can be misleading

Raw data (right) can be more informative
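A minimal sketch of the averages-versus-raw-data contrast in R, assuming the {ggplot2} and {gapminder} packages are installed. It uses the Gapminder data introduced later in these slides, not the data behind the figures above; the object names (gap_07, continent_means, p_means, p_raw) are illustrative.

```r
library(ggplot2)
library(gapminder)

# Assumption: keep only 2007, so there is one row per country
gap_07 <- subset(gapminder, year == 2007)

# Summary: one average life expectancy per continent
continent_means <- aggregate(lifeExp ~ continent, data = gap_07, FUN = mean)

# Averages only: one bar per continent, no sense of the spread within it
p_means <- ggplot(data = continent_means, aes(x = continent, y = lifeExp)) +
  geom_col()

# Raw data: one point per country, so within-continent variation is visible
p_raw <- ggplot(data = gap_07, aes(x = continent, y = lifeExp)) +
  geom_jitter(width = 0.2, height = 0)
```

Printing p_means and p_raw side by side shows what the averages leave out: how much countries within the same continent differ from one another.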

Lying with the Y-axis

  • Lots of shenanigans happen with the Y-axis, especially when it doesn’t start at zero, which exaggerates differences

Is a y-axis that excludes zero misleading?

  • When I was in your shoes, there was a panic about oversupply of lawyers

Is a y-axis that excludes zero misleading?

  • A Y-axis that starts at zero can give us context (and relief) about the magnitude of the change

Should the y-axis always start at zero?

  • “graphs that don’t go to zero are a thought crime” (Fox, 2014)

  • is this necessarily true though?

Counterpoint: both of these graphs contain useful information

Which graph is more informative?

  • Case for right: an average life expectancy of zero is not plausible
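To experiment with this choice yourself, here is a hedged sketch assuming the {ggplot2} and {gapminder} packages. It is not the code behind the graphs above; it plots the unweighted average life expectancy across countries in the Gapminder data, once with the default axis and once with the axis forced to include zero using expand_limits().

```r
library(ggplot2)
library(gapminder)

# Average life expectancy across all countries, one value per year
avg_life <- aggregate(lifeExp ~ year, data = gapminder, FUN = mean)

# Default axis: ggplot2 fits the y-axis to the range of the data
p_zoomed <- ggplot(data = avg_life, aes(x = year, y = lifeExp)) +
  geom_line()

# Same plot, but the y-axis is stretched to include zero
p_zero <- p_zoomed + expand_limits(y = 0)
```

Neither version is automatically the honest one; comparing them is a quick way to see which aspects of the data each axis choice highlights.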

Should the y-axis always start at zero?

  • Critics argue that in context, recent warming is not so dramatic

Should the y-axis always start at zero?

But is zooming out useful here? Is “temperature at which dinosaurs went extinct” valid context for us now?

How do we tell The Truth™️?

There’s no one-size-fits-all answer

All visuals highlight some aspects of the data, and obscure others

But some visuals are more truthful than others; beware!

The grammar of graphics

The Gapminder dataset

{gapminder} dataset
country      continent  year  lifeExp  pop       gdpPercap
Afghanistan  Asia       1952  29       8425333   779
Albania      Europe     1952  55       1282697   1601
Algeria      Africa     1952  43       9279525   2449
Angola       Africa     1952  30       4232095   3521
Argentina    Americas   1952  62       17876956  5911

Data on life expectancy, GDP per capita, and population for countries around the world

Rows are observations

country      continent  year  lifeExp  pop       gdpPercap
Afghanistan  Asia       1952  29       8425333   779
Albania      Europe     1952  55       1282697   1601
Algeria      Africa     1952  43       9279525   2449
Angola       Africa     1952  30       4232095   3521
Argentina    Americas   1952  62       17876956  5911

In a dataset, rows are observations

The data we observe for Afghanistan in the year 1952

Rows are observations

id  age  degree       race   sex
1   47   Bachelor     White  Male
2   61   High School  White  Male
3   72   Bachelor     White  Male
4   43   High School  White  Female
5   55   Graduate     White  Female

In survey data, an observation is typically a person who took the survey (a respondent)

Columns are variables

country      continent  year  lifeExp  pop       gdpPercap
Afghanistan  Asia       1952  29       8425333   779
Albania      Europe     1952  55       1282697   1601
Algeria      Africa     1952  43       9279525   2449
Angola       Africa     1952  30       4232095   3521
Argentina    Americas   1952  62       17876956  5911

In a dataset, columns are variables

Life expectancy and GDP per capita are some of the variables in our data
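A small sketch of how rows and columns can be inspected in R, assuming the {gapminder} package is installed (it provides a data frame called gapminder). The object names on the left of each arrow are illustrative.

```r
library(gapminder)

# One row = one observation (a country in a given year)
first_row <- gapminder[1, ]

# Column names = the variables recorded for every observation
variable_names <- names(gapminder)

# Number of observations (rows) and number of variables (columns)
n_obs <- nrow(gapminder)
n_vars <- ncol(gapminder)
```

Here first_row holds the observation for Afghanistan in 1952, and variable_names lists the six variables shown in the table header.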

The final graph

The grammar of graphics

Graphs have an internal logic, or grammar, that connects data to visuals

Data = variables in a dataset

Aesthetic = visual property of a graph (position, shape, color, etc.)

Geometry = representation of an aesthetic (point, line, text, etc.)

Mapping data to aesthetics

Data             Aesthetic          Geometry
GDP per capita   Position (x-axis)  Point
Life expectancy  Position (y-axis)  Point
Continent        Color              Point
Population       Size               Point

  1. Take the data,

  2. map it onto an aesthetic,

  3. and visualize it with a geometry

In R

Data       aes()  geom_
gdpPercap  x      geom_point()
lifeExp    y      geom_point()
continent  color  geom_point()
pop        size   geom_point()

Use the variable names exactly as they appear in the data, and spell the aesthetic and function names exactly as R expects them

ggplot(): our first function 😢

ggplot()

ggplot: specify the data

ggplot(data = gap_07)

Our data is named gap_07 (The Gapminder dataset for the year 2007)
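The slides do not show how gap_07 is created. One plausible way to build it, assuming the {gapminder} package is installed, is to keep only the rows for 2007; this is a sketch, not necessarily how the course materials define the object.

```r
library(gapminder)

# Assumption: gap_07 is the gapminder data restricted to the year 2007
gap_07 <- subset(gapminder, year == 2007)

# Peek at the first few rows to check the result
head(gap_07)
```

With gap_07 defined this way, the ggplot() calls on the following slides can be run as written once {ggplot2} is loaded with library(ggplot2).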

Use aes() to map variables to aesthetics

ggplot(data = gap_07, aes())

Note

aes() goes within ggplot()

Map GDP to the x-axis

ggplot(data = gap_07, aes(x = gdpPercap))

Map Life expectancy to the y-axis

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp))

Add (point) geometries using +

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

Map population to size in aes()

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop)) + 
  geom_point()

Map continent to color in aes()

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop,
                          color = continent)) +
  geom_point()

Other layers: replace the default titles with labs()

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop,
                          color = continent)) +
  geom_point() +
  labs(x = "GDP per capita")

Notice that text is placed within quotation marks!

Other layers: replace the default titles with labs()

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop,
                          color = continent)) +
  geom_point() +
  labs(x = "GDP per capita", y = "Life expectancy",
       title = "Global wealth and health in 2007")

Notice that text is placed within quotation marks!

The final formula

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp,
                          size = pop, color = continent)) +
  geom_point() +
  labs(x = "GDP per capita", y = "Life expectancy",
       title = "Global wealth and health in 2007")

  1. Tell ggplot() the data we want to plot

  2. Map all variables onto aesthetics within aes()

  3. Add layers like geom_point() and labs() using +

What’s that country way out on the bottom right?

🚨 Your turn: try labelling the points 🚨

  1. Add labels to each point by mapping country onto the label aesthetic within aes()

  2. Add a text geometry layer to your plot to display the names

The basic plot

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop, 
                          color = continent)) + 
  geom_point()

Map country names to label aesthetic

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop, 
                          color = continent, label = country)) + 
  geom_point()

Plot the labels

ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp, size = pop, 
                          color = continent, label = country)) + 
  geom_point() + 
  geom_text() 
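Because size and color are mapped inside ggplot(), geom_text() inherits them too, so the labels are also scaled by population and colored by continent. If that gets cluttered, one possible variation (a sketch, not part of the exercise solution) maps size and color only in the point layer and gives the text a fixed size. It assumes {ggplot2} and {gapminder} are installed and rebuilds gap_07 as the 2007 subset; p_labels is an illustrative name.

```r
library(ggplot2)
library(gapminder)

# Assumption: gap_07 is the gapminder data for 2007
gap_07 <- subset(gapminder, year == 2007)

p_labels <- ggplot(data = gap_07, aes(x = gdpPercap, y = lifeExp)) +
  # size and color are mapped here, so they apply to the points only
  geom_point(aes(size = pop, color = continent)) +
  # fixed text size; check_overlap = TRUE skips labels that would overlap
  # ones already drawn
  geom_text(aes(label = country), size = 3, check_overlap = TRUE)
```

The trade-off is that check_overlap hides some country names, so the plot is cleaner but no longer labels every point.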

What did we do?

Data Aesthetic Geometry
gdpPercap x geom_point()
lifeExp y geom_point()
continent color geom_point()
pop size geom_point()
country label geom_text()

Take your data, map it onto an aesthetic, represent with a geometry