Prediction

In-class example

Here’s the code we’ll be using in class. Download it and store it with the rest of your materials for this course. If simply clicking doesn’t trigger download, you should right-click and select “save link as…”

colony

How much would a noble (noble), without military experience (military) expect to pay (rprice1) for a governorship with a suitability index of .8 (suitindex) and a repartimiento (reparto2) of 98,000 pesos?

Step 1: fit model

colony_model = lm(rprice1 ~ noble + military + suitindex + reparto2, data = colony)
tidy(colony_model)
# A tibble: 5 × 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  2862.     718.          3.99  0.0000768
2 noble        -711.     689.         -1.03  0.303    
3 military    -1128.     661.         -1.71  0.0885   
4 suitindex     215.     668.          0.321 0.748    
5 reparto2        0.0260   0.00625     4.16  0.0000375

Step 2: define scenario

scen = crossing(noble = 1, military = 0, suitindex = 0.8, reparto2 = 98000)
scen
# A tibble: 1 × 4
  noble military suitindex reparto2
  <dbl>    <dbl>     <dbl>    <dbl>
1     1        0       0.8    98000

Step 3: get estimate

augment(colony_model, newdata = scen)
# A tibble: 1 × 5
  noble military suitindex reparto2 .fitted
  <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
1     1        0       0.8    98000   4869.

movies

Note: turns out the movies dataset has an issue with the way I stored the title variable that makes it very difficult to filter() with 😔. I found a solution below, but you don’t have to worry about knowing it.

Fit a model that predicts gross (outcome) using genre1, duration, budget, year, imdb_score, and whether or not it’s in color.

movies_model = lm(gross ~ genre1 + duration + budget + year + imdb_score + color, data = movies)
tidy(movies_model)
# A tibble: 22 × 5
   term                estimate  std.error statistic   p.value
   <chr>                  <dbl>      <dbl>     <dbl>     <dbl>
 1 (Intercept)       818547152. 207136694.      3.95 0.0000793
 2 genre1Adventure    10415044.   3603363.      2.89 0.00388  
 3 genre1Animation    17298146.   8886217.      1.95 0.0517   
 4 genre1Biography   -10777045.   5146622.     -2.09 0.0363   
 5 genre1Comedy        7659542.   2856661.      2.68 0.00737  
 6 genre1Crime       -14882152.   4347999.     -3.42 0.000628 
 7 genre1Documentary -10605535.   9475879.     -1.12 0.263    
 8 genre1Drama       -11407163.   3278448.     -3.48 0.000509 
 9 genre1Family      112609248.  30165053.      3.73 0.000193 
10 genre1Fantasy      11158684.   9641566.      1.16 0.247    
# ℹ 12 more rows

Look up a movie in the dataset. How well does the model predict a movie that shares that movie’s characteristics?

movies |> filter(str_detect(title, "Spider-Man 3")) ## note: there is something weird about how I coded the title of these movies that makes filter not work; here's the solution I found; you're not responsible for this solution
# A tibble: 2 × 13
  title         year decade director genre1 genre2 genre3 duration  gross budget
  <chr>        <dbl> <fct>  <chr>    <chr>  <chr>  <chr>     <dbl>  <dbl>  <dbl>
1 Spider-Man …  2007 2000s  Sam Rai… Action Adven… Roman…      156 3.37e8 2.58e8
2 Spider-Man …  2007 2000s  Sam Rai… Action Adven… Roman…      156 3.37e8 2.58e8
# ℹ 3 more variables: imdb_score <dbl>, color <chr>, content_rating <chr>

Define scenario:

spider3 = crossing(year = 2007, genre1 = "Action", duration = 156, budget = 2.58e8,
                   imdb_score = 6.2, color = "Color")

Get prediction:

augment(movies_model, newdata = spider3)
# A tibble: 1 × 7
   year genre1 duration    budget imdb_score color    .fitted
  <dbl> <chr>     <dbl>     <dbl>      <dbl> <chr>      <dbl>
1  2007 Action      156 258000000        6.2 Color 284536720.

How wrong is our model about Spider-Man 3 (in-sample)? Actual - predicted:

336530303 - 284536720
[1] 51993583

Our model under-estimated how much it grossed by $51,993,583.

How would our model do out of sample? Spider-Man: Into the Spider-Verse: https://www.imdb.com/title/tt4633694/

spiderman5 = crossing(year = 2018, genre1 = "Animation", duration = 117,
                      budget = 90000000, imdb_score = 8.4, color = "Color")
spiderman5
# A tibble: 1 × 6
   year genre1    duration   budget imdb_score color
  <dbl> <chr>        <dbl>    <dbl>      <dbl> <chr>
1  2018 Animation      117 90000000        8.4 Color

Get estimate:

augment(movies_model, newdata = spiderman5)
# A tibble: 1 × 7
   year genre1    duration   budget imdb_score color    .fitted
  <dbl> <chr>        <dbl>    <dbl>      <dbl> <chr>      <dbl>
1  2018 Animation      117 90000000        8.4 Color 152766707.

How wrong are we out of sample?

384298736 - 152766707
[1] 231532029

Our model under-estimated gross by: $231,532,029