2 - Your data budget

Survival analysis with tidymodels

Data on shelter cats

  • The city of Long Beach releases data on animals at the Long Beach Animal Shelter.
  • Our dataset is a sample of the cats entering and leaving the shelter.
  • Type ?cat_adoption or check the sources to learn more about this dataset, including references.

Data on shelter cats

  • N = 2257
  • A time-to-event outcome, consisting of the time spent at the shelter and the event status.
  • An event is a cat being homed by the Long Beach animal shelter.
  • If a cat is transfered to a different organization that works to home them, this is recorded as a non-event.
  • Several nominal variables like sex, intake type and condition, as well as fur color.
  • Two numeric variables for the location of the intake or capture, latitiude and logitude.

Data on shelter cats

library(tidymodels)
library(censored)
library(modeldata)

cat_adoption
#> # A tibble: 2,257 × 20
#>     time event sex    neutered intake_condition intake_type   latitude longitude
#>    <dbl> <dbl> <fct>  <fct>    <fct>            <fct>            <dbl>     <dbl>
#>  1    17     1 male   yes      fractious        owner_surren…     33.8     -118.
#>  2    98     1 male   yes      normal           stray             33.8     -118.
#>  3    15     0 male   yes      ill_moderatete   owner_surren…     33.8     -118.
#>  4    72     1 female yes      fractious        owner_surren…     33.8     -118.
#>  5    22     0 male   yes      normal           owner_surren…     33.8     -118.
#>  6    66     1 male   yes      normal           owner_surren…     33.8     -118.
#>  7   200     1 female yes      other            other             33.9     -118.
#>  8     9     0 female yes      normal           owner_surren…     33.9     -118.
#>  9    45     1 male   yes      ill_mild         stray             33.8     -118.
#> 10    38     1 male   no       ill_mild         stray             33.9     -118.
#> # ℹ 2,247 more rows
#> # ℹ 12 more variables: black <int>, brown <int>, brown_tabby <int>,
#> #   calico <int>, cream <int>, gray <int>, gray_tabby <int>, orange <int>,
#> #   orange_tabby <int>, tan <int>, tortie <int>, white <int>

Make the outcome

cat_adoption <- cat_adoption %>% 
  mutate(event_time = Surv(time, event), .keep = "unused", .before = everything()) 

Data splitting and spending

For machine learning, we typically split data into training and test sets:

  • The training set is used to estimate model parameters.
  • The test set is used to find an independent assessment of model performance.

Do not 🚫 use the test set during training.

Your turn

When is a good time to split your data?

03:00

The initial split

set.seed(123)
cat_split <- initial_split(cat_adoption)
cat_split
#> <Training/Testing/Total>
#> <1692/565/2257>

Accessing the data

cat_train <- training(cat_split)
cat_test <- testing(cat_split)

The training set

cat_train
#> # A tibble: 1,692 × 19
#>    event_time sex     neutered intake_condition   intake_type latitude longitude
#>        <Surv> <fct>   <fct>    <fct>              <fct>          <dbl>     <dbl>
#>  1        33  male    yes      under_age_or_weig… stray           33.8     -118.
#>  2        37  female  yes      normal             stray           33.8     -118.
#>  3        65  female  yes      feral              stray           33.9     -118.
#>  4        10  male    yes      other              stray           33.8     -118.
#>  5         8  female  yes      fractious          stray           33.8     -118.
#>  6        58+ unknown unknown  normal             stray           33.8     -118.
#>  7       452  male    no       under_age_or_weig… stray           33.8     -118.
#>  8        66  male    yes      normal             owner_surr…     33.8     -118.
#>  9        25+ male    no       normal             stray           33.9     -118.
#> 10       111+ male    no       ill_mild           stray           33.8     -118.
#> # ℹ 1,682 more rows
#> # ℹ 12 more variables: black <int>, brown <int>, brown_tabby <int>,
#> #   calico <int>, cream <int>, gray <int>, gray_tabby <int>, orange <int>,
#> #   orange_tabby <int>, tan <int>, tortie <int>, white <int>

Your turn

Split your data so 20% is held out for the test set.


Extension/Challenge: This is a simple random split. Which other types of splits can you think of and does rsample offer corresponding functions?

03:00

Data splitting and spending

set.seed(123)
cat_split <- initial_split(cat_adoption, prop = 0.8)
cat_train <- training(cat_split)
cat_test <- testing(cat_split)

nrow(cat_train)
#> [1] 1805
nrow(cat_test)
#> [1] 452

Data splitting and spending

We’ll use this setup:

set.seed(27)
in_demo <- sample.int(nrow(cat_adoption), 50)
demo_cats <- cat_adoption %>% slice(in_demo)

set.seed(123)
cat_split <- initial_split(cat_adoption %>% slice(-in_demo), prop = 0.8)
cat_train <- training(cat_split)
cat_test <- testing(cat_split)

Exploratory data analysis for ML 🧐

Your turn

Explore the cat_train data on your own!

  • What does the Kaplan-Meier curve look like for the outcome, event_time?
  • How does event_time differ across the categorical variables?
  • What’s the distribution of the location information, latitude and longitude?


Extension/Challenge:
How would you visualize the relationship between the outcome
and longitude and latitude, respectively?

08:00

Time to adoption

library(ggsurvfit)
survfit(event_time ~ 1, data = cat_adoption) %>% ggsurvfit()

survfit(event_time ~ neutered, data = cat_train) %>% ggsurvfit()

survfit(event_time ~ brown_tabby, data = cat_train) %>% ggsurvfit()

survfit(event_time ~ gray, data = cat_train) %>% ggsurvfit()

Location

library(leaflet)

cat_train %>% 
  leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%  
  addCircles(lng = ~ longitude, 
             lat = ~ latitude)

The whole game - status update