2 - Your data budget

Survival analysis with tidymodels

Data on shelter cats

The city of Long Beach releases data on animals at the Long Beach Animal Shelter.
Our dataset is a sample of the cats entering and leaving the shelter.
Type ?cat_adoption or check the sources to learn more about this dataset, including references.

Data on shelter cats

N = 2257
A time-to-event outcome, consisting of the time spent at the shelter and the event status.
An event is a cat being homed by the Long Beach animal shelter.
If a cat is transfered to a different organization that works to home them, this is recorded as a non-event.
Several nominal variables like sex, intake type and condition, as well as fur color.
Two numeric variables for the location of the intake or capture, latitiude and logitude.

Data on shelter cats

library(tidymodels)
library(censored)
library(modeldata)

cat_adoption
#> # A tibble: 2,257 × 20
#>     time event sex    neutered intake_condition intake_type   latitude longitude
#>    <dbl> <dbl> <fct>  <fct>    <fct>            <fct>            <dbl>     <dbl>
#>  1    17     1 male   yes      fractious        owner_surren…     33.8     -118.
#>  2    98     1 male   yes      normal           stray             33.8     -118.
#>  3    15     0 male   yes      ill_moderatete   owner_surren…     33.8     -118.
#>  4    72     1 female yes      fractious        owner_surren…     33.8     -118.
#>  5    22     0 male   yes      normal           owner_surren…     33.8     -118.
#>  6    66     1 male   yes      normal           owner_surren…     33.8     -118.
#>  7   200     1 female yes      other            other             33.9     -118.
#>  8     9     0 female yes      normal           owner_surren…     33.9     -118.
#>  9    45     1 male   yes      ill_mild         stray             33.8     -118.
#> 10    38     1 male   no       ill_mild         stray             33.9     -118.
#> # ℹ 2,247 more rows
#> # ℹ 12 more variables: black <int>, brown <int>, brown_tabby <int>,
#> #   calico <int>, cream <int>, gray <int>, gray_tabby <int>, orange <int>,
#> #   orange_tabby <int>, tan <int>, tortie <int>, white <int>

Make the outcome

cat_adoption <- cat_adoption |> 
  mutate(event_time = Surv(time, event), .keep = "unused", .before = everything())

Data splitting and spending

For machine learning, we typically split data into training and test sets:

The training set is used to estimate model parameters.
The test set is used to find an independent assessment of model performance.

Do not 🚫 use the test set during training.

Your turn

When is a good time to split your data?

03:00

The initial split

set.seed(123)
cat_split <- initial_split(cat_adoption)
cat_split
#> <Training/Testing/Total>
#> <1692/565/2257>

Accessing the data

cat_train <- training(cat_split)
cat_test <- testing(cat_split)

The training set

cat_train
#> # A tibble: 1,692 × 19
#>    event_time sex     neutered intake_condition   intake_type latitude longitude
#>        <Surv> <fct>   <fct>    <fct>              <fct>          <dbl>     <dbl>
#>  1        33  male    yes      under_age_or_weig… stray           33.8     -118.
#>  2        37  female  yes      normal             stray           33.8     -118.
#>  3        65  female  yes      feral              stray           33.9     -118.
#>  4        10  male    yes      other              stray           33.8     -118.
#>  5         8  female  yes      fractious          stray           33.8     -118.
#>  6        58+ unknown unknown  normal             stray           33.8     -118.
#>  7       452  male    no       under_age_or_weig… stray           33.8     -118.
#>  8        66  male    yes      normal             owner_surr…     33.8     -118.
#>  9        25+ male    no       normal             stray           33.9     -118.
#> 10       111+ male    no       ill_mild           stray           33.8     -118.
#> # ℹ 1,682 more rows
#> # ℹ 12 more variables: black <int>, brown <int>, brown_tabby <int>,
#> #   calico <int>, cream <int>, gray <int>, gray_tabby <int>, orange <int>,
#> #   orange_tabby <int>, tan <int>, tortie <int>, white <int>

Your turn

Split your data so 20% is held out for the test set.

Extension/Challenge: This is a simple random split. Which other types of splits can you think of and does rsample offer corresponding functions?

03:00

Data splitting and spending

set.seed(123)
cat_split <- initial_split(cat_adoption, prop = 0.8)
cat_train <- training(cat_split)
cat_test <- testing(cat_split)

nrow(cat_train)
#> [1] 1805
nrow(cat_test)
#> [1] 452

Data splitting and spending

We’ll use this setup:

set.seed(27)
in_demo <- sample.int(nrow(cat_adoption), 50)
demo_cats <- cat_adoption |> slice(in_demo)

set.seed(123)
cat_split <- initial_split(cat_adoption |> slice(-in_demo), prop = 0.8)
cat_train <- training(cat_split)
cat_test <- testing(cat_split)

Exploratory data analysis for ML 🧐

Your turn

Explore the cat_train data on your own!

What does the Kaplan-Meier curve look like for the outcome, event_time?
How does event_time differ across the categorical variables?
What’s the distribution of the location information, latitude and longitude?

Extension/Challenge:
How would you visualize the relationship between the outcome
and longitude and latitude, respectively?

08:00

Time to adoption

library(ggsurvfit)
survfit(event_time ~ 1, data = cat_adoption) |> ggsurvfit()

survfit(event_time ~ neutered, data = cat_train) |> ggsurvfit()

survfit(event_time ~ brown_tabby, data = cat_train) |> ggsurvfit()

survfit(event_time ~ gray, data = cat_train) |> ggsurvfit()

Location

library(leaflet)

cat_train |> 
  leaflet() |>
  addProviderTiles("CartoDB.Positron") |>  
  addCircles(lng = ~ longitude, 
             lat = ~ latitude)

2 - Your data budget

Data on shelter cats

Data on shelter cats

Data on shelter cats

Make the outcome

Data splitting and spending

Your turn

The initial split

Accessing the data

The training set

Your turn

Data splitting and spending

Data splitting and spending

Exploratory data analysis for ML 🧐

Your turn

Time to adoption

Location

The whole game - status update