Survival analysis is coming to tidymodels!

SatRdays London 2024

Hannah Frick

tidymodels can do survival analysis

tidymodels can deal with time-to-event data

Why tidymodels?

tidymodels is a framework for modelling and machine learning using tidyverse principles.

Focus on the modelling question,

not the syntax.

Focus on the modelling question,

not the infrastructure for
empirical validation.

Extensive

resampling, preprocessing, models, metrics, tuning strategies

Extendable

Why survival analysis?

Customer churn

wa_churn

#> # A tibble: 7,032 × 18
#>   tenure churn female senior_citizen partner dependents phone_service
#>    <int> <fct>  <dbl>          <int>   <dbl>      <dbl>         <dbl>
#> 1      1 No         1              0       1          0             0
#> 2     34 No         0              0       0          0             1
#> 3      2 Yes        0              0       0          0             1
#> 4     45 No         0              0       0          0             0
#> 5      2 Yes        1              0       0          0             1
#> 6      8 Yes        1              0       0          0             1
#> # ℹ 7,026 more rows
#> # ℹ 11 more variables: multiple_lines <chr>, internet_service <fct>,
#> #   online_security <chr>, online_backup <chr>,
#> #   device_protection <chr>, tech_support <chr>, streaming_tv <chr>,
#> #   streaming_movies <chr>, paperless_billing <dbl>,
#> #   payment_method <fct>, monthly_charges <dbl>

What might you want to model with these data?

Let’s try to predict:

How long is somebody going to stay as a customer?

Who is likely to stop being a customer?

How long is somebody going to stay as a customer?

What if we just use the time?

That time is observation time, not time to event.

What if we just use the time?

If we assume that’s time-to-event, we assume everything is an event.

What we actually have

Uncomfy

If we use regression to model time-to-event data, we might

answer a different question
make wrong assumptions
waste information
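The "wrong assumptions" point can be illustrated with a small simulation. This is only a sketch and not part of the talk: the exponential event times (true mean 20), the uniform observation window, and all variable names are assumptions chosen for illustration. It compares the naive mean of the observed times with an estimate from a censoring-aware model.

```r
library(survival)

set.seed(1)
n <- 5000

# Hypothetical data-generating process (assumption for illustration only)
true_time   <- rexp(n, rate = 1 / 20)    # true time to event, mean 20
censor_time <- runif(n, min = 0, max = 30)  # when we stop observing each subject

# What we actually get to see
time   <- pmin(true_time, censor_time)           # observation time
status <- as.integer(true_time <= censor_time)   # 1 = event, 0 = censored

# Naive approach: treat every observation time as a time to event
naive_mean <- mean(time)

# Censoring-aware approach: intercept-only exponential survival model;
# for this model exp(intercept) estimates the mean time to event
surv_fit  <- survreg(Surv(time, status) ~ 1, dist = "exponential")
surv_mean <- unname(exp(coef(surv_fit)))

c(naive = naive_mean, survival = surv_mean)
```

The naive mean falls well below the true mean of 20 because censored observations are counted as if the event had happened at the end of observation, while the survival model recovers a value close to 20.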

Who is likely to stop being a customer?

What if we just use the event status?

Who is likely to stop being a customer while we observe them?

Uncomfy

If we use classification to model (time-to-)event data, we

ignore the (possibly wildly) different observation length.
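A minimal sketch of why observation length matters for event status, again using made-up simulated data rather than anything from the talk. Both groups share the same time-to-event distribution; only the hypothetical observation length (6 versus 36 time units) differs.

```r
set.seed(2)
n <- 5000

# Same underlying time-to-event distribution for everyone (assumed: mean 20)
true_time <- rexp(2 * n, rate = 1 / 20)

# Hypothetical observation lengths: half observed for 6 units, half for 36
obs_length <- rep(c(6, 36), each = n)

# Event status as a classifier would see it: did the event occur while observed?
event <- true_time <= obs_length

# Share of observed events per observation length
event_share <- tapply(event, obs_length, mean)
event_share
```

The share of events is much higher in the group observed for longer, even though the underlying risk is identical. A classifier trained on event status alone would attribute that difference to whatever predictors happen to correlate with observation length.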

Our challenge

Time-to-event data inherently has two aspects: time and event status.
Censoring: incomplete data is not missing data.

Regression and classification each model only one of these aspects and cannot properly account for the other.

Survival analysis is unique because it simultaneously considers if events happened (i.e. a binary outcome) and when events happened (e.g. a continuous outcome).¹

Customer churn

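library(tidymodels)
library(censored)  # survival-analysis engines for parsnip; also attaches survival for Surv()

# Combine time (tenure) and event status (churn) into a single Surv outcome
telco_churn <- wa_churn %>% 
  mutate(
    churn_surv = Surv(tenure, if_else(churn == "Yes", 1, 0)),
    .keep = "unused"
  )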
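To see what `Surv()` stores, here is a small standalone sketch using the tenure and churn values of the first three rows printed earlier. The vectors `tenure` and `churn` are re-typed by hand for illustration.

```r
library(survival)

# First three rows of the printed data: tenure 1, 34, 2 with churn No, No, Yes
tenure <- c(1, 34, 2)
churn  <- factor(c("No", "No", "Yes"))

# Same construction as in the slide, on the small hand-typed vectors
churn_surv <- Surv(tenure, ifelse(churn == "Yes", 1, 0))

# Censored observations (no churn observed yet) print with a trailing "+"
print(churn_surv)

# The object keeps both aspects: time and event status
churn_surv[, "time"]
churn_surv[, "status"]
```

The two customers who have not churned are kept in the data as censored observations instead of being dropped or treated as events.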

Split the data

set.seed(403)
telco_split <- initial_split(telco_churn)

telco_train <- training(telco_split)
telco_test <- testing(telco_split)

A single model

telco_rec <- recipe(churn_surv ~ ., data = telco_train) %>% 
  step_zv(all_predictors()) 

telco_spec <- survival_reg() %>%
  set_mode("censored regression") %>%
  set_engine("survival")

telco_wflow <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(telco_spec)

telco_fit <- fit(telco_wflow, data = telco_train)

How long is somebody going to stay as a customer?

predict(telco_fit, new_data = telco_train[1:5, ], type = "time")
#> # A tibble: 5 × 1
#>   .pred_time
#>        <dbl>
#> 1     262.  
#> 2     113.  
#> 3      43.6 
#> 4       6.55
#> 5     130.

Who is likely to stop being a customer?

pred_survival <- predict(telco_fit, new_data = telco_train[1:5, ], 
                         type = "survival", eval_time = c(12, 24))

pred_survival
#> # A tibble: 5 × 1
#>   .pred           
#>   <list>          
#> 1 <tibble [2 × 2]>
#> 2 <tibble [2 × 2]>
#> 3 <tibble [2 × 2]>
#> 4 <tibble [2 × 2]>
#> 5 <tibble [2 × 2]>

Who is likely to stop being a customer?

pred_survival$.pred[[1]]
#> # A tibble: 2 × 2
#>   .eval_time .pred_survival
#>        <dbl>          <dbl>
#> 1         12          0.931
#> 2         24          0.878
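The survival predictions come back as a list column with one small tibble per row. A sketch of how to flatten them into one long tibble follows. To stay self-contained it uses the `lung` data from the survival package instead of the churn data, and the predictors, evaluation times, and object names are assumptions for illustration.

```r
library(tidymodels)
library(censored)
library(survival)

# Stand-in model on survival::lung (not the churn model from the slides)
lung_fit <- survival_reg() %>%
  set_mode("censored regression") %>%
  set_engine("survival") %>%
  fit(Surv(time, status) ~ age + sex, data = lung)

# One nested tibble of survival probabilities per row of new_data
lung_pred <- predict(lung_fit, new_data = lung[1:3, ],
                     type = "survival", eval_time = c(100, 200))

# Keep track of the original row, then unnest into long format:
# one row per (observation, evaluation time) pair
pred_long <- lung_pred %>%
  mutate(.row = row_number()) %>%
  unnest(cols = .pred)

pred_long
```

The long format is convenient for plotting survival curves per observation or for filtering to a single evaluation time.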

tidymodels for survival analysis

Models:
parametric, semi-parametric, and tree-based
Predictions:
survival time, survival probability, hazard, and linear predictor
Metrics:
concordance index, Brier score, integrated Brier score, AUC ROC
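As a rough sketch of how two of these metrics might be computed with yardstick, the example below again uses `survival::lung` as a stand-in for the churn data. The predictors, evaluation times, and object names are assumptions, and the model is evaluated on its own training data purely to keep the example short, so the numbers are optimistic.

```r
library(tidymodels)
library(censored)
library(survival)

# Build a Surv outcome column, mirroring the churn_surv construction
lung_surv <- lung %>%
  mutate(surv = Surv(time, status), .keep = "unused") %>%
  select(surv, age, sex)

lung_fit <- survival_reg() %>%
  set_mode("censored regression") %>%
  set_engine("survival") %>%
  fit(surv ~ age + sex, data = lung_surv)

# augment() adds predicted survival probabilities at each evaluation time
# (.pred, a list column) and predicted survival time (.pred_time)
lung_aug <- augment(lung_fit, new_data = lung_surv, eval_time = c(100, 200))

# Brier score: one value per evaluation time, lower is better
brier <- brier_survival(lung_aug, truth = surv, .pred)

# Concordance index: a single value based on the predicted survival time
cindex <- concordance_survival(lung_aug, truth = surv, estimate = .pred_time)

brier
cindex
```

In practice these metrics would be computed on held-out data, for example the test set from the split shown earlier, or across resamples.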

tidymodels for survival analysis

Learn more via articles on tidymodels.org