Survival analysis is coming to tidymodels!

SatRdays London 2024

Hannah Frick

tidymodels can do survival analysis

tidymodels can deal with time-to-event data

Why tidymodels?




tidymodels is a framework for modelling and machine learning using tidyverse principles.




Focus on the modelling question, not the syntax.




Focus on the modelling question, not the infrastructure for empirical validation.

Extensive

resampling, preprocessing, models, metrics, tuning strategies

Extendable

Why survival analysis?

Customer churn

wa_churn
#> # A tibble: 7,032 × 18
#>   tenure churn female senior_citizen partner dependents phone_service
#>    <int> <fct>  <dbl>          <int>   <dbl>      <dbl>         <dbl>
#> 1      1 No         1              0       1          0             0
#> 2     34 No         0              0       0          0             1
#> 3      2 Yes        0              0       0          0             1
#> 4     45 No         0              0       0          0             0
#> 5      2 Yes        1              0       0          0             1
#> 6      8 Yes        1              0       0          0             1
#> # ℹ 7,026 more rows
#> # ℹ 11 more variables: multiple_lines <chr>, internet_service <fct>,
#> #   online_security <chr>, online_backup <chr>,
#> #   device_protection <chr>, tech_support <chr>, streaming_tv <chr>,
#> #   streaming_movies <chr>, paperless_billing <dbl>,
#> #   payment_method <fct>, monthly_charges <dbl>

What might you want to model with these data?

Let’s try to predict:

  • How long is somebody going to stay as a customer?
  • Who is likely to stop being a customer?

How long is somebody going to stay as a customer?

What if we just use the time?

That time is observation time, not time to event.

What if we just use the time?

If we assume that’s time-to-event, we assume everything is an event.

What we actually have

Uncomfy

If we use regression to model time-to-event data, we might

  • answer a different question
  • make wrong assumptions
  • waste information
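A small simulation can make these pitfalls concrete. This is only a sketch under stated assumptions: the exponential churn times and the uniform observation windows are hypothetical and do not come from the churn data. It compares the naive mean of the observed times with the true mean, then estimates the median with a Kaplan-Meier fit that uses both time and status.

```r
library(survival)

set.seed(1)
n <- 1000

# Hypothetical truth: time to churn in months, mean 24
event_time <- rexp(n, rate = 1 / 24)
# Hypothetical observation window: we stop watching each customer at some point
censor_time <- runif(n, min = 0, max = 36)

# What we actually record: the shorter of the two, plus whether churn was seen
obs_time <- pmin(event_time, censor_time)
status <- as.integer(event_time <= censor_time)  # 1 = event, 0 = censored

# Treating observation time as time-to-event underestimates how long customers stay
naive_mean <- mean(obs_time)
true_mean <- mean(event_time)
c(naive = naive_mean, truth = true_mean)

# A Kaplan-Meier fit uses both time and status to estimate the median time to churn
km <- survfit(Surv(obs_time, status) ~ 1)
km_median <- unname(summary(km)$table["median"])
c(km = km_median, truth = median(event_time))
```

Because the observed time is the minimum of event time and censoring time, it can never exceed the true time to event, so the naive mean is biased downwards.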

Who is likely to stop being a customer?

What if we just use the event status?

Who is likely to stop being a customer while we observe them?

Uncomfy



If we use classification to model (time-to-)event data, we ignore the (possibly wildly) different observation lengths.
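To see why observation length matters, here is a sketch with two made-up cohorts that follow exactly the same (hypothetical) churn process but are observed for 6 and 36 months respectively. The event status alone makes them look very different.

```r
set.seed(2)
n <- 2000

# Hypothetical: every customer follows the same churn process (mean 24 months)
event_time <- rexp(n, rate = 1 / 24)
# Two cohorts that differ only in how long we observed them (in months)
follow_up <- rep(c(6, 36), each = n / 2)

# The "classification" outcome: did we see churn while observing?
churned <- event_time <= follow_up

# Share of observed churners per cohort
churn_rate <- tapply(churned, follow_up, mean)
churn_rate
```

A classifier trained on `churned` would pick up the length of follow-up, not a difference in how likely customers are to leave.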

Our challenge

  • Time-to-event data inherently has two aspects: time and event status.
  • Censoring: incomplete data is not missing data.
  • Regression and classification each model only one of these aspects and cannot properly account for the other.



Survival analysis is unique because it simultaneously considers if events happened (i.e. a binary outcome) and when events happened (e.g. a continuous outcome). [1]

Customer churn

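library(tidymodels)  # dplyr, rsample, recipes, parsnip, workflows
library(survival)    # Surv()
library(censored)    # parsnip engines for censored regression

# Combine time and event status into one survival outcome
telco_churn <- wa_churn %>% 
  mutate(
    churn_surv = Surv(tenure, if_else(churn == "Yes", 1, 0)),
    .keep = "unused"
  )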
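The Surv() call above is what ties time and event status together into a single outcome. A minimal sketch with three made-up customers (the values are hypothetical) shows what such an object looks like:

```r
library(survival)

# Hypothetical customers: tenure in months and whether churn was observed
tenure <- c(5, 12, 20)
churned <- c(1, 0, 1)  # 1 = churned, 0 = still a customer (censored)

s <- Surv(tenure, churned)
s  # censored times are printed with a trailing "+"

# Both aspects remain accessible
s[, "time"]
s[, "status"]
```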

Split the data

set.seed(403)
telco_split <- initial_split(telco_churn)

telco_train <- training(telco_split)
telco_test <- testing(telco_split)
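For reference, initial_split() puts 3/4 of the rows into the training set unless told otherwise. The sketch below makes that proportion explicit, using mtcars as a stand-in for the churn data and hypothetical object names.

```r
library(rsample)

set.seed(403)
# mtcars (32 rows) stands in for the churn data here
car_split <- initial_split(mtcars, prop = 3/4)  # 3/4 is also the default

car_train <- training(car_split)
car_test <- testing(car_split)

# Every row ends up in exactly one of the two sets
c(train = nrow(car_train), test = nrow(car_test))
```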

A single model

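# Preprocessing: drop predictors that have only a single value
telco_rec <- recipe(churn_surv ~ ., data = telco_train) %>%
  step_zv(all_predictors())

# Model: parametric survival regression via the survival package
telco_spec <- survival_reg() %>%
  set_mode("censored regression") %>%
  set_engine("survival")

# Bundle preprocessing and model into one workflow
telco_wflow <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(telco_spec)

# Fit the whole workflow on the training data
telco_fit <- fit(telco_wflow, data = telco_train)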
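As far as the parsnip documentation describes it, survival_reg() with the "survival" engine wraps survival::survreg(), with a Weibull distribution as the default. The sketch below fits that kind of model directly, assuming the lung data that ships with the survival package as a stand-in for the churn data.

```r
library(survival)

# lung ships with the survival package; status is coded 1 = censored, 2 = dead
fit <- survreg(Surv(time, status) ~ age + sex, data = lung, dist = "weibull")

# Coefficients are on the log-time scale (accelerated failure time model)
coef(fit)

# Scale parameter of the fitted distribution
fit$scale
```

The workflow adds the preprocessing and a consistent prediction interface on top of a fit like this.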

How long is somebody going to stay as a customer?

predict(telco_fit, new_data = telco_train[1:5, ], type = "time")
#> # A tibble: 5 × 1
#>   .pred_time
#>        <dbl>
#> 1     262.  
#> 2     113.  
#> 3      43.6 
#> 4       6.55
#> 5     130.

Who is likely to stop being a customer?

pred_survival <- predict(telco_fit, new_data = telco_train[1:5, ], 
                         type = "survival", eval_time = c(12, 24))

pred_survival
#> # A tibble: 5 × 1
#>   .pred           
#>   <list>          
#> 1 <tibble [2 × 2]>
#> 2 <tibble [2 × 2]>
#> 3 <tibble [2 × 2]>
#> 4 <tibble [2 × 2]>
#> 5 <tibble [2 × 2]>
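Where do survival probabilities at given evaluation times come from? For a Weibull model fitted with survival::survreg(), they can be computed by hand from the linear predictor and the scale. This is a sketch on the lung data from the survival package (a stand-in, not the churn data), and it assumes the Weibull default rather than showing what predict() does internally.

```r
library(survival)

fit <- survreg(Surv(time, status) ~ age + sex, data = lung, dist = "weibull")

new_patients <- lung[1:3, ]
eval_time <- c(180, 365)  # days

# Map survreg's parameterisation onto pweibull():
# shape = 1 / fit$scale, scale = exp(linear predictor)
lp <- predict(fit, newdata = new_patients, type = "lp")

# One column per evaluation time, one row per patient
surv_prob <- sapply(eval_time, function(t) {
  pweibull(t, shape = 1 / fit$scale, scale = exp(lp), lower.tail = FALSE)
})
colnames(surv_prob) <- paste0("t_", eval_time)
surv_prob
```

One minus the survival probability is the probability that the event has happened by that time, which is the quantity behind "who is likely to stop being a customer?".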

Who is likely to stop being a customer?

pred_survival$.pred[[1]]
#> # A tibble: 2 × 2
#>   .eval_time .pred_survival
#>        <dbl>          <dbl>
#> 1         12          0.931
#> 2         24          0.878
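The nested tibbles can be flattened into one row per customer and evaluation time with tidyr::unnest(). The sketch below uses a hand-built stand-in for the prediction tibble: the first row reuses the two values shown above, the second row and the `customer` column are made up for illustration.

```r
library(tibble)
library(tidyr)

# Hand-built stand-in for the prediction tibble: one nested tibble per customer
preds <- tibble(
  .pred = list(
    tibble(.eval_time = c(12, 24), .pred_survival = c(0.931, 0.878)),
    tibble(.eval_time = c(12, 24), .pred_survival = c(0.800, 0.650))  # made up
  )
)

# Add an id so rows stay identifiable after flattening
preds <- rowid_to_column(preds, "customer")

# One row per customer and evaluation time
preds_long <- unnest(preds, cols = .pred)
preds_long
```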

tidymodels for survival analysis

  • Models:
    parametric, semi-parametric, and tree-based
  • Predictions:
    survival time, survival probability, hazard, and linear predictor
  • Metrics:
    concordance index, Brier score, integrated Brier score, ROC AUC
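In tidymodels these metrics live in yardstick (for example concordance_survival() and brier_survival()). To give a feel for what a concordance index measures, here is a sketch that computes it with the survival package directly, again assuming the lung data as a stand-in.

```r
library(survival)

fit <- survreg(Surv(time, status) ~ age + sex, data = lung, dist = "weibull")

# Concordance: share of comparable pairs that the model orders correctly
cindex <- concordance(fit)
cindex$concordance
```

A value of 0.5 corresponds to random ordering and 1 to perfect ordering.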

tidymodels for survival analysis


Learn more via articles on tidymodels.org