tidymodels – Now also for time-to-event data!

useR! 2024

Hannah Frick, Posit

Time-to-event data?

Customer churn

wa_churn

#> # A tibble: 7,032 × 18
#>   tenure churn female senior_citizen partner dependents phone_service
#>    <int> <fct>  <dbl>          <int>   <dbl>      <dbl>         <dbl>
#> 1      1 No         1              0       1          0             0
#> 2     34 No         0              0       0          0             1
#> 3      2 Yes        0              0       0          0             1
#> 4     45 No         0              0       0          0             0
#> 5      2 Yes        1              0       0          0             1
#> 6      8 Yes        1              0       0          0             1
#> # ℹ 7,026 more rows
#> # ℹ 11 more variables: multiple_lines <chr>, internet_service <fct>,
#> #   online_security <chr>, online_backup <chr>,
#> #   device_protection <chr>, tech_support <chr>, streaming_tv <chr>,
#> #   streaming_movies <chr>, paperless_billing <dbl>,
#> #   payment_method <fct>, monthly_charges <dbl>

What might you want to model with these data?

Let’s try to predict:

How long is somebody going to stay as a customer?

Who is likely to stop being a customer?

How long is somebody going to stay as a customer?

What if we just use the time?

That time is observation time, not time to event.

What we actually have

What if we just use the time?

If we assume that’s time-to-event, we assume everything is an event.

What if we only use the event time?

Who is likely to stop being a customer?

What if we just use the event status?

Who is likely to stop being a customer while we observe them?

Our challenge

Time-to-event data inherently has two aspects: time and event status.
Censoring: incomplete data is not missing data.

With regression or classification alone, we can model only one of these aspects at a time, and neither can properly account for the other.
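A small simulation can make the problem with the naive approaches concrete. This is only a sketch on made-up data, not the churn data: the names `event_time`, `censor_time`, and `observed_time` and the chosen distributions are assumptions for illustration.

```r
set.seed(1)
n <- 1000

# Hypothetical "true" time until churn (mean 50) and the time at which
# we stop observing each customer
event_time  <- rexp(n, rate = 1 / 50)
censor_time <- runif(n, min = 0, max = 60)

# What we actually get to see: the earlier of the two, plus an event flag
observed_time <- pmin(event_time, censor_time)
event         <- event_time <= censor_time

c(
  true_mean        = mean(event_time),            # what we want to learn
  all_as_events    = mean(observed_time),         # regression on the time only
  events_only      = mean(observed_time[event]),  # regression on event times only
  share_with_event = mean(event)                  # classification on the status only
)
```

Both time-based summaries fall below the true mean, and the share of events depends on how long we happened to observe, which is why neither aspect can be modelled in isolation.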

Survival analysis to the rescue

Survival analysis is unique because it simultaneously considers if events happened (i.e. a binary outcome) and when events happened (e.g. a continuous outcome).¹
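In R, the two aspects are usually bundled into a single outcome with `Surv()` from the survival package. As a minimal sketch, here are the first six rows of `wa_churn` as printed earlier, typed in by hand (the vector names `tenure` and `churned` are just for illustration):

```r
library(survival)

# First six customers from the printed tibble
tenure  <- c(1, 34, 2, 45, 2, 8)
churned <- c("No", "No", "Yes", "No", "Yes", "Yes")

# Time and event status together; a logical event is TRUE when the event occurred
churn_surv <- Surv(time = tenure, event = churned == "Yes")
churn_surv
```

Censored observations are printed with a trailing `+`: we know the customer stayed at least that long, not that they left at that time.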

tidymodels?

tidymodels is a framework for modelling and machine learning using tidyverse principles.

Core coverage

Extendable

Focus on the modelling question, not the infrastructure for empirical validation.

Focus on the modelling question, not the syntax.

tidymodels for survival analysis

Models:
parametric, semi-parametric, and tree-based
Predictions:
survival time, survival probability, hazard, and linear predictor
Metrics:
concordance index, Brier score, integrated Brier score, AUC ROC
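To give a feel for one of these metrics: the concordance index is the proportion of comparable pairs of observations in which the model orders the two the same way as the observed outcomes. The sketch below calls the survival package directly on its built-in `lung` data, so it illustrates the idea rather than the tidymodels interface; the choice of data and predictors is an assumption made for the example.

```r
library(survival)

# Weibull model on the `lung` data that ships with survival
# (status is coded 1 = censored, 2 = dead)
fit <- survreg(Surv(time, status) ~ age + sex, data = lung, dist = "weibull")

# Concordance between predicted and observed ordering, accounting for censoring
cindex <- concordance(fit)
cindex$concordance
```

A value of 0.5 corresponds to random ordering and 1 to perfect ordering.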

tidymodels for survival analysis

Show me some code!

Customer churn

library(tidymodels)
library(censored)

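# Combine time (tenure) and event status (churn) into a single Surv outcome;
# .keep = "unused" drops the two original columns afterwards
telco_churn <- wa_churn %>% 
  mutate(
    churn_surv = Surv(tenure, if_else(churn == "Yes", 1, 0)),
    .keep = "unused"
  )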

Split the data

set.seed(403)
telco_split <- initial_split(telco_churn)

telco_train <- training(telco_split)
telco_test <- testing(telco_split)

A single model

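# Preprocessing: model the Surv outcome on all predictors,
# dropping any zero-variance columns
telco_rec <- recipe(churn_surv ~ ., data = telco_train) %>% 
  step_zv(all_predictors())

# Model: parametric survival regression via the survival package
telco_spec <- survival_reg() %>%
  set_mode("censored regression") %>%
  set_engine("survival")

# Bundle preprocessing and model so they are fitted together
telco_wflow <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(telco_spec)

telco_fit <- fit(telco_wflow, data = telco_train)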
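For intuition about what this specification fits: the `"survival"` engine for `survival_reg()` uses `survival::survreg()`, with a Weibull distribution by default. The sketch below calls the engine directly on the `lung` data that ships with survival (an assumption for illustration, since it avoids depending on the churn data), and shows how a predicted time and a survival probability follow from the fitted linear predictor. The helper `surv_at()` is hypothetical and written just for this example.

```r
library(survival)

# Weibull accelerated failure time model, fitted with the engine directly
fit <- survreg(Surv(time, status) ~ age + sex, data = lung, dist = "weibull")

new_obs <- lung[1:3, ]

# Linear predictor and predicted time; for the Weibull, time = exp(lp)
lp        <- predict(fit, newdata = new_obs, type = "lp")
pred_time <- predict(fit, newdata = new_obs, type = "response")

# Weibull survival function: S(t) = exp(-(t / exp(lp))^(1 / scale))
surv_at <- function(t) exp(-(t / exp(lp))^(1 / fit$scale))

surv_at(180)  # probability of surviving beyond 180 days
surv_at(365)  # probability of surviving beyond 365 days
```

The same fitted model answers both questions: `pred_time` addresses "how long", and `surv_at()` addresses "how likely at a given time".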

How long is somebody going to stay as a customer?

predict(telco_fit, new_data = telco_train[1:5, ], type = "time")
#> # A tibble: 5 × 1
#>   .pred_time
#>        <dbl>
#> 1     262.  
#> 2     113.  
#> 3      43.6 
#> 4       6.55
#> 5     130.

Who is likely to stop being a customer?

pred_survival <- predict(telco_fit, new_data = telco_train[1:5, ], 
                         type = "survival", eval_time = c(12, 24))

pred_survival
#> # A tibble: 5 × 1
#>   .pred           
#>   <list>          
#> 1 <tibble [2 × 2]>
#> 2 <tibble [2 × 2]>
#> 3 <tibble [2 × 2]>
#> 4 <tibble [2 × 2]>
#> 5 <tibble [2 × 2]>

Who is likely to stop being a customer?

pred_survival$.pred[[1]]
#> # A tibble: 2 × 2
#>   .eval_time .pred_survival
#>        <dbl>          <dbl>
#> 1         12          0.931
#> 2         24          0.878
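Because `.pred` is a list column, it is often handy to unnest it into one row per observation and evaluation time. The sketch below rebuilds a one-row stand-in for `pred_survival` from the values printed above so that it runs on its own; the `.row` and `.pred_churn` columns are names introduced here for illustration.

```r
library(tibble)
library(tidyr)

# Stand-in for `pred_survival`, using the first observation's printed values
pred_survival <- tibble(
  .pred = list(
    tibble(.eval_time = c(12, 24), .pred_survival = c(0.931, 0.878))
  )
)

# Keep track of which observation each prediction belongs to, then unnest
pred_long <- rowid_to_column(pred_survival, var = ".row")
pred_long <- unnest(pred_long, cols = .pred)

# Probability of having churned by each evaluation time
pred_long$.pred_churn <- 1 - pred_long$.pred_survival
pred_long
```

With the real `pred_survival`, the same two steps would give ten rows: five customers at two evaluation times each.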

tidymodels for time-to-event data

Censored regression lets you use both pieces of information, time and event status, together.

tidymodels lets you do this within a well-designed framework for predictive modelling.

tidymodels for time-to-event data

Learn more via

Articles on tidymodels.org/learn with the survival analysis tag

The useR! tutorial: Survival analysis with tidymodels