tidymodels – Now also for time-to-event data!

useR! 2024

Hannah Frick, Posit

Time-to-event data?

Customer churn

wa_churn
#> # A tibble: 7,032 × 18
#>   tenure churn female senior_citizen partner dependents phone_service
#>    <int> <fct>  <dbl>          <int>   <dbl>      <dbl>         <dbl>
#> 1      1 No         1              0       1          0             0
#> 2     34 No         0              0       0          0             1
#> 3      2 Yes        0              0       0          0             1
#> 4     45 No         0              0       0          0             0
#> 5      2 Yes        1              0       0          0             1
#> 6      8 Yes        1              0       0          0             1
#> # ℹ 7,026 more rows
#> # ℹ 11 more variables: multiple_lines <chr>, internet_service <fct>,
#> #   online_security <chr>, online_backup <chr>,
#> #   device_protection <chr>, tech_support <chr>, streaming_tv <chr>,
#> #   streaming_movies <chr>, paperless_billing <dbl>,
#> #   payment_method <fct>, monthly_charges <dbl>

What might you want to model with these data?

Let’s try to predict:

  • How long is somebody going to stay as a customer?
  • Who is likely to stop being a customer?

How long is somebody going to stay as a customer?

What if we just use the time?


That time is observation time, not time to event.

What we actually have


What if we just use the time?


If we assume that’s time-to-event, we assume everything is an event.

What if we only use the event time?


Who is likely to stop being a customer?

What if we just use the event status?


Who is likely to stop being a customer while we observe them?

Our challenge

  • Time-to-event data inherently has two aspects: time and event status.
  • Censoring: incomplete data is not missing data.
  • Regression and classification each model only one of these aspects and cannot properly account for the other.
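
A small simulation can make the problem concrete. This is only an illustrative sketch: the exponential event times, the mean of 10, and the follow-up cut-off of 5 are assumptions chosen for the example, not values from the churn data.

```r
# Simulated example (all numbers here are assumptions for illustration).
set.seed(1)
n <- 10000

true_time <- rexp(n, rate = 1 / 10)      # true time to event, mean about 10
follow_up <- 5                           # we stop observing everyone at time 5

obs_time <- pmin(true_time, follow_up)   # what we actually get to record
event    <- true_time <= follow_up       # TRUE if the event was observed

mean(true_time)        # the quantity we would like to estimate
mean(obs_time)         # "just use the time": treats censored times as events
mean(obs_time[event])  # "only use the event time": drops censored rows
mean(event)            # "just use the event status": depends on follow-up length
```

Both time-based summaries fall below the true mean, and the share of observed events would change if the follow-up window were longer or shorter.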

Survival analysis to the rescue



Survival analysis is unique because it simultaneously considers if events happened (i.e. a binary outcome) and when events happened (e.g. a continuous outcome).¹
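
In R, the two aspects are usually bundled into a single outcome with `Surv()` from the survival package. The sketch below uses the tenure and churn status of the first three customers shown earlier; the variable names `tenure`, `churned`, and `s` are chosen just for this illustration.

```r
library(survival)

# First three customers from the printed data:
# tenure 1 (no churn), tenure 34 (no churn), tenure 2 (churned)
tenure  <- c(1, 34, 2)
churned <- c(0, 0, 1)   # 1 = event observed, 0 = censored

s <- Surv(tenure, churned)

s               # censored observations are printed with a trailing "+"
s[, "time"]     # the time aspect
s[, "status"]   # the event status aspect
```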

tidymodels?




tidymodels is a framework for modelling and machine learning using tidyverse principles.

Core coverage

Extendable




Focus on the modelling question, not the infrastructure for empirical validation.




Focus on the modelling question, not the syntax.

tidymodels for survival analysis

  • Models: parametric, semi-parametric, and tree-based
  • Predictions: survival time, survival probability, hazard, and linear predictor
  • Metrics: concordance index, Brier score, integrated Brier score, AUC ROC
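
As a rough sketch of what these categories can look like in code, the specifications below pick one model of each type and collect the listed metrics in a metric set. The particular model and engine combinations are one possible choice, not the only ones available, and the object names are made up for this example.

```r
library(tidymodels)
library(censored)

# Parametric: accelerated failure time model via survival::survreg()
param_spec <- survival_reg() %>%
  set_engine("survival") %>%
  set_mode("censored regression")

# Semi-parametric: Cox proportional hazards model via survival::coxph()
ph_spec <- proportional_hazards() %>%
  set_engine("survival") %>%
  set_mode("censored regression")

# Tree-based: a single survival tree (fitting it needs extra packages)
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("censored regression")

# The metrics from the list above, as yardstick functions
survival_metrics <- metric_set(
  concordance_survival,
  brier_survival,
  brier_survival_integrated,
  roc_auc_survival
)
```

Only the model function and engine change between the three specifications; the mode stays `"censored regression"` throughout.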

Show me some code!

Customer churn

library(tidymodels)
library(censored)

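# Combine time (tenure) and event status (churn) into one Surv outcome;
# .keep = "unused" drops the original tenure and churn columns.
telco_churn <- wa_churn %>% 
  mutate(
    churn_surv = Surv(tenure, if_else(churn == "Yes", 1, 0)),
    .keep = "unused"
  )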

Split the data

set.seed(403)
telco_split <- initial_split(telco_churn)

telco_train <- training(telco_split)
telco_test <- testing(telco_split)

A single model

telco_rec <- recipe(churn_surv ~ ., data = telco_train) %>% 
  step_zv(all_predictors()) 

telco_spec <- survival_reg() %>%
  set_mode("censored regression") %>%
  set_engine("survival")

telco_wflow <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(telco_spec)

telco_fit <- fit(telco_wflow, data = telco_train)

How long is somebody going to stay as a customer?

predict(telco_fit, new_data = telco_train[1:5, ], type = "time")
#> # A tibble: 5 × 1
#>   .pred_time
#>        <dbl>
#> 1     262.  
#> 2     113.  
#> 3      43.6 
#> 4       6.55
#> 5     130.

Who is likely to stop being a customer?

pred_survival <- predict(telco_fit, new_data = telco_train[1:5, ], 
                         type = "survival", eval_time = c(12, 24))

pred_survival
#> # A tibble: 5 × 1
#>   .pred           
#>   <list>          
#> 1 <tibble [2 × 2]>
#> 2 <tibble [2 × 2]>
#> 3 <tibble [2 × 2]>
#> 4 <tibble [2 × 2]>
#> 5 <tibble [2 × 2]>

Who is likely to stop being a customer?

pred_survival$.pred[[1]]
#> # A tibble: 2 × 2
#>   .eval_time .pred_survival
#>        <dbl>          <dbl>
#> 1         12          0.931
#> 2         24          0.878
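
The nested `.pred` column can be flattened into one row per observation and evaluation time. Since the prepared churn data is not bundled here, this self-contained sketch uses the `lung` data from the survival package instead; the predictors, the evaluation times of 180 and 365 days, and all object names are assumptions made for illustration.

```r
library(tidymodels)
library(censored)

# lung codes status as 1 = censored, 2 = dead
lung_surv <- survival::lung %>%
  mutate(surv = Surv(time, status == 2), .keep = "unused") %>%
  select(surv, age, sex)

lung_fit <- survival_reg() %>%
  set_engine("survival") %>%
  set_mode("censored regression") %>%
  fit(surv ~ age + sex, data = lung_surv)

lung_pred <- predict(
  lung_fit,
  new_data = lung_surv[1:3, ],
  type = "survival",
  eval_time = c(180, 365)
)

# One row per observation and evaluation time
lung_pred_long <- lung_pred %>%
  mutate(id = row_number()) %>%
  unnest(.pred)

lung_pred_long
```

The long format is convenient for plotting or for comparing survival probabilities across observations at a given evaluation time.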

tidymodels for time-to-event data

  • Censored regression lets you use all the information you have, time and event status, together.
  • tidymodels lets you do this within a well-designed framework for predictive modelling.

tidymodels for time-to-event data


Learn more via