tidymodels for time-to-event data

EARL 2024

Hannah Frick, Posit

How long is somebody going to stay as a customer?

What if we just use the time?

That time is observation time, not time to event.

What we actually have

What if we just use the time?

If we assume that’s time-to-event, we assume everything is an event.

What if we discard the censored observations?
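The two shortcuts above can be checked with a small simulation. This is a minimal sketch on made-up data, not the talk's example: the exponential event times, the uniform follow-up window, and all variable names are assumptions chosen for illustration. It compares the median time-to-event from each shortcut with a Kaplan-Meier estimate from the survival package.

```r
library(survival)

set.seed(1)
n <- 5000

# Hypothetical data: true time-to-event is exponential with mean 24 months,
# and each customer has been observed for a random 0 to 36 months so far.
true_time <- rexp(n, rate = 1 / 24)
follow_up <- runif(n, min = 0, max = 36)

obs_time <- pmin(true_time, follow_up)        # the time we actually record
event <- as.numeric(true_time <= follow_up)   # 1 = event seen, 0 = censored

# True median of the exponential distribution used above
true_median <- 24 * log(2)

# Shortcut 1: treat every observed time as a time-to-event
naive_all <- median(obs_time)

# Shortcut 2: discard the censored observations
naive_events <- median(obs_time[event == 1])

# Kaplan-Meier estimate, which uses both time and event status
km_fit <- survfit(Surv(obs_time, event) ~ 1)
km_median <- unname(quantile(km_fit, probs = 0.5, conf.int = FALSE))

round(c(true = true_median, all_as_events = naive_all,
        events_only = naive_events, kaplan_meier = km_median), 1)
```

Under these assumptions, both shortcuts should underestimate the median, while the Kaplan-Meier estimate should land close to the true value.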

Who is likely to stop being a customer?

What if we just use the event status?

Who is likely to stop being a customer while we observe them?

Our challenge

Our outcome has two aspects: time and event status.
Our outcome may be censored: incomplete data is not missing data.

Regression and classification are not directly equipped to deal with either challenge.

Survival analysis to the rescue

Survival analysis is unique because it simultaneously considers if events happened (i.e. a binary outcome) and when events happened (e.g. a continuous outcome).¹

Let’s try time windows

Probability over time

Two central ideas of survival analysis

Model the survival curve (or derivatives) to capture time and event status.
Censored observations are partially included, rather than discarded.
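One way to see how censored observations are "partially included" is the Kaplan-Meier estimator of the survival curve, sketched below on six made-up customers (the values and variable names are hypothetical, and the talk itself may have illustrated the idea differently). A censored customer counts toward the number at risk up to their censoring time and drops out afterwards, without ever being counted as an event.

```r
library(survival)

# Hypothetical toy data: time in months, status 1 = churned, 0 = censored
months <- c(2, 3, 3, 5, 8, 8)
status <- c(1, 0, 1, 1, 0, 1)

# Distinct times at which at least one event happened
event_times <- sort(unique(months[status == 1]))

# Customers still under observation at each event time;
# censored customers are counted here until they leave
n_risk <- sapply(event_times, function(tt) sum(months >= tt))

# Events at each event time; censored customers never count here
n_event <- sapply(event_times, function(tt) sum(months == tt & status == 1))

# Kaplan-Meier: running product of the conditional survival proportions
km_by_hand <- cumprod(1 - n_event / n_risk)

# The same estimate from the survival package
km_fit <- survfit(Surv(months, status) ~ 1)
km_survfit <- summary(km_fit, times = event_times)$surv

data.frame(event_times, n_risk, n_event, km_by_hand, km_survfit)
```

The customer censored at month 3 still contributes to `n_risk` at months 2 and 3, which is the information that would be lost by discarding them.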

Show me some code!

Customer churn

wa_churn

#> # A tibble: 7,032 × 18
#>   tenure churn female senior_citizen partner dependents phone_service
#>    <int> <fct>  <dbl>          <int>   <dbl>      <dbl>         <dbl>
#> 1      1 No         1              0       1          0             0
#> 2     34 No         0              0       0          0             1
#> 3      2 Yes        0              0       0          0             1
#> 4     45 No         0              0       0          0             0
#> 5      2 Yes        1              0       0          0             1
#> 6      8 Yes        1              0       0          0             1
#> # ℹ 7,026 more rows
#> # ℹ 11 more variables: multiple_lines <chr>, internet_service <fct>,
#> #   online_security <chr>, online_backup <chr>,
#> #   device_protection <chr>, tech_support <chr>, streaming_tv <chr>,
#> #   streaming_movies <chr>, paperless_billing <dbl>,
#> #   payment_method <fct>, monthly_charges <dbl>

Customer churn

library(tidymodels)
library(censored)

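# Combine time (tenure) and event status (churn) into a single Surv outcome;
# .keep = "unused" drops the two source columns afterwards
telco_churn <- wa_churn %>%
  mutate(
    churn_surv = Surv(tenure, if_else(churn == "Yes", 1, 0)),
    .keep = "unused"
  )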
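To see what a `Surv` outcome holds, here is a minimal sketch using three made-up tenure and churn values rather than the `wa_churn` data; the toy vectors are assumptions for illustration only.

```r
library(survival)

# Hypothetical toy values: tenure in months and churn status
tenure <- c(1, 34, 2)
churn <- c("No", "No", "Yes")

churn_surv_toy <- Surv(tenure, ifelse(churn == "Yes", 1, 0))

# Censored times (customers who have not churned yet) carry a trailing "+"
as.character(churn_surv_toy)

# Underneath, the object stores both aspects of the outcome side by side
churn_surv_toy[, "time"]
churn_surv_toy[, "status"]
```

Keeping both pieces in one column lets the outcome go on the left-hand side of a formula as a single variable.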

Split the data

set.seed(403)
telco_split <- initial_split(telco_churn)

telco_train <- training(telco_split)
telco_test <- testing(telco_split)

A single model

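# Preprocessing: remove predictors with zero variance
telco_rec <- recipe(churn_surv ~ ., data = telco_train) %>% 
  step_zv(all_predictors()) 

# Model: proportional hazards (Cox) model fitted with the survival package
telco_spec <- proportional_hazards() %>%
  set_mode("censored regression") %>%
  set_engine("survival")

# Bundle preprocessing and model into a workflow, then fit on the training set
telco_wflow <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(telco_spec)

telco_fit <- fit(telco_wflow, data = telco_train)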

How long is somebody going to stay as a customer?

predict(telco_fit, new_data = telco_train[1:5, ], type = "time")
#> # A tibble: 5 × 1
#>   .pred_time
#>        <dbl>
#> 1      61.1 
#> 2      51.9 
#> 3      36.2 
#> 4       6.85
#> 5      53.1

Who is likely to stop being a customer?

pred_survival <- predict(telco_fit, new_data = telco_train[1:5, ], 
                         type = "survival", eval_time = 1:24)

pred_survival
#> # A tibble: 5 × 1
#>   .pred            
#>   <list>           
#> 1 <tibble [24 × 2]>
#> 2 <tibble [24 × 2]>
#> 3 <tibble [24 × 2]>
#> 4 <tibble [24 × 2]>
#> 5 <tibble [24 × 2]>

Who is likely to stop being a customer?

pred_survival$.pred[[1]]
#> # A tibble: 24 × 2
#>    .eval_time .pred_survival
#>         <dbl>          <dbl>
#>  1          1          0.982
#>  2          2          0.975
#>  3          3          0.969
#>  4          4          0.963
#>  5          5          0.959
#>  6          6          0.956
#>  7          7          0.952
#>  8          8          0.949
#>  9          9          0.945
#> 10         10          0.941
#> # ℹ 14 more rows
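The nested predictions can be flattened to rank customers by churn risk at a given time point. The sketch below uses a hand-built stand-in with the same structure as `pred_survival` and made-up probabilities, so it runs without the fitted model; `pred_survival_toy`, `customer`, and `.pred_churn` are hypothetical names introduced for illustration.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical stand-in for `pred_survival`: one row per customer, each with
# a tibble of evaluation times and survival probabilities (made-up values)
pred_survival_toy <- tibble(
  .pred = list(
    tibble(.eval_time = c(6, 12, 24), .pred_survival = c(0.96, 0.93, 0.88)),
    tibble(.eval_time = c(6, 12, 24), .pred_survival = c(0.70, 0.52, 0.31))
  )
)

churn_risk <- pred_survival_toy %>%
  mutate(customer = row_number()) %>%        # keep track of who is who
  unnest(cols = .pred) %>%                   # one row per customer and time
  filter(.eval_time == 12) %>%               # look at the 12-month mark
  mutate(.pred_churn = 1 - .pred_survival) %>%
  arrange(desc(.pred_churn))                 # highest risk first

churn_risk
```

The same pipeline should apply to the real `pred_survival` object, since it only relies on the `.pred` list column and the `.eval_time` and `.pred_survival` columns shown above.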