tidymodels for time-to-event data

posit::conf 2024

Hannah Frick, Posit

How long is somebody going to stay as a customer?

What if we just use the time?

That time is observation time, not time to event.

What we actually have

If we assume that’s time-to-event, we assume everything is an event.

… discard the censored observations?
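A small simulation, not part of the original slides, can make the cost of both shortcuts concrete. The sketch below assumes exponentially distributed event times with a mean of 20 months and uniformly distributed censoring times; all names and parameter values are hypothetical.

```r
set.seed(2024)
n <- 5000

# Hypothetical ground truth: time until the customer actually leaves
true_time <- rexp(n, rate = 1 / 20)
# Hypothetical observation window: how long we get to watch each customer
censor_time <- runif(n, min = 0, max = 40)

# What we actually have: the shorter of the two, plus an event indicator
observed_time <- pmin(true_time, censor_time)
event <- true_time <= censor_time

# Shortcut 1: treat every observed time as a time-to-event
mean_all_as_events <- mean(observed_time)
# Shortcut 2: discard the censored observations
mean_events_only <- mean(observed_time[event])

c(truth = mean(true_time),
  all_as_events = mean_all_as_events,
  events_only = mean_events_only)
```

Under these assumptions both shortcuts underestimate the mean time-to-event: observed times are never longer than the true ones, and customers with short event times are more likely to have their event observed.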

Who is likely to stop being a customer?

What if we just use the event status?

Who is likely to stop being a customer while we observe them?

Our challenge

  • Our outcome has two aspects: time and event status.

  • Our outcome may be censored: incomplete data is not missing data.

Regression and classification are not directly equipped to deal with either challenge.

Survival analysis to the rescue

Survival analysis is unique because it simultaneously considers if events happened (i.e. a binary outcome) and when events happened (e.g. a continuous outcome). [1]

Let’s try time windows

Probability over time

Two central ideas of survival analysis

  • Model the survival curve (or derivatives) to capture time and event status.

  • Censored observations are partially included, rather than discarded.
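The second idea can be illustrated with the Kaplan-Meier estimator, one standard way to estimate a survival curve (the slides do not name a specific estimator at this point). In the sketch below, built on made-up toy data, a censored observation counts toward the number at risk up to its censoring time and then drops out without ever being counted as an event.

```r
library(survival)

# Toy data (hypothetical): months observed and whether churn was observed
toy <- data.frame(
  months  = c(2, 3, 3, 5, 8, 9),
  churned = c(1, 0, 1, 1, 0, 1)  # 1 = event, 0 = censored
)

# Kaplan-Meier by hand: at each event time, multiply by the share of
# customers still at risk who did not have the event
event_times <- sort(unique(toy$months[toy$churned == 1]))
n_at_risk <- sapply(event_times, function(tt) sum(toy$months >= tt))
n_events <- sapply(
  event_times,
  function(tt) sum(toy$months == tt & toy$churned == 1)
)
km_by_hand <- cumprod(1 - n_events / n_at_risk)

# The same curve from the survival package
km_fit <- survfit(Surv(months, churned) ~ 1, data = toy)
km_pkg <- summary(km_fit)$surv

data.frame(event_times, n_at_risk, n_events, km_by_hand, km_pkg)
```

The customer censored at month 3 still counts among the five at risk at month 3, and the one censored at month 8 among the three at risk at month 5. Discarding them would shrink those denominators and make survival look worse than the data supports.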

Show me some code!

Customer churn

#> # A tibble: 7,032 × 18
#>   tenure churn female senior_citizen partner dependents phone_service
#>    <int> <fct>  <dbl>          <int>   <dbl>      <dbl>         <dbl>
#> 1      1 No         1              0       1          0             0
#> 2     34 No         0              0       0          0             1
#> 3      2 Yes        0              0       0          0             1
#> 4     45 No         0              0       0          0             0
#> 5      2 Yes        1              0       0          0             1
#> 6      8 Yes        1              0       0          0             1
#> # ℹ 7,026 more rows
#> # ℹ 11 more variables: multiple_lines <chr>, internet_service <fct>,
#> #   online_security <chr>, online_backup <chr>,
#> #   device_protection <chr>, tech_support <chr>, streaming_tv <chr>,
#> #   streaming_movies <chr>, paperless_billing <dbl>,
#> #   payment_method <fct>, monthly_charges <dbl>

# Assumes library(tidymodels) and library(censored) are attached
telco_churn <- wa_churn %>% 
  mutate(
    churn_surv = Surv(tenure, if_else(churn == "Yes", 1, 0)),
    .keep = "unused"
  )

Split the data

telco_split <- initial_split(telco_churn)

telco_train <- training(telco_split)
telco_test <- testing(telco_split)

A single model

telco_rec <- recipe(churn_surv ~ ., data = telco_train)

# No engine is set here, so parsnip uses the default engine for this model
telco_spec <- proportional_hazards() %>%
  set_mode("censored regression")

telco_wflow <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(telco_spec)

telco_fit <- fit(telco_wflow, data = telco_train)

predict(telco_fit, new_data = telco_train[1:5, ], type = "time")
#> # A tibble: 5 × 1
#>   .pred_time
#>        <dbl>
#> 1      61.1 
#> 2      51.9 
#> 3      36.2 
#> 4       6.85
#> 5      53.1

pred_survival <- predict(telco_fit, new_data = telco_train[1:5, ], 
                         type = "survival", eval_time = 1:24)
pred_survival

#> # A tibble: 5 × 1
#>   .pred            
#>   <list>           
#> 1 <tibble [24 × 2]>
#> 2 <tibble [24 × 2]>
#> 3 <tibble [24 × 2]>
#> 4 <tibble [24 × 2]>
#> 5 <tibble [24 × 2]>
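Each element of the `.pred` list column is a tibble with one row per evaluation time. One way to work with all rows at once is to unnest the column. The sketch below uses a small stand-in tibble with made-up values rather than the real `pred_survival`, and the `.row` identifier is a hypothetical name added for illustration.

```r
library(tibble)
library(tidyr)

# Stand-in (made-up values) with the same shape as a
# type = "survival" prediction: one nested tibble per customer
pred_like <- tibble(
  .pred = list(
    tibble(.eval_time = 1:3, .pred_survival = c(0.98, 0.97, 0.96)),
    tibble(.eval_time = 1:3, .pred_survival = c(0.90, 0.85, 0.80))
  )
)

# Keep track of which customer each curve belongs to, then flatten
pred_with_id <- rowid_to_column(pred_like, var = ".row")
pred_long <- unnest(pred_with_id, cols = .pred)
pred_long
```

The long format has one row per customer and evaluation time, which is usually easier to filter, summarise, or plot than the nested version.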

Who is likely to stop being a customer?

#> # A tibble: 24 × 2
#>    .eval_time .pred_survival
#>         <dbl>          <dbl>
#>  1          1          0.982
#>  2          2          0.975
#>  3          3          0.969
#>  4          4          0.963
#>  5          5          0.959
#>  6          6          0.956
#>  7          7          0.952
#>  8          8          0.949
#>  9          9          0.945
#> 10         10          0.941
#> # ℹ 14 more rows

Individual survival curves
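With predictions in long format (one row per customer and evaluation time), individual curves can be drawn as step functions. This is a sketch with ggplot2 and made-up values, not the plot from the talk; `curves`, `customer`, and `monthly_hazard` are hypothetical names.

```r
library(ggplot2)

# Made-up long-format predictions for two customers over 24 months
curves <- expand.grid(.eval_time = 1:24, customer = c("A", "B"))
monthly_hazard <- c(A = 0.005, B = 0.04)  # hypothetical constant hazards
curves$.pred_survival <- unname(exp(
  -monthly_hazard[as.character(curves$customer)] * curves$.eval_time
))

# One step curve per customer, on a fixed 0-1 probability scale
p <- ggplot(curves, aes(.eval_time, .pred_survival, colour = customer)) +
  geom_step() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(
    x = "Months as a customer",
    y = "Probability of still being a customer"
  )
p
```

Reading the curves side by side answers the earlier question directly: a customer whose curve drops faster is more likely to stop being a customer by any given month.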

tidymodels for time-to-event data

  • Models: parametric, semi-parametric, and tree-based
  • Predictions: survival time, survival probability, hazard, and linear predictor
  • Metrics: