posit::conf 2024
That time is observation time, not time to event.
If we assume that’s time-to-event, we assume everything is an event.
Who is likely to stop being a customer while we observe them?
Our outcome has two aspects: time and event status.
Our outcome may be censored: incomplete data is not missing data.
Regression and classification are not directly equipped to deal with either challenge.
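A minimal sketch of an outcome with both aspects, using `Surv()` from the survival package. The three customers and the names `observed_months` and `churned` are made up for illustration.

```r
library(survival)

# Hypothetical customers: months observed, and whether churn was seen.
observed_months <- c(1, 34, 2)
churned <- c(FALSE, FALSE, TRUE)

# Surv() bundles time and event status into a single outcome object.
outcome <- Surv(time = observed_months, event = churned)

# Censored observations print with a trailing "+": the customer stayed
# at least that long, but we did not see when (or whether) they churned.
print(outcome)

# The status column is 1 for an observed event and 0 for censoring.
status <- unclass(outcome)[, "status"]
```

A plain regression on `observed_months` would treat all three values as event times, and a plain classification on `churned` would ignore how long each customer was observed.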
Survival analysis is unique because it simultaneously considers if events happened (i.e. a binary outcome) and when events happened (e.g. a continuous outcome).¹
Model the survival curve (or derivatives) to capture time and event status.
Censored observations are partially included, rather than discarded.
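One way to see what "partially included" means is a Kaplan-Meier estimate of the survival curve, sketched here on made-up data. The Kaplan-Meier estimator is a choice for this illustration, not necessarily the model used later; the data frame `toy` is hypothetical.

```r
library(survival)

# Made-up data: five customers, observation time in months,
# event = 1 if churn was observed, 0 if still a customer (censored).
toy <- data.frame(
  time  = c(2, 3, 5, 7, 8),
  event = c(1, 0, 1, 0, 1)
)

# Kaplan-Meier keeps censored customers in the risk set until the time
# they were last observed.
km_all <- survfit(Surv(time, event) ~ 1, data = toy)

# Discarding censored customers leaves only the three observed churners.
km_events_only <- survfit(Surv(time, event) ~ 1,
                          data = subset(toy, event == 1))

# Estimated probability of still being a customer after 5 months.
s_all <- summary(km_all, times = 5)$surv
s_events_only <- summary(km_events_only, times = 5)$surv
```

With all five customers the estimate at 5 months is 4/5 × 2/3 ≈ 0.53; with the censored customers discarded it drops to 1/3, because the customers known to have stayed are no longer counted.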
#> # A tibble: 7,032 × 18
#>   tenure churn female senior_citizen partner dependents phone_service
#>    <int> <fct>  <dbl>          <int>   <dbl>      <dbl>         <dbl>
#> 1      1 No         1              0       1          0             0
#> 2     34 No         0              0       0          0             1
#> 3      2 Yes        0              0       0          0             1
#> 4     45 No         0              0       0          0             0
#> 5      2 Yes        1              0       0          0             1
#> 6      8 Yes        1              0       0          0             1
#> # ℹ 7,026 more rows
#> # ℹ 11 more variables: multiple_lines <chr>, internet_service <fct>,
#> #   online_security <chr>, online_backup <chr>,
#> #   device_protection <chr>, tech_support <chr>, streaming_tv <chr>,
#> #   streaming_movies <chr>, paperless_billing <dbl>,
#> #   payment_method <fct>, monthly_charges <dbl>
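The printed data has separate `tenure` and `churn` columns, while the modeling code uses a single outcome named `churn_surv`. A sketch of how such a column could be built, assuming `tenure` is the observation time and `churn == "Yes"` marks an observed event; `telco_small` is a hypothetical stand-in for the first rows shown above.

```r
library(dplyr)
library(survival)

# Hypothetical stand-in for the first six rows of the printed data.
telco_small <- tibble(
  tenure = c(1L, 34L, 2L, 45L, 2L, 8L),
  churn  = factor(c("No", "No", "Yes", "No", "Yes", "Yes"))
)

# Combine time and event status into one Surv column, then drop the
# two source columns so they cannot leak in as predictors.
telco_small <- telco_small %>%
  mutate(
    churn_surv = Surv(tenure, if_else(churn == "Yes", 1, 0)),
    .keep = "unused"
  )
```

Dropping `tenure` and `churn` matters when the model formula is `churn_surv ~ .`, since the dot would otherwise pick both up as predictors.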
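library(tidymodels)
library(censored)  # provides the "survival" engine for proportional_hazards()

# churn_surv is assumed to be a Surv() outcome column in telco_train
telco_rec <- recipe(churn_surv ~ ., data = telco_train) %>%
  step_zv(all_predictors())  # drop zero-variance predictors

# Proportional hazards model for a censored outcome
telco_spec <- proportional_hazards() %>%
  set_mode("censored regression") %>%
  set_engine("survival")

telco_wflow <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(telco_spec)

telco_fit <- fit(telco_wflow, data = telco_train)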
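A fitted censored regression model can predict both aspects of the outcome. The telco data is not reproduced here, so this sketch uses the `lung` data from the survival package as a stand-in; the predictors `age` and `sex` and the evaluation times are arbitrary choices for illustration.

```r
library(tidymodels)
library(censored)
library(survival)

# Stand-in data: `lung` ships with the survival package.
lung_fit <- proportional_hazards() %>%
  set_mode("censored regression") %>%
  set_engine("survival") %>%
  fit(Surv(time, status) ~ age + sex, data = lung)

new_patients <- lung[1:3, ]

# Probability of survival at chosen evaluation times (here, in days).
pred_surv <- predict(lung_fit, new_data = new_patients,
                     type = "survival", eval_time = c(100, 200))

# .pred is a list column; unnesting gives one row per observation
# and evaluation time.
pred_long <- tidyr::unnest(pred_surv, cols = .pred)

# Predicted survival time for each observation.
pred_time <- predict(lung_fit, new_data = new_patients, type = "time")
```

The same `predict()` calls should apply to a fitted workflow such as `telco_fit`, with `new_data` holding the predictor columns the recipe expects.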
tidymodels for time-to-event data