SatRdays London 2024
tidymodels is a framework for modelling and
machine learning using tidyverse principles.
Focus on the modelling question,
not the syntax.
Focus on the modelling question,
not the infrastructure for
empirical validation.
resampling, preprocessing, models, metrics, tuning strategies
#> # A tibble: 7,032 × 18
#> tenure churn female senior_citizen partner dependents phone_service
#> <int> <fct> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 1 No 1 0 1 0 0
#> 2 34 No 0 0 0 0 1
#> 3 2 Yes 0 0 0 0 1
#> 4 45 No 0 0 0 0 0
#> 5 2 Yes 1 0 0 0 1
#> 6 8 Yes 1 0 0 0 1
#> # ℹ 7,026 more rows
#> # ℹ 11 more variables: multiple_lines <chr>, internet_service <fct>,
#> # online_security <chr>, online_backup <chr>,
#> # device_protection <chr>, tech_support <chr>, streaming_tv <chr>,
#> # streaming_movies <chr>, paperless_billing <dbl>,
#> # payment_method <fct>, monthly_charges <dbl>
Let’s try to predict:
The tenure we see is observation time, not time to event.
If we treat it as time-to-event, we assume every customer has already churned.
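A minimal sketch of why this matters, using only the six customers visible in the tibble preview and the survival package. The object names (telco_head, naive, aware) are made up for illustration, and six rows are far too few for a real estimate; the point is only the direction of the difference.

```r
library(survival)

# The six customers shown in the tibble preview (tenure in months)
telco_head <- data.frame(
  tenure = c(1, 34, 2, 45, 2, 8),
  churn  = c("No", "No", "Yes", "No", "Yes", "Yes")
)

# Naive: pretend every customer churned at the end of their tenure
naive <- survfit(Surv(tenure, rep(1, 6)) ~ 1, data = telco_head)

# Censoring-aware: only churn == "Yes" counts as an event,
# everyone else is still a customer when observation stops
aware <- survfit(Surv(tenure, churn == "Yes") ~ 1, data = telco_head)

# Kaplan-Meier estimate of still being a customer after 10 months
surv_naive <- summary(naive, times = 10)$surv
surv_aware <- summary(aware, times = 10)$surv
c(naive = surv_naive, aware = surv_aware)
```

On these six rows the naive estimate at 10 months is 1/3 and the censoring-aware estimate is 0.4: treating censored customers as churned makes retention look worse than the data supports.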
If we use regression to model time-to-event data, we might
Who is likely to stop being a customer while we observe them?
If we use classification to model (time-to-)event data, we
ignore the (possibly wildly) different observation lengths.
Survival analysis is unique because it simultaneously considers if events happened (i.e. a binary outcome) and when events happened (e.g. a continuous outcome).1
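The "if" and the "when" can be carried together in a single Surv object. The sketch below shows one plausible way to build an outcome column named churn_surv, as used in the recipe formula that follows; the construction is an assumption, since that step is not shown here. It uses the six preview rows, and telco_head is a hypothetical name.

```r
library(dplyr)
library(survival)

# Six preview rows, restricted to columns visible in the tibble output
telco_head <- tibble(
  tenure = c(1L, 34L, 2L, 45L, 2L, 8L),
  churn  = factor(c("No", "No", "Yes", "No", "Yes", "Yes")),
  female = c(1, 0, 0, 0, 1, 1)
)

# Combine time (tenure) and event status (churn) into one outcome.
# .keep = "unused" drops tenure and churn so they cannot leak in as predictors.
telco_head <- telco_head %>%
  mutate(
    churn_surv = Surv(tenure, if_else(churn == "Yes", 1, 0)),
    .keep = "unused"
  )

telco_head
```

Censored observations (customers who had not churned when observation ended) print with a trailing "+" in the churn_surv column.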
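library(tidymodels)
library(censored)  # parsnip engines for censored regression

# Preprocessing: drop zero-variance predictors
telco_rec <- recipe(churn_surv ~ ., data = telco_train) %>%
  step_zv(all_predictors())
# Model: parametric survival regression fitted by the survival package
telco_spec <- survival_reg() %>%
  set_mode("censored regression") %>%
  set_engine("survival")
# Bundle preprocessing and model, then fit on the training data
telco_wflow <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(telco_spec)
telco_fit <- fit(telco_wflow, data = telco_train)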
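Once a workflow like this is fitted, predictions can be requested as a survival time or as survival probabilities at chosen time points. The sketch below is self-contained, so it runs on simulated data rather than telco_train; every name in it (toy, toy_spec, toy_fit, new_customers) and the simulated relationship are invented for illustration. It assumes a parsnip version that accepts the eval_time argument.

```r
library(tidymodels)
library(censored)
library(survival)

# Simulated toy data: higher charges shorten tenure; about 60% observed events
set.seed(2024)
n <- 200
toy <- tibble(
  monthly_charges = runif(n, min = 20, max = 120),
  tenure  = ceiling(rexp(n, rate = monthly_charges / 2000)),
  churned = rbinom(n, size = 1, prob = 0.6)
) %>%
  mutate(churn_surv = Surv(tenure, churned), .keep = "unused")

# Same model specification as in the workflow above
toy_spec <- survival_reg() %>%
  set_mode("censored regression") %>%
  set_engine("survival")

toy_fit <- workflow() %>%
  add_formula(churn_surv ~ monthly_charges) %>%
  add_model(toy_spec) %>%
  fit(data = toy)

new_customers <- tibble(monthly_charges = c(30, 70, 110))

# Predicted survival time: one row per customer
pred_time <- predict(toy_fit, new_data = new_customers, type = "time")

# Probability of still being a customer at 12 and 24 months;
# .pred is a list column holding one tibble per customer
pred_surv <- predict(
  toy_fit,
  new_data = new_customers,
  type = "survival",
  eval_time = c(12, 24)
)
tidyr::unnest(pred_surv, .pred)
```

With the real data, the same predict() calls would take telco_fit and a data frame of held-out customers in place of toy_fit and new_customers.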