👋 Who am I?

👋 Who are we?

👋 Who are you?


  • What is your role?
  • What kind of problems do you work on?

Roadmap

  • What is tidymodels?
  • Why tidymodels?
  • Applied example 📞
  • Resources

github.com/hfrick/2024-tidymodels-emdserono

What is tidymodels?


The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.

- tidymodels.org


…so what is modeling and machine learning?

BYO Venn Diagram



The tidymodels framework is a collection of packages for safe, performant, and expressive supervised predictive modeling on tabular data.


🥴





Think about the modeling problem, not the syntax.

Why tidymodels?

Why tidymodels?  Consistency

How many different ways can you think of to fit a linear model in R?

The blessing:

  • Many statistical modeling practitioners implement methods in R

The curse:

  • Many statistical modeling practitioners implement methods in R

Why tidymodels?  Consistency

mtcars
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5
#> Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4
#>                     carb
#> Mazda RX4              4
#> Mazda RX4 Wag          4
#> Datsun 710             1
#> Hornet 4 Drive         1
#> Hornet Sportabout      2
#> Valiant                1
#> Duster 360             4
#> Merc 240D              2
#> Merc 230               2
#> Merc 280               4
#> Merc 280C              4
#> Merc 450SE             3
#> Merc 450SL             3
#> Merc 450SLC            3
#> Cadillac Fleetwood     4
#> Lincoln Continental    4
#> Chrysler Imperial      4
#> Fiat 128               1
#> Honda Civic            2
#> Toyota Corolla         1
#> Toyota Corona          1
#> Dodge Challenger       2
#> AMC Javelin            2
#> Camaro Z28             4
#> Pontiac Firebird       2
#> Fiat X1-9              1
#> Porsche 914-2          2
#> Lotus Europa           2
#> Ford Pantera L         4
#> Ferrari Dino           6
#> Maserati Bora          8
#> Volvo 142E             2

Why tidymodels?  Consistency

With lm():

model <- 
  lm(mpg ~ ., mtcars)

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Consistency

With glmnet:

model <- 
  glmnet(
    as.matrix(mtcars[2:11]),
    mtcars$mpg
  )

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Consistency

With h2o:

h2o.init()
as.h2o(mtcars, "cars")

model <- 
  h2o.glm(
    x = colnames(mtcars[2:11]), 
    y = "mpg",
    "cars"
  )

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("h2o") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Safety

  • Overfitting leads analysts to believe models are more performant than they actually are.
  • A 2023 review found data leakage to be “a widespread failure mode in machine-learning (ML)-based science.”
  • Different implementations of the same machine learning model can give differing results, making modeling results irreproducible.
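
The tidymodels guard against the first two failure modes is resampling: performance is always estimated on data the model never saw during fitting. A minimal sketch with the rsample package (using the built-in mtcars data for illustration):

```r
library(rsample)

set.seed(123)
# Hold out a test set before any modeling decisions are made
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)

# Estimate performance with 10-fold cross-validation on the training
# set only, so preprocessing and tuning never touch held-out rows
car_folds <- vfold_cv(car_train, v = 10)
```

Because every tuning function in the framework consumes these resampling objects, leaking the test set into model selection requires going out of your way.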

Why tidymodels?  Completeness


Built-in support for 99 machine learning models!

#> # A tibble: 99 × 2
#>    name       engine   
#>    <chr>      <chr>    
#>  1 boost_tree C5.0     
#>  2 boost_tree h2o      
#>  3 boost_tree h2o_gbm  
#>  4 boost_tree lightgbm 
#>  5 boost_tree mboost   
#>  6 boost_tree spark    
#>  7 boost_tree xgboost  
#>  8 null_model parsnip  
#>  9 svm_linear LiblineaR
#> 10 svm_linear kernlab  
#> # ℹ 89 more rows

Why tidymodels?  Completeness

Built-in support for 102 data pre-processing techniques!

#> # A tibble: 102 × 1
#>    name               
#>    <chr>              
#>  1 step_rename_at     
#>  2 step_scale         
#>  3 step_kpca          
#>  4 step_percentile    
#>  5 step_depth         
#>  6 step_poly_bernstein
#>  7 step_impute_linear 
#>  8 step_novel         
#>  9 step_nnmf_sparse   
#> 10 step_slice         
#> # ℹ 92 more rows

Why tidymodels?  Extensibility

Can’t find the technique you need?

Applied example

Coming to tidymodels: Survival analysis

  • For time-to-event data with censoring
  • Release cascade underway!
  • Dedicated models and metrics
  • General framework goodies unlocked 🎉

Try it out yourself


Install the release version of tidymodels:

pak::pak("tidymodels")


Install the development versions of tune, finetune, and workflowsets:

pak::pak(paste0("tidymodels/", c("tune", "finetune", "workflowsets")))

Customer churn

wa_churn
#> # A tibble: 7,032 × 18
#>   churn female senior_citizen partner dependents tenure phone_service
#>   <fct>  <dbl>          <int>   <dbl>      <dbl>  <int>         <dbl>
#> 1 No         1              0       1          0      1             0
#> 2 No         0              0       0          0     34             1
#> 3 Yes        0              0       0          0      2             1
#> 4 No         0              0       0          0     45             0
#> 5 Yes        1              0       0          0      2             1
#> 6 Yes        1              0       0          0      8             1
#> # ℹ 7,026 more rows
#> # ℹ 11 more variables: multiple_lines <chr>, internet_service <fct>,
#> #   online_security <chr>, online_backup <chr>,
#> #   device_protection <chr>, tech_support <chr>, streaming_tv <chr>,
#> #   streaming_movies <chr>, paperless_billing <dbl>,
#> #   payment_method <fct>, monthly_charges <dbl>

See /example/churn.R for the actual code to generate this data!

Customer churn

Around 26.6% of customers have churned.
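
That proportion comes straight from the outcome column (assuming wa_churn is loaded as shown on the previous slide):

```r
# Share of customers with churn == "Yes"
mean(wa_churn$churn == "Yes")
```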

Customer churn

telco_churn <- wa_churn %>% 
  mutate(
    churn_surv = Surv(tenure, if_else(churn == "Yes", 1, 0)),
    .keep = "unused"
  )

Split the data

set.seed(403)
telco_split <- initial_split(telco_churn)
telco_train <- training(telco_split)
telco_test <- testing(telco_split)

telco_rs <- vfold_cv(telco_train)

Customer churn

Customer churn

Preprocessing

telco_rec <- recipe(churn_surv ~ ., data = telco_train) %>% 
  step_zv(all_predictors()) 

telco_rec_streaming <- telco_rec %>%
  step_mutate(
    streaming = factor(if_else(streaming_tv == "Yes" | 
                                 streaming_movies == "Yes", "Yes", "No"))
  ) %>% 
  step_rm(streaming_tv, streaming_movies)

Baseline model

spec_surv_reg <- survival_reg()

wflow_surv_reg <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(spec_surv_reg)

Baseline model

set.seed(12)
sr_rs_fit <- fit_resamples(
  wflow_surv_reg, 
  telco_rs, 
  metrics = metric_set(brier_survival_integrated,
                       brier_survival,
                       roc_auc_survival,
                       concordance_survival), 
  eval_time = c(1, 6, 12, 18, 24, 36, 48, 60)
)

Baseline model

collect_metrics(sr_rs_fit)
#> # A tibble: 18 × 7
#>    .metric          .estimator .eval_time   mean     n std_err .config
#>    <chr>            <chr>           <dbl>  <dbl> <int>   <dbl> <chr>  
#>  1 brier_survival   standard            1 0.0470    10 0.00303 Prepro…
#>  2 roc_auc_survival standard            1 0.842     10 0.0102  Prepro…
#>  3 brier_survival   standard            6 0.0792    10 0.00224 Prepro…
#>  4 roc_auc_survival standard            6 0.853     10 0.00517 Prepro…
#>  5 brier_survival   standard           12 0.0930    10 0.00206 Prepro…
#>  6 roc_auc_survival standard           12 0.864     10 0.00567 Prepro…
#>  7 brier_survival   standard           18 0.0996    10 0.00187 Prepro…
#>  8 roc_auc_survival standard           18 0.878     10 0.00525 Prepro…
#>  9 brier_survival   standard           24 0.105     10 0.00197 Prepro…
#> 10 roc_auc_survival standard           24 0.884     10 0.00484 Prepro…
#> 11 brier_survival   standard           36 0.107     10 0.00305 Prepro…
#> 12 roc_auc_survival standard           36 0.903     10 0.00546 Prepro…
#> 13 brier_survival   standard           48 0.109     10 0.00341 Prepro…
#> 14 roc_auc_survival standard           48 0.916     10 0.00554 Prepro…
#> 15 brier_survival   standard           60 0.103     10 0.00345 Prepro…
#> 16 roc_auc_survival standard           60 0.946     10 0.00452 Prepro…
#> 17 brier_survival_… standard           NA 0.0979    10 0.00173 Prepro…
#> 18 concordance_sur… standard           NA 0.826     10 0.00440 Prepro…

show_best(sr_rs_fit, metric = "brier_survival_integrated")
#> # A tibble: 1 × 7
#>   .metric           .estimator .eval_time   mean     n std_err .config
#>   <chr>             <chr>           <dbl>  <dbl> <int>   <dbl> <chr>  
#> 1 brier_survival_i… standard           NA 0.0979    10 0.00173 Prepro…

Tune a model

spec_tree <- 
  decision_tree(
    tree_depth = tune(), 
    min_n = tune(),
    cost_complexity = tune()
  ) %>% 
  set_engine("rpart") %>%
  set_mode("censored regression")

wflow_tree <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(spec_tree)

Tune a model

set.seed(12) 
tree_res <- tune_grid(
  wflow_tree, 
  telco_rs, 
  grid = 10,
  metrics = metric_set(brier_survival_integrated, 
                       brier_survival,
                       roc_auc_survival, 
                       concordance_survival), 
  eval_time = c(1, 6, 12, 18, 24, 36, 48, 60)
)

Tune a model

show_best(tree_res, metric = "brier_survival_integrated")
#> # A tibble: 5 × 10
#>   cost_complexity tree_depth min_n .metric .estimator .eval_time  mean
#>             <dbl>      <int> <int> <chr>   <chr>           <dbl> <dbl>
#> 1        2.06e- 5          7     4 brier_… standard           NA 0.110
#> 2        1.15e- 6         13    37 brier_… standard           NA 0.111
#> 3        1.95e- 7          6    18 brier_… standard           NA 0.112
#> 4        2.73e- 9         15    32 brier_… standard           NA 0.113
#> 5        3.86e-10         12    22 brier_… standard           NA 0.114
#> # ℹ 3 more variables: n <int>, std_err <dbl>, .config <chr>

Finalize a model

best_param <- select_best(tree_res, metric = "brier_survival_integrated")
wflow_tree <- finalize_workflow(wflow_tree, best_param)

churn_mod <- fit(wflow_tree, telco_train)

predict(churn_mod, telco_test, type = "time")
#> # A tibble: 1,758 × 1
#>    .pred_time
#>         <dbl>
#>  1     0.211 
#>  2     1.95  
#>  3     0.0317
#>  4     0.211 
#>  5     0.325 
#>  6     5.37  
#>  7     0.349 
#>  8     3.27  
#>  9     1.84  
#> 10     5.37  
#> # ℹ 1,748 more rows

workflowsets - The kitchen sink

Preprocessors:

  • formula
  • combine streaming indicators
  • center and scale predictors
  • PCA + center and scale predictors

…and models:

  • parametric survival regression
  • proportional hazards model
  • decision tree
  • random forest
  • bagged decision tree
  • boosted tree
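
A hedged sketch of how such a grid can be declared with workflowsets; only telco_rec, telco_rec_streaming, spec_surv_reg, spec_tree, and telco_rs appear earlier in these slides, and the remaining recipes and specs from the lists above would be added analogously:

```r
library(workflowsets)

# Cross every preprocessor with every model specification
telco_set <- workflow_set(
  preproc = list(
    plain     = telco_rec,
    streaming = telco_rec_streaming
  ),
  models = list(
    surv_reg = spec_surv_reg,
    tree     = spec_tree
  )
)

# Fit or tune all combinations with a single call;
# extra arguments are passed through to tune_grid()
telco_results <- workflow_map(
  telco_set,
  fn = "tune_grid",
  resamples = telco_rs,
  grid = 10,
  metrics = metric_set(brier_survival_integrated),
  eval_time = c(1, 6, 12, 18, 24, 36, 48, 60)
)
```

rank_results() on the resulting object then compares all preprocessor-model combinations on one leaderboard.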

Resources

  • tidyverse: r4ds.hadley.nz
  • tidymodels: tmwr.org
  • Slides and code: github.com/hfrick/2024-tidymodels-emdserono

Thank you!