👋 Who am I?

👋 Who are we?

👋 Who are you?


  • What is your role?
  • What kind of problems do you work on?

Roadmap

  • What is tidymodels?
  • Why tidymodels?
  • Applied example 📞
  • Resources

github.com/hfrick/2024-tidymodels-emdserono

What is tidymodels?


The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.

- tidymodels.org


…so what is modeling and machine learning?

BYO Venn Diagram



The tidymodels framework is a collection of packages for safe, performant, and expressive supervised predictive modeling on tabular data.


🥴





Think about the modeling problem, not the syntax.

Why tidymodels?

Why tidymodels?  Consistency

How many different ways can you think of to fit a linear model in R?

The blessing:

  • Many statistical modeling practitioners implement methods in R

The curse:

  • Many statistical modeling practitioners implement methods in R

Why tidymodels?  Consistency

mtcars
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5
#> Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4
#>                     carb
#> Mazda RX4              4
#> Mazda RX4 Wag          4
#> Datsun 710             1
#> Hornet 4 Drive         1
#> Hornet Sportabout      2
#> Valiant                1
#> Duster 360             4
#> Merc 240D              2
#> Merc 230               2
#> Merc 280               4
#> Merc 280C              4
#> Merc 450SE             3
#> Merc 450SL             3
#> Merc 450SLC            3
#> Cadillac Fleetwood     4
#> Lincoln Continental    4
#> Chrysler Imperial      4
#> Fiat 128               1
#> Honda Civic            2
#> Toyota Corolla         1
#> Toyota Corona          1
#> Dodge Challenger       2
#> AMC Javelin            2
#> Camaro Z28             4
#> Pontiac Firebird       2
#> Fiat X1-9              1
#> Porsche 914-2          2
#> Lotus Europa           2
#> Ford Pantera L         4
#> Ferrari Dino           6
#> Maserati Bora          8
#> Volvo 142E             2

Why tidymodels?  Consistency

With lm():

model <- 
  lm(mpg ~ ., mtcars)

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Consistency

With glmnet:

model <- 
  glmnet(
    as.matrix(mtcars[2:11]),
    mtcars$mpg
  )

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Consistency

With h2o:

h2o.init()
as.h2o(mtcars, "cars")

model <- 
  h2o.glm(
    x = colnames(mtcars[2:11]), 
    y = "mpg",
    "cars"
  )

With tidymodels:

model <-
  linear_reg() %>%
  set_engine("h2o") %>%
  fit(mpg ~ ., mtcars)

Why tidymodels?  Safety

  • Overfitting leads analysts to believe models are more performant than they actually are.
  • A 2023 review found data leakage to be “a widespread failure mode in machine-learning (ML)-based science.”
  • Different implementations of the same machine learning model can give differing results, making modeling results irreproducible.
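
The tidymodels guard against the first two failure modes is resampling: performance is always estimated on data the model never saw during fitting. A minimal sketch with the rsample package (using the built-in mtcars data for illustration):

```r
library(rsample)

set.seed(123)
# Hold out a test set before any modeling decisions are made
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)

# Estimate performance with 10-fold cross-validation on the training
# set only, so preprocessing and tuning never touch held-out rows
car_folds <- vfold_cv(car_train, v = 10)
```

Because every tuning function in the framework consumes these resampling objects, leaking the test set into model selection requires going out of your way.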

Why tidymodels?  Completeness


Built-in support for 99 machine learning models!

#> # A tibble: 99 × 2
#>    name       engine   
#>    <chr>      <chr>    
#>  1 boost_tree C5.0     
#>  2 boost_tree h2o      
#>  3 boost_tree h2o_gbm  
#>  4 boost_tree lightgbm 
#>  5 boost_tree mboost   
#>  6 boost_tree spark    
#>  7 boost_tree xgboost  
#>  8 null_model parsnip  
#>  9 svm_linear LiblineaR
#> 10 svm_linear kernlab  
#> # ℹ 89 more rows

Why tidymodels?  Completeness

Built-in support for 102 data pre-processing techniques!

#> # A tibble: 102 × 1
#>    name               
#>    <chr>              
#>  1 step_rename_at     
#>  2 step_scale         
#>  3 step_kpca          
#>  4 step_percentile    
#>  5 step_depth         
#>  6 step_poly_bernstein
#>  7 step_impute_linear 
#>  8 step_novel         
#>  9 step_nnmf_sparse   
#> 10 step_slice         
#> # ℹ 92 more rows

Why tidymodels?  Extensibility

Can’t find the technique you need?

Applied example

Coming to tidymodels: Survival analysis

  • For time-to-event data with censoring
  • Release cascade underway!
  • Dedicated models and metrics
  • General framework goodies unlocked 🎉

Try it out yourself


Install the release version of tidymodels:

pak::pak("tidymodels")


Install the development versions of tune, finetune, and workflowsets:

pak::pak(paste0("tidymodels/", c("tune", "finetune", "workflowsets")))

Customer churn

wa_churn
#> # A tibble: 7,032 × 18
#>   churn female senior_citizen partner dependents tenure phone_service
#>   <fct>  <dbl>          <int>   <dbl>      <dbl>  <int>         <dbl>
#> 1 No         1              0       1          0      1             0
#> 2 No         0              0       0          0     34             1
#> 3 Yes        0              0       0          0      2             1
#> 4 No         0              0       0          0     45             0
#> 5 Yes        1              0       0          0      2             1
#> 6 Yes        1              0       0          0      8             1
#> # ℹ 7,026 more rows
#> # ℹ 11 more variables: multiple_lines <chr>, internet_service <fct>,
#> #   online_security <chr>, online_backup <chr>,
#> #   device_protection <chr>, tech_support <chr>, streaming_tv <chr>,
#> #   streaming_movies <chr>, paperless_billing <dbl>,
#> #   payment_method <fct>, monthly_charges <dbl>

See /example/churn.R for the actual code to generate this data!

Customer churn

Around 26.6% of customers have churned.
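
That proportion comes straight from the outcome column (assuming wa_churn is loaded as shown on the previous slide):

```r
# Share of customers with churn == "Yes"
mean(wa_churn$churn == "Yes")
```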

Customer churn

telco_churn <- wa_churn %>% 
  mutate(
    churn_surv = Surv(tenure, if_else(churn == "Yes", 1, 0)),
    .keep = "unused"
  )

Split the data

set.seed(403)
telco_split <- initial_split(telco_churn)
telco_train <- training(telco_split)
telco_test <- testing(telco_split)

telco_rs <- vfold_cv(telco_train)

Customer churn

Customer churn

Preprocessing

telco_rec <- recipe(churn_surv ~ ., data = telco_train) %>% 
  step_zv(all_predictors()) 

telco_rec_streaming <- telco_rec %>%
  step_mutate(
    streaming = factor(if_else(streaming_tv == "Yes" | 
                                 streaming_movies == "Yes", "Yes", "No"))
  ) %>% 
  step_rm(streaming_tv, streaming_movies)

Baseline model

spec_surv_reg <- survival_reg()

wflow_surv_reg <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(spec_surv_reg)

Baseline model

set.seed(12)
sr_rs_fit <- fit_resamples(
  wflow_surv_reg, 
  telco_rs, 
  metrics = metric_set(brier_survival_integrated,
                       brier_survival,
                       roc_auc_survival,
                       concordance_survival), 
  eval_time = c(1, 6, 12, 18, 24, 36, 48, 60)
)

Baseline model

collect_metrics(sr_rs_fit)
#> # A tibble: 18 × 7
#>    .metric          .estimator .eval_time   mean     n std_err .config
#>    <chr>            <chr>           <dbl>  <dbl> <int>   <dbl> <chr>  
#>  1 brier_survival   standard            1 0.0470    10 0.00303 Prepro…
#>  2 roc_auc_survival standard            1 0.842     10 0.0102  Prepro…
#>  3 brier_survival   standard            6 0.0792    10 0.00224 Prepro…
#>  4 roc_auc_survival standard            6 0.853     10 0.00517 Prepro…
#>  5 brier_survival   standard           12 0.0930    10 0.00206 Prepro…
#>  6 roc_auc_survival standard           12 0.864     10 0.00567 Prepro…
#>  7 brier_survival   standard           18 0.0996    10 0.00187 Prepro…
#>  8 roc_auc_survival standard           18 0.878     10 0.00525 Prepro…
#>  9 brier_survival   standard           24 0.105     10 0.00197 Prepro…
#> 10 roc_auc_survival standard           24 0.884     10 0.00484 Prepro…
#> 11 brier_survival   standard           36 0.107     10 0.00305 Prepro…
#> 12 roc_auc_survival standard           36 0.903     10 0.00546 Prepro…
#> 13 brier_survival   standard           48 0.109     10 0.00341 Prepro…
#> 14 roc_auc_survival standard           48 0.916     10 0.00554 Prepro…
#> 15 brier_survival   standard           60 0.103     10 0.00345 Prepro…
#> 16 roc_auc_survival standard           60 0.946     10 0.00452 Prepro…
#> 17 brier_survival_… standard           NA 0.0979    10 0.00173 Prepro…
#> 18 concordance_sur… standard           NA 0.826     10 0.00440 Prepro…

show_best(sr_rs_fit, metric = "brier_survival_integrated")
#> # A tibble: 1 × 7
#>   .metric           .estimator .eval_time   mean     n std_err .config
#>   <chr>             <chr>           <dbl>  <dbl> <int>   <dbl> <chr>  
#> 1 brier_survival_i… standard           NA 0.0979    10 0.00173 Prepro…

Tune a model

spec_tree <- 
  decision_tree(
    tree_depth = tune(), 
    min_n = tune(),
    cost_complexity = tune()
  ) %>% 
  set_engine("rpart") %>%
  set_mode("censored regression")

wflow_tree <- workflow() %>%
  add_recipe(telco_rec) %>%
  add_model(spec_tree)

Tune a model

set.seed(12) 
tree_res <- tune_grid(
  wflow_tree, 
  telco_rs, 
  grid = 10,
  metrics = metric_set(brier_survival_integrated, 
                       brier_survival,
                       roc_auc_survival, 
                       concordance_survival), 
  eval_time = c(1, 6, 12, 18, 24, 36, 48, 60)
)

Tune a model

show_best(tree_res, metric = "brier_survival_integrated")
#> # A tibble: 5 × 10
#>   cost_complexity tree_depth min_n .metric .estimator .eval_time  mean
#>             <dbl>      <int> <int> <chr>   <chr>           <dbl> <dbl>
#> 1        2.06e- 5          7     4 brier_… standard           NA 0.110
#> 2        1.15e- 6         13    37 brier_… standard           NA 0.111
#> 3        1.95e- 7          6    18 brier_… standard           NA 0.112
#> 4        2.73e- 9         15    32 brier_… standard           NA 0.113
#> 5        3.86e-10         12    22 brier_… standard           NA 0.114
#> # ℹ 3 more variables: n <int>, std_err <dbl>, .config <chr>

Finalize a model

best_param <- select_best(tree_res, metric = "brier_survival_integrated")
wflow_tree <- finalize_workflow(wflow_tree, best_param)

churn_mod <- fit(wflow_tree, telco_train)

predict(churn_mod, telco_test, type = "time")
#> # A tibble: 1,758 × 1
#>    .pred_time
#>         <dbl>
#>  1     0.211 
#>  2     1.95  
#>  3     0.0317
#>  4     0.211 
#>  5     0.325 
#>  6     5.37  
#>  7     0.349 
#>  8     3.27  
#>  9     1.84  
#> 10     5.37  
#> # ℹ 1,748 more rows

workflowsets - The kitchen sink

Preprocessors:

  • formula
  • combine streaming indicators
  • center and scale predictors
  • PCA + center and scale predictors

…and models:

  • parametric survival regression
  • proportional hazards model
  • decision tree
  • random forest
  • bagged decision tree
  • boosted tree
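
A hedged sketch of how such a grid can be declared with workflowsets; only telco_rec, telco_rec_streaming, spec_surv_reg, spec_tree, and telco_rs appear earlier in these slides, and the remaining recipes and specs from the lists above would be added analogously:

```r
library(workflowsets)

# Cross every preprocessor with every model specification
telco_set <- workflow_set(
  preproc = list(
    plain     = telco_rec,
    streaming = telco_rec_streaming
  ),
  models = list(
    surv_reg = spec_surv_reg,
    tree     = spec_tree
  )
)

# Fit or tune all combinations with a single call;
# extra arguments are passed through to tune_grid()
telco_results <- workflow_map(
  telco_set,
  fn = "tune_grid",
  resamples = telco_rs,
  grid = 10,
  metrics = metric_set(brier_survival_integrated),
  eval_time = c(1, 6, 12, 18, 24, 36, 48, 60)
)
```

rank_results() on the resulting object then compares all preprocessor-model combinations on one leaderboard.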

Resources

  • tidyverse: r4ds.hadley.nz
  • tidymodels: tmwr.org
  • Slides and code: github.com/hfrick/2024-tidymodels-emdserono

Thank you!