What’s new with tidymodels?

III Congreso de R 2024

Hannah Frick, Posit

👋 Who am I?

👋 Who are we?

tidymodels is a framework for modelling and machine learning using tidyverse principles.

What’s new with tidymodels?

New and released
New and in progress

New and released

Time-to-event data

Time-to-event data consists of two aspects: the time until the event and the event itself. We may or may not observe the event, leading to potentially censored observations.

Censored regression is now supported across the framework with various models and corresponding performance metrics.
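
To make the two aspects concrete, here is a minimal sketch using `Surv()` from the survival package, which is the usual way to represent such an outcome in R. The follow-up times and event indicators are made up for illustration.

```r
library(survival)

# Hypothetical follow-up times in days (made-up values).
# event = 1: the event was observed at that time.
# event = 0: the observation was censored at that time.
follow_up <- data.frame(
  time  = c(5, 8, 12, 20),
  event = c(1, 0, 1, 0)
)

# Surv() bundles both aspects into a single outcome object
outcome <- Surv(time = follow_up$time, event = follow_up$event)

# Count the censored observations from the underlying matrix
n_censored <- sum(unclass(outcome)[, "status"] == 0)
n_censored
```

An outcome of this kind goes on the left-hand side of the model formula when fitting a censored regression model.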

Time-to-event data

The introduction to time-to-event data and models:
tidymodels for time-to-event data by yours truly at posit::conf(2024)

The details of measuring performance for time-to-event models:
Evaluating time-to-event models is hard by Max at posit::conf(2024)

Fairness

tidymodels includes metrics to support your thinking around fairness.

Fair machine learning by Simon at posit::conf(2024)
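
As a rough illustration of what a fairness metric can measure, the sketch below computes a demographic parity difference by hand in base R. This is not the tidymodels implementation; the data, the column names, and the choice of metric are assumptions made for the example.

```r
# Hypothetical classifier predictions with a protected attribute
# (made-up data, not from the talk).
preds <- data.frame(
  group      = c("a", "a", "a", "a", "b", "b", "b", "b"),
  prediction = c("yes", "yes", "no", "no", "yes", "no", "no", "no")
)

# Rate of positive predictions within each group
positive_rate <- tapply(preds$prediction == "yes", preds$group, mean)

# Demographic parity difference: 0 means equal rates across groups
parity_diff <- max(positive_rate) - min(positive_rate)
parity_diff
```

A value far from zero is a prompt to look closer, not a verdict on its own.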

Prediction in databases

The tidypredict package lets you calculate predictions from a parsnip model in a database.

The new orbital package lets you do that from a workflow, including workflows with a recipe.

tidypredict with recipes by Emil at posit::conf(2024)

New and in progress

Post-processing

What you do after the model, to the model predictions

Deliveries data

library(tidymodels)

data(deliveries)

deliveries
#> # A tibble: 10,012 × 31
#>    time_to_delivery  hour day   distance item_01 item_02 item_03
#>               <dbl> <dbl> <fct>    <dbl>   <int>   <int>   <int>
#>  1             16.1  11.9 Thu       3.15       0       0       2
#>  2             22.9  19.2 Tue       3.69       0       0       0
#>  3             30.3  18.4 Fri       2.06       0       0       0
#>  4             33.4  15.8 Thu       5.97       0       0       0
#>  5             27.2  19.6 Fri       2.52       0       0       0
#>  6             19.6  13.0 Sat       3.35       1       0       0
#>  7             22.1  15.5 Sun       2.46       0       0       1
#>  8             26.6  17.0 Thu       2.21       0       0       1
#>  9             30.8  16.7 Fri       2.62       0       0       0
#> 10             17.4  11.9 Sun       2.75       0       2       1
#> # ℹ 10,002 more rows
#> # ℹ 24 more variables: item_04 <int>, item_05 <int>, item_06 <int>,
#> #   item_07 <int>, item_08 <int>, item_09 <int>, item_10 <int>,
#> #   item_11 <int>, item_12 <int>, item_13 <int>, item_14 <int>,
#> #   item_15 <int>, item_16 <int>, item_17 <int>, item_18 <int>,
#> #   item_19 <int>, item_20 <int>, item_21 <int>, item_22 <int>,
#> #   item_23 <int>, item_24 <int>, item_25 <int>, item_26 <int>, …

Deliveries data

# split into training and testing sets
set.seed(1)
delivery_split <- initial_split(deliveries)
delivery_train <- training(delivery_split)
delivery_test  <- testing(delivery_split)

# resample the training set using 10-fold cross-validation
set.seed(1)
delivery_folds <- vfold_cv(delivery_train)

A baaaad model

delivery_wflow <- workflow() %>%
  add_formula(time_to_delivery ~ .) %>%
  add_model(boost_tree(mode = "regression", trees = 3))

A baaaad model

set.seed(1)
delivery_res <- fit_resamples(
    delivery_wflow, 
    delivery_folds, 
    control = control_resamples(save_pred = TRUE)
  )

What is your metric measuring?

collect_metrics(delivery_res)
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   9.52     10 0.0533  Preprocessor1_Model1
#> 2 rsq     standard   0.853    10 0.00357 Preprocessor1_Model1

But what about calibration?

library(probably)

collect_predictions(delivery_res) %>%
  cal_plot_regression(truth = time_to_delivery, estimate = .pred)

Post-processing model predictions

Why

Improve predictive performance
Better satisfy distributional limitations

How

Currently: via probably or dplyr
Coming: Include specification in workflow object
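
To give a feel for the "by hand" approach, here is a minimal sketch of a range adjustment on numeric predictions. It uses base R rather than probably or dplyr to stay dependency-free, and both the predictions and the upper limit of 60 minutes are made up.

```r
# Hypothetical predicted delivery times in minutes (made-up values)
preds <- data.frame(.pred = c(-2.5, 14.1, 31.8, 75.0))

# Delivery times cannot be negative. The upper limit is an arbitrary
# choice for illustration only.
lower <- 0
upper <- 60

# Clamp each prediction into [lower, upper]
preds$.pred_adjusted <- pmin(pmax(preds$.pred, lower), upper)
preds
```

Done this way, the adjustment lives outside the model object, so it has to be repeated manually wherever predictions are made.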

Meet tailor 👋

The tailor package introduces tailor objects, which compose iterative adjustments to model predictions.

tailor is to postprocessing as recipes is to preprocessing.

Meet tailor 👋

Tool	Applied to...	Initialize with...	Composes...	Train with...	Predict with...
recipes	Training data	`recipe()`	`step_*()`s	`prep()`	`bake()`
tailor	Model predictions	`tailor()`	`adjust_*()`ments	`fit()`	`predict()`

tailor with a workflow

library(tailor)

delivery_tlr <- tailor() %>% 
    adjust_numeric_calibration()

delivery_wflow_improved <- delivery_wflow %>%
  add_tailor(delivery_tlr)

set.seed(1)
delivery_res_improved <- fit_resamples(
    delivery_wflow_improved, 
    delivery_folds, 
    control = control_resamples(save_pred = TRUE)
  )

collect_predictions(delivery_res_improved) %>%
  cal_plot_regression(truth = time_to_delivery, estimate = .pred)

Post-processing via tailor

For probabilities: calibration
For transformation of probabilities to hard class predictions: thresholds, equivocal zones
For numeric outcomes: calibration, range
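
The second item, thresholds and equivocal zones, can be sketched in a few lines of base R. This is a hand-rolled illustration of the idea, not the tailor implementation; the probabilities, class names, cutoff, and buffer width are all assumptions.

```r
# Hypothetical predicted probabilities for the class "late"
prob_late <- c(0.05, 0.30, 0.45, 0.58, 0.93)

threshold <- 0.5  # cutoff between the two classes
buffer    <- 0.1  # half-width of the equivocal zone around the cutoff

# Turn probabilities into hard class predictions
hard_class <- ifelse(prob_late >= threshold, "late", "on_time")

# Predictions too close to the cutoff are reported as equivocal
hard_class[abs(prob_late - threshold) < buffer] <- "equivocal"
hard_class
```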

Post-processing in tidymodels

Implemented:

tailor
Support in workflows
Support via tune::fit_resamples()
Support via rsample for some resamples

To come:

Support via rsample for the rest of the resamples
Tuning parameters via tune

Feedback welcome

Best via GitHub issues on tidymodels/tailor

Sparse tibbles

Sparse data

When creating indicators for categorical variables, tokenizing text, or with graph datasets, you can easily end up with a lot of variables – and a lot of zeros within them.

The information in that data is rather sparse.

Challenges

This can be challenging in terms of memory and speed.

We can address this via a different data representation.

Default: dense representation

c(100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 42, 0, 0, 0)

Store all 25 values.

Sparse representation

c(100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 42, 0, 0, 0)

Only 5 values necessary:

1 for the length: 25
2 for the locations of the non-zero values: 1, 22
2 for the non-zero values: 100, 42
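
The same bookkeeping, written out as a small base R sketch. The list structure is only an illustration of the idea, not how any particular package stores sparse vectors.

```r
# The dense vector from above: 100 at position 1, 42 at position 22
dense <- c(100, rep(0, 20), 42, 0, 0, 0)

# Keep only what is needed to rebuild the vector
sparse <- list(
  length    = length(dense),
  positions = which(dense != 0),
  values    = dense[dense != 0]
)

# Rebuild the dense vector from the sparse pieces
rebuilt <- numeric(sparse$length)
rebuilt[sparse$positions] <- sparse$values

identical(rebuilt, dense)
```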

Sparsity in R

The Matrix package implements sparse matrices and sparse vectors along with efficient matrix operations.

library(Matrix)

sparse_vec <- sparseVector(c(100, 42), i = c(1, 22), length = 25)
sparse_vec
#> sparse vector (nnz/length = 2/25) of class "dsparseVector"
#>  [1] 100   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .
#> [17]   .   .   .   .   .  42   .   .   .

Sparsity in tidymodels

tibble(y = 1:25, x = sparse_vec)
#> Error in `tibble()`:
#> ! All columns in a tibble must be vectors.
#> ✖ Column `x` is a `dsparseVector` object.

What are we doing, what are we not doing, and why?

Sparsity in tidymodels

Goals:

Preserve existing sparsity across the framework
Make use of (i.e. add) sparsity where beneficial
Make things “easy”, e.g., let you (and us) use dplyr verbs

➔ We want sparse vectors in a tibble.

Welcome sparsevctrs
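
A minimal sketch of what this could look like, assuming the sparsevctrs and tibble packages are installed. It recreates the 25-value example with `sparse_double()` and places it in a tibble column.

```r
library(sparsevctrs)
library(tibble)

# Same 25 values as in the earlier example:
# 100 at position 1, 42 at position 22, zeros elsewhere
x_sparse <- sparse_double(
  values = c(100, 42),
  positions = c(1, 22),
  length = 25
)

is_sparse_vector(x_sparse)

# Unlike the Matrix sparse vector, this can be a tibble column
sparse_tbl <- tibble(y = 1:25, x = x_sparse)
sparse_tbl
```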

Sparse tibbles in tidymodels

Implemented:
- recipes: recipe(), prep(), and bake()
- parsnip: fit_xy() (based on engine) and predict()
- workflows: fit() (with a recipe) and predict()

To come: recipe steps

Maaaybe: formula interface for parsnip (via fit()) and workflows (via add_formula())

Feedback welcome

For sparsevctrs itself: r-lib/sparsevctrs

For sparsity in context: the corresponding repository, e.g., tidymodels/recipes

Better errors

When your recipe fails, previously

library(recipes)

data("ames", package = "modeldata")

recipe(~., data = ames) |>
  step_novel(Neighborhood, new_level = "Gilbert") |>
  prep()
#> Error in `prep()`:
#> ! Columns already contain the new level: Neighborhood

When your recipe fails, now

library(recipes)

data("ames", package = "modeldata")

recipe(~., data = ames) |>
  step_novel(Neighborhood, new_level = "Gilbert") |>
  prep()
#> Error in `step_novel()`:
#> Caused by error in `prep()`:
#> ! Columns already contain the new level: Neighborhood.

Upkeep: error edition

What makes an error helpful?

What happened?
Where did it happen?
Why did it happen?
How to fix it?

What happened?

time_to_water <- function(plant) {
  if (!is.character(plant)) {
    rlang::abort("`plant` must be a string.")
  }
  msg <- paste("All good, the", plant, "doesn't need to be watered yet.")
  cat(paste0(msg, "\n"))
}

time_to_water(5)
#> Error in `time_to_water()`:
#> ! `plant` must be a string.

What happened?

time_to_water <- function(plant) {
  if (!is.character(plant)) {
    cli::cli_abort("{.arg plant} must be a string.")
  }
  msg <- paste("All good, the", plant, "doesn't need to be watered yet.")
  cat(paste0(msg, "\n"))
}

time_to_water(5)
#> Error in `time_to_water()`:
#> ! `plant` must be a string.

Interlude with more cli options

time_to_water <- function(plant) {
  if (!is.character(plant)) {
    cli::cli_abort("{.arg plant} must be a string.")
  }
  cli::cli_text("All good, the {plant} {?doesn't/don't} need to be watered yet.")
}

time_to_water("monstera")
#> All good, the monstera doesn't need to be watered yet.

time_to_water(c("monstera", "that other plant"))
#> All good, the monstera and that other plant don't need to be watered
#> yet.

For more see: ?cli::pluralization

Formatting options in cli

library(cli)

cli_text("A piece of code: {.code sum(a) / length(a)}")
#> A piece of code: `sum(a) / length(a)`

cli_text("A class: {.cls lm}")
#> A class: <lm>

cli_text("A function name: {.fn cli_text}")
#> A function name: `cli_text()`

For more see ?cli::`inline-markup`

Link to the docs

predict.model_spec <- function(object, ...) {
  cli::cli_abort(
    "You must {.fun fit} your {.help [model specification](parsnip::model_spec)}
     before you can use {.fun predict}."
  )
}

linear_reg() %>% predict()
#> Error in `predict()`:
#> ! You must `fit()` your model specification
#>   (`?parsnip::model_spec()`) before you can use `predict()`.

Where did it happen?

time_to_water <- function(plant) {
  if (!is.character(plant)) {
    cli::cli_abort("{.arg plant} must be a string.")
  }
  cli::cli_text("All good, the {plant} {?doesn't/don't} need to be watered yet.")
}

plant_care <- function(plant) {
  time_to_water(plant)
  #time_to_repot(plant)
}

plant_care(5)
#> Error in `time_to_water()`:
#> ! `plant` must be a string.

Where did it happen?

time_to_water <- function(plant, call = rlang::caller_env()) {
  if (!is.character(plant)) {
    cli::cli_abort("{.arg plant} must be a string.", call = call)
  }
  cli::cli_text("All good, the {plant} {?doesn't/don't} need to be watered yet.")
}

plant_care <- function(plant) {
  time_to_water(plant)
  #time_to_repot(plant)
}

plant_care(5)
#> Error in `plant_care()`:
#> ! `plant` must be a string.

Why did it happen?

time_to_water <- function(plant) {
  if (!is.character(plant)) {
    cli::cli_abort("{.arg plant} must be a string.")
  }
  cli::cli_text("All good, the {plant} {?doesn't/don't} need to be watered yet.")
}

time_to_water(5)
#> Error in `time_to_water()`:
#> ! `plant` must be a string.

Why did it happen?

time_to_water <- function(plant) {
  if (!is.character(plant)) {
    cli::cli_abort("{.arg plant} must be a string, not {.obj_type_friendly {plant}}.")
  }
  cli::cli_text("All good, the {plant} {?doesn't/don't} need to be watered yet.")
}

time_to_water(5)
#> Error in `time_to_water()`:
#> ! `plant` must be a string, not a number.

lm(mpg ~., mtcars) |> time_to_water()
#> Error in `time_to_water()`:
#> ! `plant` must be a string, not a <lm> object.

How to fix it?

time_to_water <- function(plant) {

  cli::cli_abort(
    c(
        "It's time to water the {plant}.",
        i = "Look for sad leaves to avoid watering too late."
    )
  )
}

time_to_water("lily")
#> Error in `time_to_water()`:
#> ! It's time to water the lily.
#> ℹ Look for sad leaves to avoid watering too late.

Helpful errors, revisited

What happened?
→ Error message. cli has: styling, interpolation, pluralization, links
Where did it happen?
→ Call. Thread it through.
Why did it happen?
→ Provide not only what was supposed to happen but also what did happen.
How to fix it?
→ Help with hints and links, in bulleted form.
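
The four answers can be combined in a single function. The sketch below reuses the plant example from the earlier slides and assumes cli and rlang are installed; the wording of the hint is made up for illustration.

```r
time_to_water <- function(plant, call = rlang::caller_env()) {
  if (!is.character(plant)) {
    cli::cli_abort(
      c(
        # What happened and why: expected type plus the type received
        "{.arg plant} must be a string, not {.obj_type_friendly {plant}}.",
        # How to fix it: a hint as an info bullet (hypothetical wording)
        i = "Pass the name of the plant as a quoted string."
      ),
      # Where it happened: report the user-facing caller
      call = call
    )
  }
  cli::cli_text("All good, the {plant} {?doesn't/don't} need to be watered yet.")
}

plant_care <- function(plant) {
  time_to_water(plant)
}

# Capture the error instead of stopping, to inspect its message
err <- tryCatch(plant_care(5), error = function(e) e)
conditionMessage(err)
```

Calling `plant_care(5)` directly would raise the error instead of returning it; `tryCatch()` is used here only so the message can be inspected.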

Where to get the news

Announcements on https://www.tidyverse.org/blog/
Detailed articles on https://www.tidymodels.org/