3 - What makes a model?

Survival analysis with tidymodels

Your turn

How do you fit a linear model in R?

How many different ways can you think of?

03:00

lm for linear model
glm for generalized linear model (e.g. logistic regression)
glmnet for regularized regression
keras for regression using TensorFlow
stan for Bayesian regression
spark for large data sets

To specify a model

Choose a model
Specify an engine
Set the mode

icon of a black cat walking to the right

To specify a model

proportional_hazards()
#> Proportional Hazards Model Specification (censored regression)
#> 
#> Computational engine: survival

To specify a model

proportional_hazards() |>
  set_engine("glmnet")
#> Proportional Hazards Model Specification (censored regression)
#> 
#> Computational engine: glmnet

To specify a model

decision_tree()
#> Decision Tree Model Specification (unknown mode)
#> 
#> Computational engine: rpart

To specify a model

decision_tree() |> 
  set_mode("censored regression")
#> Decision Tree Model Specification (censored regression)
#> 
#> Computational engine: rpart

All available models are listed at https://www.tidymodels.org/find/parsnip/

Your turn

Write the specification for a proportional hazards model.

Choose your engine.

All available models are listed at https://www.tidymodels.org/find/parsnip/

Extension/Challenge: Edit this code to use a different model. For example, try using a conditional inference tree as implemented in the partykit package - or try an entirely different model type!

03:00

Models we’ll be using today

Proportional hazards (PH) model
Decision tree

Proportional hazards model

Hazard modeled via a baseline hazard and a linear combination of predictors:

\(\lambda(t | x_i) = \lambda_0(t) \cdot \exp (x_i^T \beta)\)

Proportional hazards model

Hazard modeled via a baseline hazard and a linear combination of predictors:

\(\lambda(t | x_i) = \lambda_0(t) \cdot \exp (x_i^T \beta)\)

The hazard is proportional over all time \(t\), and thus also the probability of survival is proportional.

Decision tree

Series of splits or if/then statements based on predictors
First the tree grows until some condition is met (maximum depth, no more data)
Then the tree is pruned to reduce its complexity

Decision tree

All models are wrong, but some are useful!

PH model

Decision tree

A model workflow

Workflows bind preprocessors and models

Fit a model spec

tree_spec <-
  decision_tree() |> 
  set_mode("censored regression")

tree_spec |> 
  fit(event_time ~ ., data = cat_train) 
#> parsnip model object
#> 
#> $rpart
#> n= 1765 
#> 
#> node), split, n, deviance, yval
#>       * denotes terminal node
#> 
#> 1) root 1765 1981.0090 1.0000000  
#>   2) neutered=no,unknown 696  576.4117 0.5937228 *
#>   3) neutered=yes 1069 1318.2450 1.1714650  
#>     6) intake_condition=ill_mild,ill_moderatete,normal,under_age_or_weight,other 921 1063.5830 1.0808970 *
#>     7) intake_condition=feral,fractious 148  195.7649 2.3866130 *
#> 
#> $survfit
#> 
#> Call: prodlim::prodlim(formula = form, data = data)
#> 
#> Right-censored response of a survival model
#> 
#> No.Observations: 1765 
#> 
#> Pattern:
#>                 Freq
#>  event          1116
#>  right.censored 649 
#> 
#> $levels
#> [1] "0.593722848021764" "1.08089686322063"  "2.38661313111009" 
#> 
#> attr(,"class")
#> [1] "pecRpart"

Fit a model workflow

tree_spec <-
  decision_tree() |> 
  set_mode("censored regression")

workflow() |>
  add_formula(event_time ~ .) |>
  add_model(tree_spec) |>
  fit(data = cat_train) 
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: decision_tree()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> event_time ~ .
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> $rpart
#> n= 1765 
#> 
#> node), split, n, deviance, yval
#>       * denotes terminal node
#> 
#> 1) root 1765 1981.0090 1.0000000  
#>   2) neutered=no,unknown 696  576.4117 0.5937228 *
#>   3) neutered=yes 1069 1318.2450 1.1714650  
#>     6) intake_condition=ill_mild,ill_moderatete,normal,under_age_or_weight,other 921 1063.5830 1.0808970 *
#>     7) intake_condition=feral,fractious 148  195.7649 2.3866130 *
#> 
#> $survfit
#> 
#> Call: prodlim::prodlim(formula = form, data = data)
#> 
#> Right-censored response of a survival model
#> 
#> No.Observations: 1765 
#> 
#> Pattern:
#>                 Freq
#>  event          1116
#>  right.censored 649 
#> 
#> $levels
#> [1] "0.593722848021763" "1.08089686322063"  "2.38661313111009" 
#> 
#> attr(,"class")
#> [1] "pecRpart"

Fit a model workflow

tree_spec <-
  decision_tree() |> 
  set_mode("censored regression")

tree_fit <-
  workflow(event_time ~ ., tree_spec) |> 
  fit(data = cat_train)

Your turn

Make a workflow with your own model of choice.

Extension/Challenge: Other than formulas, what kinds of preprocessors are supported?

03:00

Predict with your model

How do you use your new tree_fit model?

predict(tree_fit, new_data = demo_cats)
#> # A tibble: 50 × 1
#>    .pred_time
#>         <dbl>
#>  1      1.08 
#>  2      0.594
#>  3      0.594
#>  4      1.08 
#>  5      2.39 
#>  6      0.594
#>  7      1.08 
#>  8      0.594
#>  9      1.08 
#> 10      0.594
#> # ℹ 40 more rows

Predict with your model

How do you use your new tree_fit model?

predict(tree_fit, new_data = demo_cats, type = "time")
#> # A tibble: 50 × 1
#>    .pred_time
#>         <dbl>
#>  1      1.08 
#>  2      0.594
#>  3      0.594
#>  4      1.08 
#>  5      2.39 
#>  6      0.594
#>  7      1.08 
#>  8      0.594
#>  9      1.08 
#> 10      0.594
#> # ℹ 40 more rows

Predict with your model

preds <- predict(tree_fit, new_data = demo_cats, type = "survival", 
                 eval_time = seq(0, 365, by = 5))
preds
#> # A tibble: 50 × 1
#>    .pred            
#>    <list>           
#>  1 <tibble [74 × 2]>
#>  2 <tibble [74 × 2]>
#>  3 <tibble [74 × 2]>
#>  4 <tibble [74 × 2]>
#>  5 <tibble [74 × 2]>
#>  6 <tibble [74 × 2]>
#>  7 <tibble [74 × 2]>
#>  8 <tibble [74 × 2]>
#>  9 <tibble [74 × 2]>
#> 10 <tibble [74 × 2]>
#> # ℹ 40 more rows

Predict with your model

preds$.pred[[1]]
#> # A tibble: 74 × 2
#>    .eval_time .pred_survival
#>         <dbl>          <dbl>
#>  1          0          1    
#>  2          5          1    
#>  3         10          0.923
#>  4         15          0.813
#>  5         20          0.749
#>  6         25          0.690
#>  7         30          0.637
#>  8         35          0.588
#>  9         40          0.536
#> 10         45          0.495
#> # ℹ 64 more rows

Predict with your model

Your turn

Run:

augment(tree_fit, new_data = demo_cats,
eval_time = seq(0, 365, by = 5))

What do you get?

03:00

The tidymodels prediction guarantee!

The predictions will always be inside a tibble
The column names and types are unsurprising and predictable
The number of rows in new_data and the output are the same

3 - What makes a model?

Your turn

To specify a model

To specify a model

To specify a model

To specify a model

To specify a model

Your turn

Models we’ll be using today

Proportional hazards model

Proportional hazards model

Proportional hazards model

Decision tree

Decision tree

All models are wrong, but some are useful!

PH model

Decision tree

A model workflow

Workflows bind preprocessors and models

Fit a model spec

Fit a model workflow

Fit a model workflow

Your turn

Predict with your model

Predict with your model

Predict with your model

Predict with your model

Predict with your model

Your turn

The tidymodels prediction guarantee!

The whole game - status update