Cross-Validation
The list column for splits contains the information on which rows belong in the analysis and assessment sets. There are functions that can be used to extract the individual resampled data called analysis() and assessment().
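A minimal sketch of extracting those sets, assuming rsample is loaded and `train_data` is the training data frame (`folds` here is an illustrative 10-fold resampling object created with vfold_cv()):

```r
library(rsample)

# vfold_cv() returns a tibble with a `splits` list column
folds <- vfold_cv(train_data, v = 10)

first_split <- folds$splits[[1]]
analysis_set   <- analysis(first_split)   # rows used to fit the model
assessment_set <- assessment(first_split) # rows used to measure performance
```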
Cross-validation creates a series of data sets similar to the training set: a subset of the data is used to fit the model, and a different subset is used to measure performance.
Let's use 10-fold cross-validation (CV). This method randomly allocates the training set (\(750{,}000\) observations) to 10 groups of roughly equal size, called "folds". For the first fold, \(75{,}000\) randomly selected observations are held out of the training process for the purpose of measuring performance; in the tidymodels framework, these data are called the assessment set. The other \(90\%\) of the training_set (\(675{,}000\) observations) are used to fit the model and are called the analysis set in tidymodels. The model trained on the analysis set is applied to the assessment set to generate predictions, and performance statistics are computed from those predictions.
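The fold-size arithmetic above can be sketched in a few lines of base R (the variable names here are illustrative, not part of any package):

```r
# Fold-size arithmetic for 10-fold CV on the 750,000-row training set
n_total  <- 750000
v        <- 10
n_assess <- n_total / v          # 75,000 observations held out per fold
n_analys <- n_total - n_assess   # 675,000 observations used for fitting
n_analys / n_total               # 0.9, i.e. 90% of the training set
```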
Then, for the second fold, a new and different assessment set is drawn from the training_set. The fitting process iterates through all 10 folds, leaving a different 10% out each time for model assessment. At the end of this process, there are 10 sets of performance statistics, each computed on data that were not used to fit the corresponding model; in our example, this means 10 accuracies and 10 areas under the ROC curve. While 10 models are created along the way, they are not used further: their only purpose is to calculate performance metrics.
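This resampled-fitting loop is what tune::fit_resamples() does for us. A minimal sketch, assuming a workflow object `wf` (model plus preprocessing) has already been defined and that the tidymodels meta-package is installed; `wf`, `train_data`, and `folds` are names chosen for illustration:

```r
library(tidymodels)

# 10-fold CV resamples of the training data, stratified on the outcome
folds <- vfold_cv(train_data, v = 10, strata = Calories)

# Fit the workflow on each analysis set, predict on each assessment set
cv_results <- fit_resamples(wf, resamples = folds)

# Summarise the 10 per-fold statistics (mean and standard error)
collect_metrics(cv_results)
```

collect_metrics() averages the per-fold statistics, which is usually what you report; pass `summarize = FALSE` to see the individual fold results instead.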
Splitting Data with tidymodels::rsample
rsample creates an R object, data_split in our case, that contains the information on how to split the data. By default, the proportion of data retained for modeling is \(3/4\).
Here we used the strata argument, which performs a stratified split. This ensures that, despite the imbalance we noticed in our Calories variable, the training and testing sets keep roughly the same distribution of Calories as the original data.
After the initial_split, the training() and testing() functions return the actual data sets.
data_split <- initial_split(training_set, strata = Calories)
data_split
# Create tibbles for the two sets:
train_data <- training(data_split) # 562,498 observations
eval_data <- testing(data_split) # 187,502 observations