Cross-validation
Cross-validation is a widely used validation strategy in the machine learning field when the goal is to find a model that generalizes well, meaning it delivers strong performance and robust results on previously unseen data.
To find such a model, it is common to split the available dataset into a training set and a test set. The training set is used to learn the appropriate model parameters, while the test set is used to assess the quality of the trained model on an independent dataset and to measure its prediction accuracy.
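For illustration, here is a minimal sketch of such a single split, assuming scikit-learn is available; the synthetic dataset, the linear model, and the mean squared error metric are illustrative choices rather than part of the procedure.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic regression data as a stand-in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out 25% of the data as an independent test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn the model parameters on the training set ...
model = LinearRegression().fit(X_train, y_train)

# ... and assess prediction accuracy on the held-out test set.
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Test MSE: {test_mse:.2f}")
```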
A simple out-of-sample validation, in which the available dataset is split only once into training and validation data, has two drawbacks: relevant patterns that appear only in the validation data cannot be captured by the model, and conversely, if the validation set is not representative of reality, no reliable conclusions about the model's generalizability can be drawn.
Cross-validation addresses this problem by repeatedly splitting the dataset into training and test sets so that each element appears in a test set exactly once. In this way, the model is ultimately trained and validated on (almost) all of the data. The trade-off is that cross-validation requires more computation than a simple out-of-sample validation.
The procedure for k-fold cross-validation works as follows:
- The dataset is divided into k (e.g., 5 or 10) equally sized blocks.
- The values in each of the k blocks are predicted by repeating the following steps once per block, i.e., k times in total (see the sketch after this list):
i) The model is trained on the remaining k-1 blocks (the training set).
ii) The values in the held-out block (the validation set) are predicted using the trained model.
iii) The predicted values are compared with the actual values and evaluated using a performance metric.
- The k performance scores obtained in this way are aggregated, for example, by averaging.
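The following sketch writes this procedure out explicitly, assuming scikit-learn's KFold splitter; the estimator and the mean squared error metric are again illustrative assumptions, not part of the procedure itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

k = 5
scores = []
# KFold divides the sample indices into k equally sized blocks.
for train_idx, val_idx in KFold(n_splits=k).split(X):
    # i) Train the model on the remaining k-1 blocks.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # ii) Predict the values in the held-out block.
    y_pred = model.predict(X[val_idx])
    # iii) Compare predictions with the actual values via a metric.
    scores.append(mean_squared_error(y[val_idx], y_pred))

# Aggregate the k fold scores, e.g. by averaging.
print(f"Mean MSE over {k} folds: {np.mean(scores):.2f}")
```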
In cross-validation, the data should ideally be independent. Since this is rarely the case for time series data, Bergmeir and Benítez (2012) proposed a modification for time series: the folds are kept as contiguous blocks, and some data points at the edges of the validation block are removed from the training set, which reduces the dependence between the training and validation sets. This method is called **k-fold blocked cross-validation** and is typically used for time series forecasting.
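A minimal sketch of this idea, written with plain NumPy: the folds are contiguous blocks, and a margin of points on each side of the validation block is dropped from the training indices. The function name `blocked_kfold_indices` and the `gap` margin size are illustrative assumptions, not a standard API.

```python
import numpy as np

def blocked_kfold_indices(n_samples, k=5, gap=3):
    """Yield (train_idx, val_idx) pairs for k contiguous blocks,
    removing `gap` points on each side of the validation block
    from the training indices to reduce dependence."""
    indices = np.arange(n_samples)
    for block in np.array_split(indices, k):
        lo, hi = block[0], block[-1]
        # Exclude the validation block plus a margin of `gap`
        # points on either side from the training set.
        train_mask = (indices < lo - gap) | (indices > hi + gap)
        yield indices[train_mask], block

# Example: 100 time-ordered observations, 5 blocks, margin of 3 points.
for train_idx, val_idx in blocked_kfold_indices(100, k=5, gap=3):
    print(f"val block {val_idx[0]:>2}-{val_idx[-1]:>2}, "
          f"train size {len(train_idx)}")
```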