Cross-validation

Cross-validation is a widely used validation strategy in machine learning when the goal is to find a model that generalizes well, meaning it delivers strong and robust performance on previously unseen data.

To find such a model, it is common to split the available dataset into a training set and a test set. The training set is used to learn the appropriate model parameters, while the test set is used to assess the quality of the trained model on an independent dataset and to measure its prediction accuracy.
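As a minimal sketch of such a single split, the following uses scikit-learn's `train_test_split`; the diabetes toy dataset, the linear model, and the 80/20 split ratio are illustrative assumptions, not prescriptions from the text.

```python
# A minimal sketch of a single train/test split with scikit-learn;
# dataset, model, and split ratio are placeholder choices.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the data as an independent test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)          # learn the parameters
mse = mean_squared_error(y_test, model.predict(X_test))   # assess on unseen data
print(f"Test MSE: {mse:.2f}")
```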

A simple out-of-sample validation, where the available dataset is split only once into training and validation data, has the disadvantage that patterns present only in the validation data cannot be captured by the model. Conversely, if the validation set is not representative of reality, no reliable conclusions about the generalizability of the model can be drawn.

Cross-validation addresses this problem by repeatedly splitting the dataset into training and test sets in such a way that each observation is used exactly once for testing. In each round the model is thus trained on almost all of the data, and across all rounds it is validated on every observation. In return, cross-validation requires more computation than a simple out-of-sample validation.

The procedure for k-fold cross-validation works as follows (a code sketch follows the list):

  1. The dataset is divided into k (e.g., 5 or 10) equally sized blocks.
  2. The following steps are then performed once for each of the k blocks, i.e., the process is repeated k times in total:

    i) The model is trained on the other k-1 blocks (the training set).

    ii) The values in the held-out block (the validation set) are predicted using the trained model.

    iii) The predicted values are compared with the actual values and evaluated using a performance metric.

  3. The performance metrics obtained in this way are aggregated, for example, by calculating an average.
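
The following sketch implements the three steps above with scikit-learn's `KFold` splitter; the dataset, the linear model, and the mean squared error as performance metric are again illustrative assumptions.

```python
# A sketch of the k-fold procedure described above; dataset, model,
# and metric are placeholder choices for illustration.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)  # step 1: k equal blocks
scores = []
for train_idx, val_idx in kf.split(X):                 # step 2: repeat k times
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # i) train on k-1 blocks
    y_pred = model.predict(X[val_idx])                          # ii) predict held-out block
    scores.append(mean_squared_error(y[val_idx], y_pred))       # iii) evaluate

print(f"Mean MSE over {k} folds: {np.mean(scores):.2f}")        # step 3: aggregate
```

In practice, scikit-learn's `cross_val_score` wraps these steps in a single call.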

In cross-validation, the observations should ideally be independent. Since this assumption is hard to satisfy for time series, whose observations are typically correlated over time, Bergmeir and Benítez (2012) proposed a modification of cross-validation for time series data: removing some data points at the edges of the validation period. This reduces the dependency between the training and validation sets. The resulting method is called **k-fold blocked cross-validation** and is commonly used for time series forecasting.
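
As a rough sketch of this idea, the following hand-rolled splitter builds contiguous (non-shuffled) validation blocks and drops a configurable number of observations on each side of each block from the training set; the function name `blocked_kfold_indices` and the `gap` parameter are hypothetical, chosen here only for illustration, and the sketch assumes the observations are ordered in time.

```python
# A hand-rolled sketch of k-fold blocked cross-validation, not a library API.
# Assumes the observations are ordered in time.
import numpy as np

def blocked_kfold_indices(n_samples, k=5, gap=2):
    """Yield (train_idx, val_idx) pairs with contiguous validation blocks;
    `gap` observations on each side of a block are dropped from training."""
    bounds = np.linspace(0, n_samples, k + 1, dtype=int)  # contiguous, unshuffled blocks
    for i in range(k):
        start, stop = bounds[i], bounds[i + 1]
        val_idx = np.arange(start, stop)
        train_mask = np.ones(n_samples, dtype=bool)
        # Remove the validation block plus `gap` points on each side
        # to weaken the temporal dependency between the two sets.
        train_mask[max(0, start - gap):min(n_samples, stop + gap)] = False
        yield np.flatnonzero(train_mask), val_idx

# Example: 20 time-ordered observations, 4 folds, a gap of 2 points.
for train_idx, val_idx in blocked_kfold_indices(20, k=4, gap=2):
    print("validate on", val_idx, "| train on", train_idx)
```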
