Training and Validation
Machine learning and forecasting models are typically trained, or estimated, on data. The goal is to find a model that generalizes well, i.e., one that performs well on new, previously unseen data and yields robust results.
Hyperparameter
A hyperparameter is a parameter that configures a machine learning algorithm and cannot be estimated by the training procedure itself. In Artificial Neural Networks (ANNs), for example, hyperparameters include the number of training epochs or the number of hidden layers. Hyperparameters are set by the user before training, often manually.
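As a minimal sketch, assuming scikit-learn's MLPClassifier as the ANN implementation: the constructor arguments below are hyperparameters fixed by the user before training, while the network weights are the ordinary parameters estimated from the data.

```python
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(32, 16),  # hyperparameter: two hidden layers
    max_iter=200,                 # hyperparameter: cap on training epochs
    learning_rate_init=0.001,     # hyperparameter: initial learning rate
)
# model.fit(X_train, y_train) would then estimate the weights (the
# ordinary model parameters) from the training data.
```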
Overfitting
Overfitting occurs when a model adapts too closely to the training data and begins to detect patterns that do not actually exist. This is particularly common in time series models when, in addition to important structural components such as trend and seasonality, the model also captures the random fluctuations of the training data.
Such models generalize poorly: they do not produce stable results on new, unseen data that was not used for training. The cause of overfitting is excessive model complexity, i.e., the model contains too many parameters or explanatory variables that are fitted too closely to the training data. Several strategies help to avoid or reduce overfitting, such as:
- Out-of-sample validation on an independent test dataset
- Cross-validation
- Penalization of model complexity through regularization, such as in regularized regression (illustrated in the sketch after this list)
- Pruning (e.g., in decision tree methods)
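As an illustration of overfitting and of regularization as a countermeasure, here is a small synthetic sketch assuming numpy and scikit-learn; the data, the polynomial degree, and the penalty strength alpha are illustrative choices only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a smooth signal plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 30)

# Hold back the last 10 points as an independent test set.
X_train, X_test = X[:20], X[20:]
y_train, y_test = y[:20], y[20:]

# A degree-15 polynomial has almost as many parameters as training
# points and therefore also fits the random noise (overfitting) ...
overfit = make_pipeline(PolynomialFeatures(15), LinearRegression())
overfit.fit(X_train, y_train)

# ... while an L2 penalty (ridge regression) shrinks the coefficients
# and keeps the fit closer to the underlying signal.
regularized = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0))
regularized.fit(X_train, y_train)

for name, model in [("unregularized", overfit), ("ridge", regularized)]:
    print(f"{name}: "
          f"train MSE = {mean_squared_error(y_train, model.predict(X_train)):.3f}, "
          f"test MSE = {mean_squared_error(y_test, model.predict(X_test)):.3f}")
```

The unregularized model typically shows a near-zero training error but a much larger test error; this gap is exactly what out-of-sample validation is meant to expose.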
Out-of-sample Validation
To find a model that generalizes well, it is common to split the available dataset into a training set and a test set. The former is used to learn the model parameters, while the latter is used to evaluate the trained model on independent data and to assess its prediction accuracy.
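A minimal sketch of such a single train/test split, assuming scikit-learn; the synthetic feature matrix, targets, and split ratio are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 observations with 3 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

# Hold back 25% of the data as an independent test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)           # learn parameters
print("out-of-sample R^2:", model.score(X_test, y_test))   # evaluate on unseen data
```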
In the time series context, out-of-sample validation must additionally account for the temporal order and dependencies of the individual data points, as in backtesting.
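One way to respect the temporal order, sketched here with scikit-learn's TimeSeriesSplit (one of several possible backtesting schemes), is to validate each model only on observations that come after its training window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

observations = np.arange(12)  # stand-in for 12 consecutive time steps

for fold, (train_idx, test_idx) in enumerate(
        TimeSeriesSplit(n_splits=3).split(observations)):
    print(f"fold {fold}: train on {train_idx}, test on {test_idx}")
# Every test index lies after all of its training indices, mimicking a
# backtest that always forecasts forward in time; a random split would
# leak future information into training.
```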
Cross-validation
Cross-validation is a common validation strategy in machine learning for avoiding overfitting and obtaining robust results on unseen data.
A simple out-of-sample validation, in which the available dataset is split only once into training and validation data, has the disadvantage that relevant patterns present only in the validation data cannot be captured by the model. Conversely, if the validation set does not accurately reflect reality, no reliable conclusions about the model's generalizability can be drawn. Cross-validation addresses this by repeatedly splitting the dataset into training and test sets until every element has been used exactly once in a test set. In this way, the model is trained on (almost) all of the data and validated on every data point at least once.
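As a sketch, assuming scikit-learn, k-fold cross-validation can be run as follows; the synthetic data and the choice of k = 5 folds are illustrative (for time series, a splitter like TimeSeriesSplit would replace the shuffled KFold):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: 100 observations with 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(0, 0.1, 100)

# Split into 5 folds; each fold serves exactly once as the test set.
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
print("R^2 per fold:", scores)
print("mean R^2:", scores.mean())
```

Averaging the scores over all folds gives a more stable estimate of generalization performance than a single train/test split.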