Multi-Class Classification
Multi-class classification refers to classification problems where data must be assigned to exactly one of three or more possible classes. For example, if we try to assign the correct fruit to images of apples, pears, and oranges, we are dealing with a multi-class classification problem.
Strategies for Multi-Class Classification
There are various strategies for approaching a multi-class classification problem. Some methods, such as the support vector classifier (SVC) or a simple perceptron, are designed for binary classification but can still be used for multi-class classification by reducing the problem to several binary classifications (see One-vs-All and the following sections). Other methods can inherently handle more than two classes. These include, among others:
- Decision Tree Methods
- Multilayer Perceptron (MLP)
- Nearest Neighbor Methods
Below, we briefly introduce several strategies to convert a multi-class classification problem into multiple binary classification problems.
One-vs-All (OvA)
In "One-vs-All" (OvA), a binary classifier is trained for each class to distinguish that class from all other classes. In the end, the class whose binary classifier "fires" is chosen. In practice, multiple classifiers (or none at all) may return a positive result, so a subsequent step is needed to reach a final decision, for example by comparing classification probabilities (how confident is each classifier in its prediction?).
Potential disadvantages:
- No balanced training data, as typically each classifier sees far fewer positive examples (the class currently being considered) than negative examples (all other classes).
- Depending on the definition, the confidence values of the various classifiers may vary in magnitude.
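The OvA strategy can be sketched with scikit-learn's OneVsRestClassifier wrapper (assuming scikit-learn is installed; the Iris dataset here simply stands in for any three-class problem):

```python
# One-vs-All sketch (assumes scikit-learn is installed).
# OneVsRestClassifier trains one binary classifier per class and breaks
# ties between "firing" classifiers using their decision scores.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ova.estimators_))  # one binary classifier per class -> 3
```

Note that each of the three binary classifiers sees only ~50 positive but ~100 negative examples, illustrating the imbalance mentioned above.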
One-vs-One (OvO)
In "One-vs-One" (OvO), a binary classifier is trained for each pair of classes to distinguish between the two. In the end, the class that receives the most classifier "votes" is chosen. In practice, several classes may receive the same number of votes, so a quantification of confidence in each classifier's decision is needed to break ties.
Potential disadvantages:
- Many classifiers must be trained: for K classes, K(K-1)/2 pairwise classifiers.
- Depending on the definition, the confidence values of the various classifiers may vary in magnitude.
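A corresponding sketch for OvO, again assuming scikit-learn is available:

```python
# One-vs-One sketch (assumes scikit-learn is installed).
# OneVsOneClassifier trains one binary classifier per pair of classes
# and predicts the class that wins the most pairwise "votes".
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovo.estimators_))  # K*(K-1)/2 pairwise classifiers -> 3
```

For 3 classes OvO and OvA happen to need the same number of classifiers; for 10 classes (e.g. MNIST), OvO already requires 45 instead of 10.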
Error-Correcting Output Code
The different classes are encoded using binary sequences of a fixed length (codewords), where the number of bits in the binary sequences is usually greater than the number of classes. A binary classifier is trained for each bit in the binary sequence. A new object is assigned to the class whose binary code is closest to the predicted binary sequence. The motivation behind this strategy is that if the binary sequence length is greater than the number of classes, errors from individual classifiers can be corrected by the others. For this error-correcting function to be effective, it is important that the various classifiers are uncorrelated, i.e., learn different features of the objects. Uncorrelated classifiers are promoted, for example, by a clever choice of codewords.
Potential disadvantages:
- Although more classifiers are trained than there are classes, and thus errors from individual classifiers can potentially be corrected, it may happen that all classifiers learn very similar patterns and thus cannot correct each other.
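The strategy can be sketched with scikit-learn's OutputCodeClassifier (assuming scikit-learn is installed); the `code_size` parameter controls how many bits are used per class:

```python
# Error-correcting output code sketch (assumes scikit-learn is installed).
# code_size=2 means each codeword has 2 * n_classes bits, so errors of
# individual bit-classifiers can potentially be corrected.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
ecoc = OutputCodeClassifier(
    LogisticRegression(max_iter=1000), code_size=2, random_state=0
).fit(X, y)

print(ecoc.code_book_.shape)  # (3 classes, 6 bits per codeword)
```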
Hierarchical Classification
The idea behind hierarchical classification is to construct a tree whose leaves represent the different classes. One possible structure could be to distinguish between class A and the remaining classes at the first level, then between class B and the remaining classes at the second level, and so on. This would correspond to a kind of OvA with hierarchy. However, other tree structures are also conceivable, and the specific use case may suggest a particular tree structure: For example, if we were classifying different dog and cat breeds, we might first distinguish between dogs and cats at the first level, and then identify the specific breed at the second level.
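The dogs-and-cats example can be sketched in plain Python; the classifiers below are toy callables with made-up thresholds, standing in for trained models:

```python
# Minimal hierarchical-classification sketch. The "classifiers" are
# illustrative callables; in practice they would be trained models.

def hierarchical_predict(x, root_clf, sub_clfs):
    """Route x through the root classifier, then the matching subtree."""
    branch = root_clf(x)        # level 1: e.g. "dog" or "cat"
    return sub_clfs[branch](x)  # level 2: the specific breed

# Toy rules based on a single (hypothetical) weight feature:
root = lambda x: "dog" if x["weight_kg"] > 8 else "cat"
sub = {
    "dog": lambda x: "labrador" if x["weight_kg"] > 25 else "beagle",
    "cat": lambda x: "maine_coon" if x["weight_kg"] > 5 else "siamese",
}

print(hierarchical_predict({"weight_kg": 30}, root, sub))  # labrador
print(hierarchical_predict({"weight_kg": 4}, root, sub))   # siamese
```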
Evaluation of Multi-Class Classifiers
The confusion matrix shows the relationship between the predicted class memberships from the classifier and the actual class memberships. In the example below, the columns represent the predicted class memberships by the classifier and the rows represent the actual class memberships, using a classifier for the MNIST dataset of handwritten digits 0-9.
| | 0 (Classifier) | 1 (Classifier) | 2 (Classifier) | 3 (Classifier) | 4 (Classifier) | 5 (Classifier) | 6 (Classifier) | 7 (Classifier) | 8 (Classifier) | 9 (Classifier) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 (Actual) | 970 | 1 | 1 | 0 | 0 | 2 | 2 | 1 | 3 | 0 |
| 1 (Actual) | 0 | 1124 | 2 | 3 | 0 | 2 | 3 | 0 | 1 | 0 |
| 2 (Actual) | 6 | 0 | 999 | 3 | 4 | 0 | 4 | 8 | 7 | 1 |
| 3 (Actual) | 0 | 0 | 11 | 973 | 0 | 7 | 0 | 8 | 7 | 4 |
| 4 (Actual) | 1 | 0 | 1 | 0 | 957 | 0 | 5 | 0 | 2 | 16 |
| 5 (Actual) | 5 | 0 | 0 | 11 | 2 | 859 | 7 | 2 | 4 | 2 |
| 6 (Actual) | 7 | 3 | 1 | 0 | 5 | 5 | 935 | 0 | 2 | 0 |
| 7 (Actual) | 1 | 3 | 21 | 1 | 2 | 0 | 0 | 989 | 4 | 7 |
| 8 (Actual) | 3 | 1 | 4 | 3 | 5 | 6 | 3 | 4 | 935 | 10 |
| 9 (Actual) | 6 | 5 | 2 | 12 | 10 | 3 | 1 | 5 | 4 | 961 |
The number in the i-th row and j-th column of the matrix indicates how many times the classifier predicted the j-th class for objects that actually belong to class i. Specifically, the total number of correctly classified objects is obtained by summing the diagonal entries.
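A confusion matrix with the same row/column convention can be computed with scikit-learn (assuming it is installed); the labels below are toy data, not the MNIST results above:

```python
# Small confusion-matrix example (assumes scikit-learn is installed).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted
print(cm)
print(cm.trace())  # sum of the diagonal = correctly classified objects -> 5
```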
As with binary classifications, precision and sensitivity are important metrics to quantify the performance of a classifier. However, in the multi-class case, there is not just one value for precision and sensitivity, but the two quantities are defined per class. As a scalar measure that accounts for both precision and sensitivity, the F1-score is often considered. If one wishes to weight precision and sensitivity differently, the Fbeta-score can be used as a generalization of the F1-score.
- Precision: What percentage of images classified as apples by the classifier are actually apples?
- Sensitivity (Recall): What percentage of images that actually show apples were correctly recognized by the classifier?
- F1-score: Harmonic mean of precision and sensitivity.
- Fbeta-score: Weighted harmonic mean of precision and sensitivity. Can be used to emphasize one of the two metrics more strongly.
Since precision, sensitivity, and F1-score (or more generally Fbeta-score) are defined per class, we need to average the values to derive a single scalar metric for the classifier's performance. Here, we distinguish between the macro average and the weighted average.
- Macro average: Simple arithmetic mean of the metric values across all classes.
- Weighted average: Weighted arithmetic mean of the metric values across all classes, where the weights are the proportions of objects in each class relative to the total number of objects.
For example, if the precision for apples is 80%, for oranges 70%, and for pears 90%, the macro average of the precision is (80% + 70% + 90%)/3 = 80%. For the weighted average, the metric for each class is weighted by the number of observations in that class. If we have 20 apples, 30 oranges, and 50 pears, and the precisions are as stated, the weighted average is (20×80% + 30×70% + 50×90%)/(20+30+50) = 82%. The macro and weighted averages for sensitivity and the F1-score are computed analogously.
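The fruit example can be recomputed in plain Python:

```python
# Macro vs. weighted average, recomputing the fruit example from the text.
per_class_precision = {"apple": 0.80, "orange": 0.70, "pear": 0.90}
class_counts = {"apple": 20, "orange": 30, "pear": 50}

# Macro average: simple arithmetic mean over the classes.
macro = sum(per_class_precision.values()) / len(per_class_precision)

# Weighted average: each class weighted by its number of observations.
weighted = sum(
    per_class_precision[c] * class_counts[c] for c in class_counts
) / sum(class_counts.values())

print(round(macro, 2))     # 0.8
print(round(weighted, 2))  # 0.82
```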
The accuracy, finally, is the proportion of objects correctly classified by the classifier across all classes.
Handling Imbalanced Datasets
In a dataset for a multi-class classification problem, if not all classes contain (approximately) the same number of objects, but instead some classes contain significantly more than others, we refer to it as an imbalanced dataset. An imbalanced dataset can pose challenges when training a classifier because classes with more objects are often favored. For instance, if 99% of the observations in a dataset belong to class A and only 1% to class B, a classifier that assigns class A to every object already achieves an accuracy of 99%. Using a naive evaluation strategy, such a classifier would likely be rated better than many classifiers that occasionally select class B.
The following strategies are commonly used to handle imbalanced datasets:
- Undersampling: Reduce the dataset so that each class contains approximately as many objects as the smallest class.
- Oversampling: Increase the dataset, either through duplicates or synthetically generated data, so that each class contains approximately as many objects as the largest class. One algorithm for oversampling through synthetically generated data is SMOTE (Synthetic Minority Over-sampling Technique).
- Cost-Sensitive Learning: Many machine learning algorithms provide options to penalize misclassifications differently depending on the actual class of the object. This allows the classifier training to take the differing class sizes into account.
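Cost-sensitive learning can be sketched with scikit-learn's `class_weight` option (assuming scikit-learn and NumPy are installed); the imbalanced toy data below is synthetic:

```python
# Cost-sensitive learning sketch (assumes scikit-learn is installed).
# class_weight="balanced" penalizes errors on rare classes more heavily,
# so the minority class is not simply ignored during training.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(990, 2)),  # majority class 0
    rng.normal(3.0, 1.0, size=(10, 2)),   # minority class 1
])
y = np.array([0] * 990 + [1] * 10)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(set(clf.predict(X)))  # the minority class actually gets predicted
```

An explicit weighting such as `class_weight={0: 1.0, 1: 99.0}` achieves a similar effect and makes the misclassification costs fully explicit.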