Multi-Class Classification

Multi-class classification refers to classification problems where data must be assigned to exactly one of three or more possible classes. For example, if we try to assign the correct fruit to images of apples, pears, and oranges, we are dealing with a multi-class classification problem.

Strategies for Multi-Class Classification

There are various strategies and methods for approaching a multi-class classification problem. Some methods, such as support vector classifiers (SVC) or a simple perceptron, are designed for binary classification problems but can also be used for multi-class classification by reducing the problem to several binary classifications (see One-vs-All and the following sections). Other methods, for example decision trees, k-nearest neighbors, naive Bayes, and neural networks with a softmax output layer, are inherently capable of handling more than two classes.

Below, we briefly introduce several strategies to convert a multi-class classification problem into multiple binary classification problems.

One-vs-All (OvA)

In "One-vs-All" (OvA), a binary classifier is trained for each class, which should distinguish that class from all other classes. At the end, the class is chosen whose binary classifier "fires". In practice, multiple classifiers (or even none) can provide a positive result. Therefore, a subsequent process is necessary to make a final decision, such as by using classification probabilities (How confident are the classifiers in their classification?) to reach a definitive conclusion.

Potential disadvantages:

- Each binary classifier is trained on an imbalanced problem (one class against all remaining classes combined).
- The scores of the individual classifiers are not necessarily comparable, which complicates the final decision.
- For K classes, K classifiers must each be trained on the entire dataset.
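
As a minimal sketch, the OvA strategy can be tried out with scikit-learn's OneVsRestClassifier; the digits dataset and the LinearSVC base estimator are merely illustrative choices, not part of the strategy itself:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # One LinearSVC per digit class; the final prediction is the class whose
    # binary classifier returns the highest decision score.
    ova = OneVsRestClassifier(LinearSVC(dual=False))
    ova.fit(X_train, y_train)
    print("Accuracy:", ova.score(X_test, y_test))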

One-vs-One (OvO)

In "One-vs-One" (OvO), a binary classifier is trained for each pair of classes, which should distinguish between the two classes. At the end, the class is chosen for which the most classifiers vote. In practice, multiple classes may receive the same number of "votes", so a quantification of confidence in the classifier's decision is needed to make a definitive choice.

Potential disadvantages:

- For K classes, K·(K-1)/2 binary classifiers are needed, so the number of classifiers grows quadratically with the number of classes.
- Ties in the voting can occur and must be resolved with additional confidence information.
- Every prediction requires evaluating all pairwise classifiers.
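
A corresponding sketch with scikit-learn's OneVsOneClassifier, again using the digits dataset only as an example:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsOneClassifier
    from sklearn.svm import LinearSVC

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # For the 10 digit classes, 10 * 9 / 2 = 45 binary classifiers are trained;
    # the predicted class is the one that wins the most pairwise "duels".
    ovo = OneVsOneClassifier(LinearSVC(dual=False))
    ovo.fit(X_train, y_train)
    print("Accuracy:", ovo.score(X_test, y_test))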

Error-Correcting Output Code

The different classes are encoded using binary sequences of a fixed length (codewords), where the number of bits in the binary sequences is usually greater than the number of classes. A binary classifier is trained for each bit in the binary sequence. A new object is assigned to the class whose binary code is closest to the predicted binary sequence. The motivation behind this strategy is that if the binary sequence length is greater than the number of classes, errors from individual classifiers can be corrected by the others. For this error-correcting function to be effective, it is important that the various classifiers are uncorrelated, i.e., learn different features of the objects. Uncorrelated classifiers are promoted, for example, by a clever choice of codewords.

Potential disadvantages:

- Considerably more binary classifiers than classes may have to be trained.
- The binary subproblems defined by the codewords can be artificial and therefore hard to learn.
- Choosing codewords that lead to uncorrelated classifiers is non-trivial.
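
As a minimal sketch, scikit-learn's OutputCodeClassifier implements this strategy with randomly generated codewords; the code_size value below is only an illustrative choice:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OutputCodeClassifier
    from sklearn.svm import LinearSVC

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # code_size sets the codeword length relative to the number of classes;
    # values above 1 add redundant bits that can absorb individual classifier errors.
    ecoc = OutputCodeClassifier(LinearSVC(dual=False), code_size=2.0, random_state=0)
    ecoc.fit(X_train, y_train)
    print("Accuracy:", ecoc.score(X_test, y_test))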

Hierarchical Classification

The idea behind hierarchical classification is to construct a tree whose leaves represent the different classes. One possible structure could be to distinguish between class A and the remaining classes at the first level, then between class B and the remaining classes at the second level, and so on. This would correspond to a kind of OvA with hierarchy. However, other tree structures are also conceivable, and the specific use case may suggest a particular tree structure: For example, if we were classifying different dog and cat breeds, we might first distinguish between dogs and cats at the first level, and then identify the specific breed at the second level.
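
There is no ready-made hierarchical classifier in scikit-learn; the following hypothetical sketch only illustrates the two-level idea from the dogs-and-cats example (the arrays X, y_species, and y_breed are assumed to be given, and LogisticRegression is just a placeholder base model):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_hierarchy(X, y_species, y_breed):
        # Level 1: dog vs. cat; level 2: one breed classifier per species.
        top = LogisticRegression(max_iter=1000).fit(X, y_species)
        fine = {s: LogisticRegression(max_iter=1000).fit(X[y_species == s], y_breed[y_species == s])
                for s in np.unique(y_species)}
        return top, fine

    def predict_hierarchy(top, fine, X):
        species = top.predict(X)                                  # level 1 decision
        return np.array([fine[s].predict(row.reshape(1, -1))[0]   # level 2 decision
                         for s, row in zip(species, X)])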

Evaluation of Multi-Class Classifiers

The confusion matrix shows the relationship between the class memberships predicted by the classifier and the actual class memberships. The example below shows the confusion matrix of a classifier for the MNIST dataset of handwritten digits 0-9; the columns contain the predicted class memberships and the rows the actual ones.

              0      1      2      3      4      5      6      7      8      9   (Predicted)
Actual 0    970      1      1      0      0      2      2      1      3      0
Actual 1      0   1124      2      3      0      2      3      0      1      0
Actual 2      6      0    999      3      4      0      4      8      7      1
Actual 3      0      0     11    973      0      7      0      8      7      4
Actual 4      1      0      1      0    957      0      5      0      2     16
Actual 5      5      0      0     11      2    859      7      2      4      2
Actual 6      7      3      1      0      5      5    935      0      2      0
Actual 7      1      3     21      1      2      0      0    989      4      7
Actual 8      3      1      4      3      5      6      3      4    935     10
Actual 9      6      5      2     12     10      3      1      5      4    961

The number in the i-th row and j-th column of the matrix indicates how many times the classifier predicted the j-th class for objects that actually belong to class i. Specifically, the total number of correctly classified objects is obtained by summing the diagonal entries.
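
Assuming a trained classifier clf and a held-out test set (X_test, y_test) as in the sketches above, such a matrix can be computed, for example, with scikit-learn:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_pred = clf.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)   # cm[i, j]: objects of class i predicted as class j
    print(cm)
    print("Correctly classified:", np.trace(cm), "of", cm.sum())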

As with binary classifications, precision and sensitivity are important metrics to quantify the performance of a classifier. However, in the multi-class case, there is not just one value for precision and sensitivity, but the two quantities are defined per class. As a scalar measure that accounts for both precision and sensitivity, the F1-score is often considered. If one wishes to weight precision and sensitivity differently, the Fbeta-score can be used as a generalization of the F1-score.

Since precision, sensitivity, and F1-score (or more generally Fbeta-score) are defined per class, we need to average the values to derive a single scalar metric for the classifier's performance. Here, we distinguish between the macro average and the weighted average.

For example, if the precision for apples is 80%, for oranges 70%, and for pears 90%, the macro average of the precision is (80% + 70% + 90%)/3 = 80%. The weighted average is calculated by weighting the metric for each class with the number of observations in that class. If we have 20 apples, 30 oranges, and 50 pears, and the precisions are as stated, the weighted average is (20x80% + 30x70% + 50x90%)/(20+30+50) = 82%. The macro average and weighted average for sensitivity and the F1-score are computed analogously.
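
With scikit-learn, the per-class values and the different averages can be computed directly; y_test and y_pred again denote actual and predicted labels as above, and the beta value is only an illustrative choice:

    from sklearn.metrics import classification_report, fbeta_score, precision_score

    print(precision_score(y_test, y_pred, average=None))        # one precision per class
    print(precision_score(y_test, y_pred, average="macro"))     # unweighted mean over classes
    print(precision_score(y_test, y_pred, average="weighted"))  # weighted by class size
    print(fbeta_score(y_test, y_pred, beta=2, average="macro")) # beta=2 weights sensitivity more heavily
    print(classification_report(y_test, y_pred))                # per-class overview incl. both averages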

The accuracy, by the way, is the proportion of objects correctly classified by the classifier across all classes.

Handling Imbalanced Datasets

In a dataset for a multi-class classification problem, if not all classes contain (approximately) the same number of objects, but instead some classes contain significantly more than others, we refer to it as an imbalanced dataset. An imbalanced dataset can pose challenges when training a classifier because classes with more objects are often favored. For instance, if 99% of the observations in a dataset belong to class A and only 1% to class B, a classifier that assigns class A to every object already achieves an accuracy of 99%. Using a naive evaluation strategy, such a classifier would likely be rated better than many classifiers that occasionally select class B.

The following strategies are commonly used to handle imbalanced datasets:

- Oversampling the under-represented classes, for example by duplicating observations or generating synthetic ones.
- Undersampling the over-represented classes.
- Weighting the classes during training so that errors on rare classes are penalized more heavily.
- Evaluating with metrics that account for the imbalance, such as the macro-averaged F1-score, instead of plain accuracy.
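
As a minimal sketch of class weighting with scikit-learn (the imbalanced arrays X and y are assumed to be given; over- and undersampling would typically be done with a separate library such as imbalanced-learn):

    from sklearn.linear_model import LogisticRegression

    # class_weight="balanced" reweights each class inversely proportional to its
    # frequency, so misclassifying a rare class costs more during training.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X, y)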
