Metrics for Multilabel Classification
Most supervised learning algorithms focus on either binary or multiclass classification. Sometimes, however, we have a dataset in which each observation carries multiple labels. In that case we need different metrics to evaluate the algorithms, because a multilabel prediction has the additional notion of being partially correct.
Let’s say that we have 4 observations for which the actual and predicted label sets are given as follows:
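The original example data is not reproduced here, so as an illustration, suppose the four observations are encoded as binary indicator vectors over four possible labels (these values are hypothetical, not the original example):

```python
import numpy as np

# Hypothetical indicator encoding: 4 observations x 4 labels,
# where 1 means the label applies and 0 means it does not.
y_true = np.array([[0, 1, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 0, 1]])
```

The same encoding is assumed in the code sketches for the metrics below.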
There are multiple example-based metrics that can be used. We will look at a couple of them below.

Exact Match Ratio
One trivial way around this is simply to ignore partially correct predictions (treat them as incorrect) and extend the accuracy used in the single-label case to multilabel prediction.
\begin{equation} \text{Exact Match Ratio, MR} = \frac{1}{n} \sum_{i=1}^{n} I(y_{i} = \hat{y_{i}}) \end{equation}
where $I$ is the indicator function. Clearly, a disadvantage of this measure is that it does not distinguish between completely incorrect and partially correct predictions, which might be considered harsh.
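A minimal NumPy sketch of this formula, using hypothetical indicator-encoded data (the arrays are illustrative, not from the original example):

```python
import numpy as np

# Hypothetical data: 4 observations, 4 labels, indicator encoding.
y_true = np.array([[0, 1, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 0, 1]])

def exact_match_ratio(y_true, y_pred):
    """Fraction of instances whose entire label set is predicted exactly."""
    return np.all(y_true == y_pred, axis=1).mean()

print(exact_match_ratio(y_true, y_pred))  # 0.25 (only the second row matches exactly)
```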

0/1 Loss
This metric is simply $1 - \text{Exact Match Ratio}$: the proportion of instances whose actual label set is not exactly equal to the predicted label set.
\begin{equation} \text{0/1 Loss} = \frac{1}{n} \sum_{i=1}^{n} I\left(y_{i} \neq \hat{y_{i}} \right) \end{equation}
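Since 0/1 loss is the complement of the exact match ratio, a sketch (again on hypothetical indicator-encoded data) is one line:

```python
import numpy as np

# Hypothetical data: 4 observations, 4 labels, indicator encoding.
y_true = np.array([[0, 1, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 0, 1]])

def zero_one_loss_multilabel(y_true, y_pred):
    """Fraction of instances with at least one mislabeled entry."""
    return np.any(y_true != y_pred, axis=1).mean()

print(zero_one_loss_multilabel(y_true, y_pred))  # 0.75 = 1 - exact match ratio
```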

Accuracy
Accuracy for each instance is defined as the proportion of the predicted correct labels to the total number (predicted and actual) of labels for that instance. Overall accuracy is the average across all instances. It is less ambiguously referred to as the Hamming score.
\begin{equation} \text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert y_{i} \cap \hat{y_{i}}\rvert}{\lvert y_{i} \cup \hat{y_{i}}\rvert} \end{equation}
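A sketch of the Hamming score on hypothetical indicator-encoded data, treating each row as a set via element-wise logical operations:

```python
import numpy as np

# Hypothetical data: 4 observations, 4 labels, indicator encoding.
y_true = np.array([[0, 1, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 0, 1]])

def hamming_score(y_true, y_pred):
    """Per-instance Jaccard similarity |y ∩ ŷ| / |y ∪ ŷ|, averaged over instances."""
    intersection = np.logical_and(y_true, y_pred).sum(axis=1)
    union = np.logical_or(y_true, y_pred).sum(axis=1)
    # Convention: an instance with no actual and no predicted labels counts as fully correct.
    return np.where(union == 0, 1.0, intersection / np.maximum(union, 1)).mean()

print(hamming_score(y_true, y_pred))  # (1/3 + 1 + 1/4 + 0) / 4 ≈ 0.3958
```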

Hamming Loss
It reports how many times, on average, the relevance of an example to a class label is incorrectly predicted. Hamming loss therefore accounts for both prediction errors (an incorrect label is predicted) and missing errors (a relevant label is not predicted), normalized over the total number of classes and the total number of examples.
\begin{equation} \text{Hamming Loss} = \frac{1}{n L} \sum_{i=1}^{n}\sum_{j=1}^{L} I\left( y_{i}^{j} \neq \hat{y}_{i}^{j} \right) \end{equation}
where $I$ is the indicator function. Ideally, we would expect the Hamming loss to be 0, which would imply no error; in practice, the smaller the Hamming loss, the better the performance of the learning algorithm.
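Because the formula averages label-wise disagreements over all $n \cdot L$ entries, the sketch (on hypothetical indicator-encoded data) reduces to a mean over the mismatch matrix:

```python
import numpy as np

# Hypothetical data: 4 observations, 4 labels, indicator encoding.
y_true = np.array([[0, 1, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 0, 1]])

def hamming_loss_multilabel(y_true, y_pred):
    """Fraction of label entries (over all instances and classes) predicted wrongly."""
    return (y_true != y_pred).mean()

print(hamming_loss_multilabel(y_true, y_pred))  # 7 mismatched entries / 16 = 0.4375
```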

Recall
It is the proportion of correctly predicted labels to the total number of actual labels, averaged over all instances.
\begin{equation} \text{Recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert y_{i} \cap \hat{y_{i}}\rvert}{\lvert y_{i}\rvert} \end{equation}
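A sketch of instance-averaged recall on hypothetical indicator-encoded data:

```python
import numpy as np

# Hypothetical data: 4 observations, 4 labels, indicator encoding.
y_true = np.array([[0, 1, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 0, 1]])

def recall_multilabel(y_true, y_pred):
    """Per-instance |y ∩ ŷ| / |y|, averaged over instances."""
    intersection = np.logical_and(y_true, y_pred).sum(axis=1)
    # Guard against instances with no actual labels (convention: recall 0 there).
    return (intersection / np.maximum(y_true.sum(axis=1), 1)).mean()

print(recall_multilabel(y_true, y_pred))  # (1/2 + 1 + 1/3 + 0) / 4 ≈ 0.4583
```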

Precision
It is the proportion of correctly predicted labels to the total number of predicted labels, averaged over all instances.
\begin{equation} \text{Precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert y_{i} \cap \hat{y_{i}}\rvert}{\lvert \hat{y_{i}}\rvert} \end{equation}
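The precision sketch mirrors recall, dividing by the predicted label count instead (again on hypothetical indicator-encoded data):

```python
import numpy as np

# Hypothetical data: 4 observations, 4 labels, indicator encoding.
y_true = np.array([[0, 1, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 0, 1]])

def precision_multilabel(y_true, y_pred):
    """Per-instance |y ∩ ŷ| / |ŷ|, averaged over instances."""
    intersection = np.logical_and(y_true, y_pred).sum(axis=1)
    # Guard against instances with no predicted labels (convention: precision 0 there).
    return (intersection / np.maximum(y_pred.sum(axis=1), 1)).mean()

print(precision_multilabel(y_true, y_pred))  # (1/2 + 1 + 1/2 + 0) / 4 = 0.5
```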

F1-Measure
The definitions of precision and recall naturally lead to the following definition of the F1-measure (the harmonic mean of precision and recall):
\begin{equation} F_{1} = \frac{1}{n} \sum_{i=1}^{n} \frac{2 \lvert y_{i} \cap \hat{y_{i}}\rvert}{\lvert y_{i}\rvert + \lvert \hat{y_{i}}\rvert} \end{equation}
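A sketch of the instance-averaged F1-measure on the same hypothetical indicator-encoded data:

```python
import numpy as np

# Hypothetical data: 4 observations, 4 labels, indicator encoding.
y_true = np.array([[0, 1, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 0, 1]])

def f1_multilabel(y_true, y_pred):
    """Per-instance 2|y ∩ ŷ| / (|y| + |ŷ|), averaged over instances."""
    intersection = np.logical_and(y_true, y_pred).sum(axis=1)
    denom = y_true.sum(axis=1) + y_pred.sum(axis=1)
    # Guard against instances with no actual and no predicted labels.
    return (2 * intersection / np.maximum(denom, 1)).mean()

print(f1_multilabel(y_true, y_pred))  # (1/2 + 1 + 2/5 + 0) / 4 = 0.475
```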
One can also compute accuracy, Hamming loss, and the other metrics with scikit-learn's built-in functions.
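As a sketch using the same hypothetical data: scikit-learn's `accuracy_score` computes subset accuracy (the exact match ratio) on multilabel indicator input, `hamming_loss` matches the formula above, and `average='samples'` gives the instance-averaged variants of precision, recall, F1, and the Jaccard-based Hamming score:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, hamming_loss, jaccard_score,
                             precision_score, recall_score, f1_score)

# Hypothetical data: 4 observations, 4 labels, indicator encoding.
y_true = np.array([[0, 1, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 0, 1]])

print(accuracy_score(y_true, y_pred))                          # exact match ratio: 0.25
print(hamming_loss(y_true, y_pred))                            # 0.4375
print(jaccard_score(y_true, y_pred, average='samples'))        # Hamming score ≈ 0.3958
print(precision_score(y_true, y_pred, average='samples', zero_division=0))  # 0.5
print(recall_score(y_true, y_pred, average='samples', zero_division=0))     # ≈ 0.4583
print(f1_score(y_true, y_pred, average='samples', zero_division=0))         # 0.475
```

`zero_division=0` silences the warning for instances where a ratio is undefined (e.g. an instance with no correctly predicted labels contributes 0).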
Sorower (2010) gives a nice overview of other metrics that can be used.