Metrics for Multilabel Classification

Most supervised learning algorithms focus on binary or multi-class classification. Sometimes, however, each observation in a dataset carries multiple labels at once. Evaluating such multi-label predictions calls for different metrics, because a prediction can now be partially correct.

Let's say that we have 4 observations and 3 labels, with the actual and predicted values given as binary indicator matrices (rows are observations, columns are labels):

import numpy as np

y_true = np.array([[0,1,0],
                   [0,1,1],
                   [1,0,1],
                   [0,0,1]])

y_pred = np.array([[0,1,1],
                   [0,1,1],
                   [0,1,0],
                   [0,0,0]])

There are multiple example-based metrics available; we will look at a couple of them below.

 
Exact Match Ratio (MR) is the strictest metric: it counts a prediction as correct only if every label of an observation is predicted correctly.

  MR = np.all(y_pred == y_true, axis=1).mean()
  #0.25
  
 
0/1 Loss counts an observation as an error if any of its labels is mispredicted:

  zero_one_loss = np.any(y_true != y_pred, axis=1).mean()
  #0.75
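Since every observation is either an exact match or contains at least one wrong label, the 0/1 loss is always the complement of the exact match ratio; a quick sanity check on the values above:

  # MR counts fully correct rows, zero_one_loss counts all other rows,
  # so the two fractions must sum to 1
  assert np.isclose(MR + zero_one_loss, 1.0)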
  
 
Accuracy per observation is the proportion of correctly predicted labels among all labels that are active in either the true or the predicted set, averaged over observations:

  def Accuracy(y_true, y_pred):
      temp = 0
      for i in range(y_true.shape[0]):
          # correctly predicted labels / labels active in truth or prediction
          temp += sum(np.logical_and(y_true[i], y_pred[i])) / sum(np.logical_or(y_true[i], y_pred[i]))
      return temp / y_true.shape[0]

  Accuracy(y_true, y_pred)
  #0.375
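This per-observation accuracy is the intersection-over-union (Jaccard similarity) of the true and predicted label sets, so scikit-learn's jaccard_score with average='samples' should reproduce the 0.375 above:

  from sklearn.metrics import jaccard_score

  # mean per-sample intersection-over-union of the label sets
  print(jaccard_score(y_true, y_pred, average='samples'))
  #0.375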
  
Hamming Loss is the fraction of individual labels that are predicted incorrectly, averaged over all observations and labels:

  def Hamming_Loss(y_true, y_pred):
      temp = 0
      for i in range(y_true.shape[0]):
          # labels in this row where prediction and truth disagree
          temp += np.size(y_true[i] == y_pred[i]) - np.count_nonzero(y_true[i] == y_pred[i])
      return temp / (y_true.shape[0] * y_true.shape[1])

  Hamming_Loss(y_true, y_pred)
  #0.4166666666666667
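Because the Hamming loss is simply the fraction of label positions where truth and prediction disagree, the loop can also be collapsed into a single NumPy expression:

  # elementwise disagreement, averaged over all observations and labels
  print((y_true != y_pred).mean())
  #0.4166666666666667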
  
Recall per observation is the proportion of correctly predicted labels among the actual labels, averaged over observations:

  def Recall(y_true, y_pred):
      temp = 0
      for i in range(y_true.shape[0]):
          if sum(y_true[i]) == 0:
              continue  # no true labels: this row contributes 0
          temp += sum(np.logical_and(y_true[i], y_pred[i])) / sum(y_true[i])
      return temp / y_true.shape[0]

  Recall(y_true, y_pred)
  #0.5
   
Precision per observation is the proportion of correctly predicted labels among the predicted labels, averaged over observations:

  def Precision(y_true, y_pred):
      temp = 0
      for i in range(y_true.shape[0]):
          if sum(y_pred[i]) == 0:
              continue  # nothing predicted: this row contributes 0
          temp += sum(np.logical_and(y_true[i], y_pred[i])) / sum(y_pred[i])
      return temp / y_true.shape[0]

  Precision(y_true, y_pred)
  #0.375
   
F1-Measure combines the two as the per-observation harmonic mean of precision and recall, averaged over observations:

  def F1Measure(y_true, y_pred):
      temp = 0
      for i in range(y_true.shape[0]):
          if (sum(y_true[i]) == 0) and (sum(y_pred[i]) == 0):
              continue  # both label sets empty: nothing to score
          temp += (2 * sum(np.logical_and(y_true[i], y_pred[i]))) / (sum(y_true[i]) + sum(y_pred[i]))
      return temp / y_true.shape[0]

  print(F1Measure(y_true, y_pred))
  #0.41666666666666663
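For completeness, here is a compact vectorized sketch of the same three metrics. Rows with an empty denominator contribute 0, which matches the loops above (they skip such rows but still divide by the total number of observations):

  tp = np.logical_and(y_true, y_pred).sum(axis=1)  # true positives per row
  t = y_true.sum(axis=1)                           # actual labels per row
  p = y_pred.sum(axis=1)                           # predicted labels per row

  # safe division: rows with a zero denominator stay at 0
  def safe_mean(num, den):
      return np.divide(num, den, out=np.zeros(len(num), dtype=float),
                       where=den > 0).mean()

  print(safe_mean(tp, t))          # Recall: 0.5
  print(safe_mean(tp, p))          # Precision: 0.375
  print(safe_mean(2 * tp, t + p))  # F1: 0.41666666666666663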
   

One can also use scikit-learn's functions to compute accuracy, Hamming loss, and the other metrics:

import sklearn.metrics

print('Exact Match Ratio: {0}'.format(sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)))
#Exact Match Ratio: 0.25

print('Hamming loss: {0}'.format(sklearn.metrics.hamming_loss(y_true, y_pred))) 
#Hamming loss: 0.4166666666666667

#"samples" applies only to multilabel problems. It does not calculate a per-class measure, instead calculating the metric over the true and predicted classes 
#for each sample in the evaluation data, and returning their (sample_weight-weighted) average.

print('Precision: {0}'.format(sklearn.metrics.precision_score(y_true=y_true, y_pred=y_pred, average='samples')))
#Precision: 0.375

print('Recall: {0}'.format(sklearn.metrics.recall_score(y_true=y_true, y_pred=y_pred, average='samples')))
#Recall: 0.5

print('F1 Measure: {0}'.format(sklearn.metrics.f1_score(y_true=y_true, y_pred=y_pred, average='samples'))) 
#F1 Measure: 0.41666666666666663
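scikit-learn's classification_report also accepts multilabel indicator matrices directly and prints per-label precision, recall, and F1 alongside the sample-averaged scores; a small sketch (the target_names below are placeholder names for the three label columns, not part of the original data):

print(sklearn.metrics.classification_report(y_true, y_pred,
                                            target_names=['label_0', 'label_1', 'label_2'],
                                            zero_division=0))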

Sorower (2010) gives a nice overview of other metrics that can be used: