My data has about 70 classes, and I am using LightGBM to predict the correct class label.

In R, I would like to have a customised "metric" function that evaluates whether the top 3 predictions from LightGBM cover the true label.

The example linked here is a useful starting point:

import numpy as np
from sklearn.metrics import f1_score

def lgb_f1_score(y_hat, data):
    y_true = data.get_label()
    y_hat = np.round(y_hat)  # scikit-learn's f1_score doesn't accept probabilities
    return 'f1', f1_score(y_true, y_hat), True

However, I don't know the dimensionality of the arguments passed to the function. The data seem to be shuffled for some reason.

2 Answers

Scikit-learn implementation

import numpy as np
from sklearn.metrics import f1_score

def lgb_f1_score(y_true, y_pred):
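    # y_pred is assumed to be flat and grouped by class; reshape it to
    # (num_class, num_data) and pick the most likely class per sample.
    # Newer LightGBM versions may already pass a 2D (num_data, num_class) array.
    preds = y_pred.reshape(len(np.unique(y_true)), -1)
    preds = preds.argmax(axis=0)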
    return 'f1', f1_score(y_true, preds,average='weighted'), True
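
To see why the reshape uses (num_class, -1) and argmax over axis 0, here is a small check on made-up numbers. It assumes the predictions arrive as a flat array grouped by class (all class-0 scores first, then class 1, and so on); the toy array and variable names are illustrative only.

```python
import numpy as np

# Toy scores for 4 samples and 3 classes, one row per sample.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])
num_data, num_class = probs.shape

# Assumed flat layout: grouped by class, i.e. flat[c * num_data + i]
# is the score of class c for sample i.
flat = probs.T.reshape(-1)

# Same two steps as in lgb_f1_score above.
preds = flat.reshape(num_class, -1).argmax(axis=0)
print(preds)  # [0 1 2 0]
```

If the array were instead laid out sample by sample, this reshape would mix up rows, which may be what makes the data look shuffled.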
Answered 2018-08-05T04:02:05.613

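After reading the docs for lgb.train and lgb.cv, I had to write a separate function, get_ith_pred, and then call it repeatedly inside lgb_f1_score.

The function's docstring explains how it works. I have used the same argument names as in the LightGBM docs. This works for any number of classes, but not for binary classification. In the binary case, preds is a 1D array containing the probability of the positive class.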

import numpy as np
from sklearn.metrics import f1_score

def get_ith_pred(preds, i, num_data, num_class):
    """
    preds: 1D NumPy array
        A 1D NumPy array containing predicted probabilities. Has shape
        (num_data * num_class,). So, for 3-class classification with
        100 rows of data in your training set, preds has shape (300,),
        i.e. (100 * 3,).
    i: int
        The row/sample in your training data you wish to calculate
        the prediction for.
    num_data: int
        The number of rows/samples in your training data
    num_class: int
        The number of classes in your classification task.
        Must be greater than 2.
    
    
    LightGBM docs tell us that to get the probability of class 0 for 
    the 5th row of the dataset we do preds[0 * num_data + 5].
    For class 1 prediction of 7th row, do preds[1 * num_data + 7].
    
    sklearn's f1_score(y_true, y_pred) expects y_pred to be of the form
    [0, 1, 1, 1, 1, 0...] and not probabilities.
    
    This function translates preds into the form sklearn's f1_score 
    understands.
    """
    # Only works for multiclass classification
    assert num_class > 2

    preds_for_ith_row = [preds[class_label * num_data + i]
                        for class_label in range(num_class)]
    
    # The element with the highest probability is predicted
    return np.argmax(preds_for_ith_row)

    
def lgb_f1_score(preds, train_data):
    y_true = train_data.get_label()

    num_data = len(y_true)
    num_class = 70  # number of classes in the question's data
    
    y_pred = []
    for i in range(num_data):
        ith_pred = get_ith_pred(preds, i, num_data, num_class)
        y_pred.append(ith_pred)
    
    return 'f1', f1_score(y_true, y_pred, average='weighted'), True
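
The same indexing can be used for the top-3 check asked about in the question. The following is only a sketch in Python, not an R solution: `top_k_hits` and `lgb_top3` are hypothetical names, and it assumes preds is either flat and grouped by class (as described in the docstring above) or already a 2D array of shape (num_data, num_class).

```python
import numpy as np


def top_k_hits(preds, y_true, num_class, k=3):
    """Fraction of rows whose true label is among the k highest scores.

    Assumes preds is either a flat array grouped by class
    (preds[c * num_data + i]) or a 2D array of shape (num_data, num_class).
    """
    y_true = np.asarray(y_true).astype(int)
    preds = np.asarray(preds)
    if preds.ndim == 1:
        # Flat, class-major layout -> (num_data, num_class)
        preds = preds.reshape(num_class, -1).T
    # Column indices of the k largest scores in each row
    # (their order within the k does not matter here).
    top_k = np.argpartition(preds, -k, axis=1)[:, -k:]
    return float(np.mean((top_k == y_true[:, None]).any(axis=1)))


def lgb_top3(preds, train_data):
    # Same (name, value, is_higher_better) return convention as lgb_f1_score.
    y_true = train_data.get_label()
    num_class = 70  # assumption: the 70 classes from the question
    return 'top3', top_k_hits(preds, y_true, num_class, k=3), True
```

It could then be passed the same way as lgb_f1_score, for example through the feval argument of lgb.train or lgb.cv.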
Answered 2021-03-01T16:11:12.230