I'm competing in a Kaggle competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques#evaluation), which states that my model will be evaluated as follows:
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
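Written out as code, my reading of that metric is the following (made-up prices, purely illustrative):
import numpy as np

# toy example values, not competition data
truth = np.array([200000.0, 350000.0, 90000.0])   # observed sale prices
preds = np.array([210000.0, 300000.0, 100000.0])  # model predictions

# RMSE between the logs of the predictions and the logs of the truth
np.sqrt(np.mean((np.log(preds) - np.log(truth)) ** 2))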
I couldn't find this in the docs (it's basically RMSE(log(truth), log(prediction))), so I went about writing a custom scorer:
import numpy as np
from sklearn.metrics import make_scorer

def custom_loss(truth, preds):
    truth_logs = np.log(truth)
    print(truth_logs)
    preds_logs = np.log(preds)
    # sum of squared differences between the logs
    numerator = np.sum(np.square(truth_logs - preds_logs))
    return np.sum(np.sqrt(numerator / len(truth)))

custom_scorer = make_scorer(custom_loss, greater_is_better=False)
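As a quick sanity check, I can call the loss directly on a couple of made-up arrays (toy numbers, not the competition data):
# returns a single positive float
truth_toy = np.array([100000.0, 200000.0, 150000.0])
preds_toy = np.array([110000.0, 190000.0, 160000.0])
print(custom_loss(truth_toy, preds_toy))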
Two questions:
1) Should my custom loss function return a numpy array of scores (one per (truth, prediction) pair)? Or should it return the total loss over those pairs as a single number?
I looked at the docs, but they weren't very helpful about what my custom loss function should return.
2) When I run:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, KFold

xgb_model = xgb.XGBRegressor()
params = {"max_depth": [3, 4], "learning_rate": [0.05],
          "n_estimators": [1000, 2000], "n_jobs": [8],
          "subsample": [0.8], "random_state": [42]}
grid_search_cv = GridSearchCV(xgb_model, params, scoring=custom_scorer,
                              n_jobs=8,
                              cv=KFold(n_splits=10, shuffle=True, random_state=42),
                              verbose=2)
grid_search_cv.fit(X, y)
grid_search_cv.best_score_
I got back:
-0.12137097567803554
This was very surprising. Given that my loss function uses RMSE(log(truth) - log(prediction)), I shouldn't be getting a negative best_score_.
Any idea why it's negative?
Thanks!