I'm trying to create a simple model in lightgbm using two features, one categorical and one a distance. I'm following a tutorial (https://sefiks.com/2018/10/13/a-gentle-introduction-to-lightgbm-for-applied-machine-learning/) which states that even after LabelEncoding I still need to tell lightgbm that my encoded feature is categorical in nature. However, when I try to do that, I get the following series of warning messages:
UserWarning: Using categorical_feature in Dataset.
warnings.warn('Using categorical_feature in Dataset.')
UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is ['type']
warnings.warn('New categorical_feature is {}'.format(sorted(list(categorical_feature))))
categorical_feature in param dict is overridden.
warnings.warn('categorical_feature in param dict is overridden.')
What I want to know is whether lightgbm actually does understand that the column is categorical in nature. It seems like it does, but then I'm not sure why the tutorial explicitly states that it doesn't. My code is below:
import pandas as pd
import lightgbm as lgb
import sklearn.preprocessing as prep
import sklearn.model_selection as mls

trainDataProc = pd.read_csv('trainDataPrepared.csv', header=0)

# Label-encode every object-typed (string) column in place.
le = prep.LabelEncoder()
num_columns = trainDataProc.shape[1]
for i in range(0, num_columns):
    column_name = trainDataProc.columns[i]
    column_type = trainDataProc[column_name].dtypes
    if column_type == 'object':
        le.fit(trainDataProc[column_name])
        encoded_feature = le.transform(trainDataProc[column_name])
        trainDataProc[column_name] = pd.DataFrame(encoded_feature)
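# (Alternative sketch, not what I actually ran: cast the object columns to the
#  pandas 'category' dtype instead of label-encoding them; as far as I
#  understand, LightGBM's default categorical_feature='auto' then detects
#  them on its own.)
# for col in trainDataProc.select_dtypes(include='object').columns:
#     trainDataProc[col] = trainDataProc[col].astype('category')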
# Prepare train X and Y column names.
trainColumnsX = ['type', 'dist']
cat_feat=['type']
trainColumnsY = ['scalar']
# Perform K-fold split.
kfold = mls.KFold(n_splits=5, shuffle=True, random_state=0)
result = next(kfold.split(trainDataProc), None)
train = trainDataProc.iloc[result[0]]
test = trainDataProc.iloc[result[1]]
# Train model via lightGBM.
lgbTrain = lgb.Dataset(train[trainColumnsX], label=train[trainColumnsY],
                       categorical_feature=cat_feat)
lgbEval = lgb.Dataset(test[trainColumnsX], label=test[trainColumnsY])
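# (Side note, commented out: a variant I've seen suggested rather than tested,
#  passing the same categorical_feature for the validation data and linking it
#  back to the training Dataset via 'reference' so the feature bins line up.)
# lgbEval = lgb.Dataset(test[trainColumnsX], label=test[trainColumnsY],
#                       categorical_feature=cat_feat, reference=lgbTrain)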
# Model parameters.
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'mae'},
    'num_leaves': 25,
    'learning_rate': 0.0001,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
# Set up training.
gbm = lgb.train(params,
                lgbTrain,
                num_boost_round=200,
                valid_sets=lgbEval,
                early_stopping_rounds=50)
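For what it's worth, one way I can think of checking this (a sketch only, based on my assumption that LightGBM marks categorical splits in the dumped model with decision_type '==') is to walk the trained booster's trees and count those splits per feature:

# Sketch: count '==' splits per feature index in the dumped model.
model_json = gbm.dump_model()

def count_cat_splits(node, counts):
    # Internal nodes carry 'split_feature'; leaves do not.
    if 'split_feature' in node:
        if node.get('decision_type') == '==':
            counts[node['split_feature']] = counts.get(node['split_feature'], 0) + 1
        count_cat_splits(node['left_child'], counts)
        count_cat_splits(node['right_child'], counts)

cat_split_counts = {}
for tree in model_json['tree_info']:
    count_cat_splits(tree['tree_structure'], cat_split_counts)
print(cat_split_counts)  # a non-zero count for feature 0 ('type') would suggest categorical splits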