I'm new to fastText and have read the tutorial: https://fasttext.cc/docs/en/supervised-tutorial.html.
I downloaded the sample data and noticed that the labels are plain strings:
$ head cooking.stackexchange.txt
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
I also ran the training and test code from the tutorial:
>>> model = fasttext.train_supervised(input="cooking.train", lr=1.0)
Read 0M words
Number of words: 9012
Number of labels: 734
Progress: 100.0% words/sec/thread: 81469 lr: 0.000000 loss: 6.405640 eta: 0h0m
>>> model.test("cooking.valid")
(3000L, 0.563, 0.245)
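My question is: why are the labels not encoded first, for example with sklearn's LabelEncoder? I ran the example as is and it works fine, which confuses me.
[Update]
In my opinion, the code should look something like this: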
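import fasttext
from sklearn import preprocessing

# load_dataset() is not defined here; it stands for whatever reads the texts and labels
texts_train, labels_train = load_dataset()
# Map each string label to an integer id
label_encoder = preprocessing.LabelEncoder()
labels_train = label_encoder.fit_transform(labels_train)
# Write one example per line, with the encoded label after the text
with open('cooking.train.2', 'w') as f:
    for text, label in zip(texts_train, labels_train):
        f.write('%s __label__%d\n' % (text, label))
model = fasttext.train_supervised('cooking.train.2', lr=1.0)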
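
To make the comparison concrete, here is a minimal sketch of the same encoding step applied directly to lines in the tutorial's format, without sklearn or fastText. The helper names `split_labels` and `encode_lines` are my own (hypothetical, not part of any library), and I'm assuming that every whitespace-separated token starting with `__label__` is a label. Ids are assigned in sorted order of the label strings, which is how `LabelEncoder` orders its classes. Note that the encoded label is still written out as a string token such as `__label__3`.

```python
def split_labels(line, prefix="__label__"):
    """Split one line into (list of label names, remaining text)."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(prefix):
            labels.append(token[len(prefix):])
        else:
            words.append(token)
    return labels, " ".join(words)


def encode_lines(lines, prefix="__label__"):
    """Replace every string label with an integer id (sorted label order)."""
    parsed = [split_labels(line, prefix) for line in lines]
    classes = sorted({label for labels, _ in parsed for label in labels})
    index = {label: i for i, label in enumerate(classes)}
    encoded = []
    for labels, text in parsed:
        tags = " ".join(prefix + str(index[label]) for label in labels)
        encoded.append(tags + " " + text)
    return encoded, classes


# Three lines taken from the head output of cooking.stackexchange.txt
sample = [
    "__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?",
    "__label__restaurant Michelin Three Star Restaurant; but if the chef is not there",
    "__label__chocolate American equivalent for British chocolate terms",
]
encoded, classes = encode_lines(sample)
for line in encoded:
    print(line)
```

`classes[i]` gives back the original name for id `i`, so the mapping can be reversed after prediction. Unlike the snippet above, this version keeps several labels per line, which the sample data needs.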