I am confused about the way vw extracts features. Consider a text classification problem where I want to use character n-grams as features. In the simplest case that illustrates my question, the input string is "aa" and I use only 1-gram features. The example should therefore contain a single feature "a" with a count of 2, like this:
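For reference, this is roughly how I generate the count-style input format; a minimal sketch (the `to_vw_line` helper is my own, not part of vw):

```python
from collections import Counter

def to_vw_line(label, text, namespace="X"):
    # Count character 1-grams and emit vw "feature:count" pairs
    # under the given namespace.
    counts = Counter(text)
    feats = " ".join(f"{ch}:{n}" for ch, n in sorted(counts.items()))
    return f"{label} |{namespace} {feats}"

print(to_vw_line(1, "aa"))  # -> 1 |X a:2
```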
$ echo "1 |X a:2" | vw --noconstant --invert_hash f && grep '^X^' f
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile =
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 1.0000 0.0000 1
finished run
number of examples per pass = 1
passes used = 1
weighted example sum = 1
weighted label sum = 1
average loss = 1
best constant = 1
total feature number = 1
X^a:108118:0.196698
However, if I pass the string "aa" to vw with a space introduced between the characters ("a a"), vw reports 2 features:
$ echo "1 |X a a" | vw --noconstant --invert_hash f && grep '^X^' f
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile =
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 1.0000 0.0000 2
finished run
number of examples per pass = 1
passes used = 1
weighted example sum = 1
weighted label sum = 1
average loss = 1
best constant = 1
total feature number = 2
X^a:108118:0.375311
The resulting model contains only a single feature (as I expected), but its weight (0.375311) differs from that of the first model (0.196698).
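What surprises me is that under plain SGD with squared loss, the two encodings should be mathematically equivalent: one feature with value 2 and two occurrences of value 1 hashing to the same slot produce the same prediction and the same gradient on that weight. A quick sketch of that arithmetic (plain SGD only, not vw's actual adaptive/normalized update rule, which I assume is where the difference comes from):

```python
# One weight slot; compare "a:2" (single occurrence, value 2)
# against "a a" (two occurrences of value 1, same hash slot).
def sgd_update(w, xs, y, lr=0.5):
    # Prediction sums the contribution of every occurrence.
    p = sum(w * x for x in xs)
    grad = p - y  # d(0.5 * (p - y)^2) / dp
    # Each occurrence contributes grad * x to the same weight.
    return w - lr * grad * sum(xs)

w_single = sgd_update(0.0, [2.0], 1.0)
w_double = sgd_update(0.0, [1.0, 1.0], 1.0)
assert w_single == w_double  # identical under plain SGD
```

Since vw produces different weights for the two encodings, I suspect its per-feature update (adaptive learning rates and/or normalization) treats the duplicated occurrence differently, but I have not confirmed this in the source.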
When training on real datasets with higher-order n-grams, a significant difference in average loss can be observed depending on which input format is used. I looked at the source in parser.cc, and given more time I could probably work out what is going on; but I would be grateful if someone could explain the difference between the two cases above (is this a bug?) and/or point me to the relevant part of the source.