I am confused about the way vw extracts features. Consider a text classification problem where I want to use character n-grams as features. In the simplest case that illustrates my question, the input string is "aa" and I use only 1-gram features. The example should therefore contain a single feature "a" with a count of 2, like this:
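For reference, this is roughly how I generate the count-style input format; a minimal sketch (the `to_vw_line` helper is my own, not part of vw):

```python
from collections import Counter

def to_vw_line(label, text, namespace="X"):
    # Count character 1-grams and emit vw "feature:count" pairs
    # under the given namespace.
    counts = Counter(text)
    feats = " ".join(f"{ch}:{n}" for ch, n in sorted(counts.items()))
    return f"{label} |{namespace} {feats}"

print(to_vw_line(1, "aa"))  # -> 1 |X a:2
```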
$ echo "1 |X a:2" | vw --noconstant --invert_hash f && grep '^X^' f
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile =
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 1.0000 0.0000 1
finished run
number of examples per pass = 1
passes used = 1
weighted example sum = 1
weighted label sum = 1
average loss = 1
best constant = 1
total feature number = 1
X^a:108118:0.196698
However, if I pass the string "aa" to vw with a space introduced between the characters ("a a"), vw reports 2 features:
$ echo "1 |X a a" | vw --noconstant --invert_hash f && grep '^X^' f
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile =
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 1.0000 0.0000 2
finished run
number of examples per pass = 1
passes used = 1
weighted example sum = 1
weighted label sum = 1
average loss = 1
best constant = 1
total feature number = 2
X^a:108118:0.375311
The resulting model contains only a single feature (as I expected), but its weight (0.375311) differs from that of the first model (0.196698).
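What surprises me is that under plain SGD with squared loss, the two encodings should be mathematically equivalent: one feature with value 2 and two occurrences of value 1 hashing to the same slot produce the same prediction and the same gradient on that weight. A quick sketch of that arithmetic (plain SGD only, not vw's actual adaptive/normalized update rule, which I assume is where the difference comes from):

```python
# One weight slot; compare "a:2" (single occurrence, value 2)
# against "a a" (two occurrences of value 1, same hash slot).
def sgd_update(w, xs, y, lr=0.5):
    # Prediction sums the contribution of every occurrence.
    p = sum(w * x for x in xs)
    grad = p - y  # d(0.5 * (p - y)^2) / dp
    # Each occurrence contributes grad * x to the same weight.
    return w - lr * grad * sum(xs)

w_single = sgd_update(0.0, [2.0], 1.0)
w_double = sgd_update(0.0, [1.0, 1.0], 1.0)
assert w_single == w_double  # identical under plain SGD
```

Since vw produces different weights for the two encodings, I suspect its per-feature update (adaptive learning rates and/or normalization) treats the duplicated occurrence differently, but I have not confirmed this in the source.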
When training on real datasets with higher-order n-grams, a significant difference in average loss can be observed depending on which input format is used. I looked at the source in parser.cc, and given more time I could probably work out what is going on; but I would be grateful if someone could explain the difference between the two cases above (is this a bug?) and/or point me to the relevant part of the source.