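I'm trying to use vw to find words or phrases that predict whether someone will open an email. The target is 1 if they opened the email and 0 otherwise. My data looks like this: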
1 |A this is a test
0 |A this test is only temporary
1 |A i bought a new polo shirt
1 |A that was a great online sale
I put it in a file called "test1.txt" and ran the following command to generate 2-grams and output the variable information:
C:\~\vw>perl vw-varinfo.pl -V --ngram 2 test1.txt >> out.txt
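When I look at the output, there are some bigrams that I can't find in the original data. Is this a bug, or am I misunderstanding something?
Output: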
FeatureName HashVal MinVal MaxVal Weight RelScore
A^a 239656 0.00 1.00 +0.1664 100.00%
A^is 7514 0.00 1.00 +0.0772 46.38%
A^test 12331 0.00 1.00 +0.0772 46.38%
A^this 169573 0.00 1.00 +0.0772 46.38%
A^bought 245782 0.00 1.00 +0.0650 39.06%
A^i 245469 0.00 1.00 +0.0650 39.06%
A^new 51974 0.00 1.00 +0.0650 39.06%
A^polo 48680 0.00 1.00 +0.0650 39.06%
A^shirt 73882 0.00 1.00 +0.0650 39.06%
A^great 220692 0.00 1.00 +0.0610 36.64%
A^online 147727 0.00 1.00 +0.0610 36.64%
A^sale 242707 0.00 1.00 +0.0610 36.64%
A^that 206586 0.00 1.00 +0.0610 36.64%
A^was 223274 0.00 1.00 +0.0610 36.64%
A^a^bought 216990 0.00 0.00 +0.0000 0.00%
A^bought^great 7122 0.00 0.00 +0.0000 0.00%
A^great^i 190625 0.00 0.00 +0.0000 0.00%
A^i^is 76227 0.00 0.00 +0.0000 0.00%
A^is^new 140536 0.00 0.00 +0.0000 0.00%
A^new^online 69117 0.00 0.00 +0.0000 0.00%
A^online^only 173498 0.00 0.00 +0.0000 0.00%
A^only^polo 51059 0.00 0.00 +0.0000 0.00%
A^polo^sale 131483 0.00 0.00 +0.0000 0.00%
A^sale^shirt 191329 0.00 0.00 +0.0000 0.00%
A^shirt^temporary 81555 0.00 0.00 +0.0000 0.00%
A^temporary^test 90632 0.00 0.00 +0.0000 0.00%
A^test^that 13689 0.00 0.00 +0.0000 0.00%
A^that^this 127863 0.00 0.00 +0.0000 0.00%
A^this^was 22011 0.00 0.00 +0.0000 0.00%
Constant 116060 0.00 0.00 +0.1465 0.00%
A^only 62951 0.00 1.00 -0.0490 -29.47%
A^temporary 44641 0.00 1.00 -0.0490 -29.47%
For example, A^bought^great never actually appears in any of the original input lines. Am I doing something wrong?
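To double-check that I'm not misreading my own data, here is a small Python sketch that lists the adjacent word pairs that really occur in the four input lines. This is not part of vw. The helper name adjacent_bigrams is my own, and I'm assuming that a bigram means two words next to each other within one line and that the namespace name directly follows the "|".

```python
# The four example lines from test1.txt, in vw input format.
lines = [
    "1 |A this is a test",
    "0 |A this test is only temporary",
    "1 |A i bought a new polo shirt",
    "1 |A that was a great online sale",
]

def adjacent_bigrams(line):
    """Return the set of adjacent word pairs in one input line (hypothetical helper)."""
    # Everything after the first '|' is the namespace name followed by the words.
    _, features = line.split("|", 1)
    tokens = features.split()[1:]  # drop the namespace name "A"
    # Pair each word with the word that immediately follows it.
    return set(zip(tokens, tokens[1:]))

seen = set()
for line in lines:
    seen |= adjacent_bigrams(line)

# Print every bigram that actually occurs, then check the suspicious one.
for first, second in sorted(seen):
    print(f"A^{first}^{second}")
print("bought/great adjacent in input:", ("bought", "great") in seen)
```

By this count, "bought" and "great" are never adjacent in either order, so I would not expect that pair to show up as a 2-gram feature.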