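I am using gensim to load a pre-trained fastText model. I downloaded the model trained on English Wikipedia from the fastText website.
Here is the code I wrote to load the pre-trained model: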
from gensim.models import FastText as ft
model=ft.load_fasttext_format("wiki.en.bin")
I then checked whether the following phrase exists in the vocabulary (which is unlikely, since this is a pre-trained model).
print("internal executive" in model.wv.vocab)
print("internal executive" in model.wv)
False
True
So the phrase "internal executive" is not in the vocabulary, but we still get a word vector for it.
model.wv["internal executive"]
Out[46]:
array([ 0.0210917 , -0.15233646, -0.1173932 , -0.06210957, -0.07288644,
-0.06304111, 0.07833624, -0.17026938, -0.21922196, 0.01146349,
-0.13639058, 0.17283678, -0.09251394, -0.17875175, 0.01339212,
-0.26683623, 0.05487974, -0.11843193, -0.01982722, 0.37037706,
-0.24370994, 0.14269598, -0.16363597, 0.00328478, -0.16560239,
-0.1450972 , -0.24787527, -0.01318423, 0.03277111, 0.16175713,
-0.19367714, 0.16955379, 0.1972683 , 0.09044111, 0.01731548,
-0.0034324 , -0.04834719, 0.14321515, 0.01422525, -0.08803893,
-0.29411593, -0.1033244 , 0.06278021, 0.16452256, 0.0650492 ,
0.1506474 , -0.14194389, 0.10778475, 0.16008648, -0.07853138,
0.2183501 , -0.25451994, -0.0345991 , -0.28843886, 0.19964759,
-0.10923116, 0.26665714, -0.02544454, 0.30637854, 0.04568949,
-0.04798719, -0.05769338, 0.25762403, -0.05158515, -0.04426906,
-0.19901046, 0.00894193, -0.17269588, -0.24747233, -0.19061406,
0.14322804, -0.10804397, 0.4002605 , 0.01409482, -0.04675362,
0.10039093, 0.07260711, -0.0938239 , -0.20434211, 0.05741301,
0.07592541, -0.02921724, 0.21137556, -0.23188967, -0.23164661,
-0.4569614 , 0.07434579, 0.10841205, -0.06514647, 0.01220404,
0.02679767, 0.11840229, 0.2247431 , -0.1946325 , -0.0990666 ,
-0.02524677, 0.0801085 , 0.02437297, 0.00674876, 0.02088535,
0.21464555, -0.16240154, 0.20670174, -0.21640894, 0.03900698,
0.21772243, 0.01954809, 0.04541844, 0.18990673, 0.11806394,
-0.21336791, -0.10871669, -0.02197789, -0.13249406, -0.20440844,
0.1967368 , 0.09804545, 0.1440366 , -0.08401451, -0.03715726,
0.27826542, -0.25195453, -0.16737154, 0.3561183 , -0.15756823,
0.06724873, -0.295487 , 0.28395334, -0.04908851, 0.09448399,
0.10877471, -0.05020981, -0.24595442, -0.02822314, 0.17862654,
0.06452435, -0.15105674, -0.31911567, 0.08166212, 0.2634299 ,
0.17043628, 0.10063848, 0.0687021 , -0.12210461, 0.10803893,
0.13644943, 0.10755012, -0.09816817, 0.11873955, -0.03881042,
0.18548298, -0.04769253, -0.01511982, -0.08552645, -0.05218676,
0.05387992, 0.0497043 , 0.06922272, -0.0089245 , 0.24790663,
0.27209425, -0.04925154, -0.08621719, 0.15918174, 0.25831223,
0.01654229, -0.03617229, -0.13490392, 0.08033483, 0.34922174,
-0.01744722, -0.16894792, -0.10506647, 0.21708378, -0.22582002,
0.15625793, -0.10860757, -0.06058934, -0.25798836, -0.20142137,
-0.06613475, -0.08779443, -0.10732629, 0.05967236, -0.02455976,
0.2229451 , -0.19476262, -0.2720119 , 0.03687386, -0.01220259,
0.07704347, -0.1674307 , 0.2400516 , 0.07338555, -0.2000631 ,
0.13897157, -0.04637206, -0.00874449, -0.32827383, -0.03435039,
0.41587186, 0.04643605, 0.03352945, -0.13700874, 0.16430037,
-0.13630766, -0.18546128, -0.04692861, 0.37308362, -0.30846512,
0.5535561 , -0.11573419, 0.2332801 , -0.07236694, -0.01018955,
0.05936847, 0.25877884, -0.2959846 , -0.13610311, 0.10905041,
-0.18220575, 0.06902339, -0.10624941, 0.33002165, -0.12087796,
0.06742091, 0.20762768, -0.34141317, 0.0884434 , 0.11247049,
0.14748637, 0.13261876, -0.07357208, -0.11968047, -0.22124515,
0.12290633, 0.16602683, 0.01055585, 0.04445777, -0.11142147,
0.00004863, 0.22543314, -0.14342701, -0.23209116, -0.00003538,
0.19272381, -0.13767233, 0.04850799, -0.281997 , 0.10343244,
0.16510887, 0.08671653, -0.24125539, 0.01201926, 0.0995285 ,
0.09807415, -0.06764816, -0.0206733 , 0.04697794, 0.02000999,
0.05817033, 0.10478792, 0.0974884 , -0.01756372, -0.2466861 ,
0.02877498, 0.02499748, -0.00370895, -0.04728201, 0.00107118,
-0.21848503, 0.2033032 , -0.00076264, 0.03828803, -0.2929495 ,
-0.18218371, 0.00628893, 0.20586628, 0.2410889 , 0.02364616,
-0.05220835, -0.07040054, -0.03744286, -0.06718048, 0.19264086,
-0.06490505, 0.27364203, 0.05527219, -0.27494466, 0.22256687,
0.10330909, -0.3076979 , 0.04852265, 0.07411488, 0.23980476,
0.1590279 , -0.26712465, 0.07580928, 0.05644221, -0.18824042],
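Now my confusion is that fastText also creates vectors for the character n-grams of a word. So for the word "internal", it creates vectors for all of its character n-grams, including the full word, and the final word vector for that word is the sum of its character n-gram vectors.
However, how is it still able to give me a vector for a phrase, or even a whole sentence? Aren't fastText vectors meant for a word and its n-grams? So what are these vectors I am seeing, when the input is clearly two words?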
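To make my understanding concrete, here is a minimal sketch of how I believe the character n-grams of a single word are extracted. The function name `char_ngrams` is made up for illustration and is not part of gensim or fastText. I am also assuming n-gram lengths of 3 to 6 and the `<` / `>` word-boundary markers, which I understand to be the fastText defaults; the downloaded model may have been trained with other settings.

```python
def char_ngrams(token, min_n=3, max_n=6):
    """Return the character n-grams of a token (hypothetical helper).

    The token is wrapped in '<' and '>' boundary markers before slicing,
    and min_n / max_n are assumed defaults, not values read from the model.
    """
    wrapped = f"<{token}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped token.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


grams = char_ngrams("internal")
print(len(grams))   # number of n-grams extracted for "internal"
print(grams[:8])    # the 3-grams, starting with "<in"
```

As I understand it, each of these n-grams is mapped to its own vector, the vector for the full word is added as well, and the results are combined into the final word vector.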