nlp - Can I tokenize with spaCy and then look up vectors for those tokens in fastText's pretrained word embeddings?

Question

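I am tokenizing my German text corpus with spaCy's German model. Since spaCy currently offers only a small German model, I cannot extract word vectors with spaCy itself. So I am using fastText's pretrained word embeddings from here: https://github.com/facebookresearch/fastText/blob/master/README.md#word-representation-learning

Facebook tokenized its training text with the ICU tokenizer before learning these embeddings, whereas I am using spaCy. Can someone tell me whether this is okay? I suspect that spaCy and the ICU tokenizer behave differently, and if so, many tokens in my corpus will have no corresponding word vector.

Thanks for your help!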

score 2 · Accepted Answer

UPDATE:

I tried the approach described in the question, and after extensive testing I found that it works well for my use case. Most (almost all) of the tokens in my data matched tokens present in fastText, and I was able to obtain word vectors for them.
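
For anyone who wants to run the same kind of check, here is a minimal sketch of how token coverage could be measured. It is not the exact test I ran. It assumes the embeddings are in fastText's text format (a `.vec` file whose first line is `<word count> <dimension>`, followed by one word and its vector per line), and the helper names `load_vec_vocab` and `coverage`, as well as the toy data, are made up for illustration.

```python
from collections import Counter


def load_vec_vocab(lines):
    """Collect the words from fastText text-format (.vec) lines.

    `lines` is any iterable of strings, e.g. an open file handle.
    Assumes the first line is the "<word count> <dimension>" header.
    """
    it = iter(lines)
    next(it, None)  # skip the header line
    vocab = set()
    for line in it:
        # The word is everything before the first space; the rest is the vector.
        word = line.rstrip("\n").split(" ", 1)[0]
        if word:
            vocab.add(word)
    return vocab


def coverage(tokens, vocab):
    """Return (fraction of token occurrences found in vocab, Counter of misses)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    missing = Counter({tok: n for tok, n in counts.items() if tok not in vocab})
    covered = total - sum(missing.values())
    return (covered / total if total else 0.0), missing


# Toy stand-ins for a real .vec file and for spaCy's tokenizer output.
vec_lines = ["3 2\n", "Haus 0.1 0.2\n", "und 0.3 0.4\n", ". 0.5 0.6\n"]
tokens = ["Haus", "und", "Garten", "."]

vocab = load_vec_vocab(vec_lines)
ratio, missing = coverage(tokens, vocab)
print(f"coverage: {ratio:.2%}")
print("most frequent misses:", missing.most_common(10))
```

With real data, `tokens` would come from spaCy (for example `[t.text for t in nlp(text)]`) and `vec_lines` would be the opened `.vec` file. Looking at the most frequent misses shows quickly whether the gaps come from tokenizer differences such as punctuation splitting, or from genuinely rare words.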
