python - Overriding the tokenizer of a scikit-learn vectorizer with spaCy

Question

I want to implement lemmatization with the spaCy package. Here is my code:

import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

regexp = re.compile( '(?u)\\b\\w\\w+\\b' )
en_nlp = spacy.load('en')
old_tokenizer = en_nlp.tokenizer
en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(regexp.findall(string))

def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

lemma_tfidfvect = TfidfVectorizer(tokenizer= custom_tokenizer,stop_words = 'english')

But when I run this code, I get the following error message.

C:\Users\yu\Anaconda3\lib\runpy.py:193: DeprecationWarning: Tokenizer.from_list is now deprecated. Create a new Doc object instead and pass in the strings as the `words` keyword argument, for example:
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=[...])
  "__main__", mod_spec)

How can I solve this problem?

score 0 · Accepted Answer

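To customize spaCy's tokenizer, you pass it a list of dictionaries that specifies, for a word that needs custom tokenization, how it should be split. Here is the example code from the documentation: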

from spacy.attrs import ORTH, LEMMA
# add_special_case takes the string to match, then a list of per-token attribute dicts
case = [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]
tokenizer.add_special_case("don't", case)

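If you are doing this because you want to build a custom lemmatizer, you may be better off creating a custom lemma list directly. You would have to modify spaCy's own language data, but the format is very simple: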

"dustiest": ("dusty",),
"earlier": ("early",),
"earliest": ("early",),
"earthier": ("earthy",),
...

The files for English are kept here.
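To make the format concrete, here is a minimal sketch of how a table like this could be used for lookup-based lemmatization. It is an illustration only, not spaCy's internal implementation: `LEMMA_EXCEPTIONS` and `lookup_lemma` are hypothetical names, and the table holds just the four entries from the excerpt above.

```python
# Entries copied from the excerpt above; a real table would be much larger.
LEMMA_EXCEPTIONS = {
    "dustiest": ("dusty",),
    "earlier": ("early",),
    "earliest": ("early",),
    "earthier": ("earthy",),
}


def lookup_lemma(word, table=LEMMA_EXCEPTIONS):
    """Return the first listed lemma for `word`, or the lowercased word if absent."""
    candidates = table.get(word.lower())
    return candidates[0] if candidates else word.lower()


lemmas = [lookup_lemma(w) for w in ["Earliest", "dustiest", "table"]]
print(lemmas)  # ['early', 'dusty', 'table']
```

Words that are not in the table fall through unchanged (apart from lowercasing), which is why a lookup table alone only covers irregular forms.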

score 0 · Accepted Answer

I think your code runs fine. You are only getting a DeprecationWarning, which is not really an error.
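The difference between a warning and an error can be shown with the standard `warnings` module alone. In this sketch, `old_api` is a hypothetical stand-in for a deprecated call (it is not a spaCy function): it emits a DeprecationWarning and still returns its result.

```python
import warnings


def old_api(words):
    """Hypothetical deprecated function: it warns, then still does its job."""
    warnings.warn("old_api is deprecated", DeprecationWarning, stacklevel=2)
    return list(words)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning instead of hiding it
    result = old_api(["a", "b"])

print(result)                        # ['a', 'b']
print(caught[0].category.__name__)   # DeprecationWarning
```

Execution continues after the warning, so the return value is still usable. The warning only signals that the call may be removed in a later release.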

Following the advice in the warning, I think you can change your code by replacing the tokenizer assignment with:

from spacy.tokens import Doc
en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab, words=regexp.findall(string))

This should run fine with no warnings (it did on my machine today).
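Putting the pieces together, the whole setup could look roughly like the sketch below. Several things here are my assumptions rather than part of the question: the helper name `build_lemma_tokenizer` is made up, the model name `en_core_web_sm` is what newer spaCy releases use in place of the `'en'` shortcut, and spaCy is imported inside the function only so the regex can be tried without it. The regex itself is the same pattern scikit-learn uses as its default `token_pattern`, so it keeps runs of two or more word characters.

```python
import re

# Same pattern as in the question: runs of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def build_lemma_tokenizer(model_name="en_core_web_sm"):
    """Return a callable suitable for TfidfVectorizer(tokenizer=...).

    The model name is an assumption; use whichever English model is installed.
    """
    import spacy
    from spacy.tokens import Doc

    nlp = spacy.load(model_name)
    # Build the Doc directly from the regex tokens, as the warning suggests,
    # instead of calling the deprecated tokens_from_list.
    nlp.tokenizer = lambda text: Doc(nlp.vocab, words=TOKEN_RE.findall(text))

    def custom_tokenizer(document):
        # Run the pipeline and keep only the lemma of each token.
        return [token.lemma_ for token in nlp(document)]

    return custom_tokenizer


# The regex step on its own, without loading spaCy:
tokens = TOKEN_RE.findall("I don't like it")
print(tokens)  # ['don', 'like', 'it']
```

The returned function can then be passed to the vectorizer as in the question, for example `TfidfVectorizer(tokenizer=build_lemma_tokenizer(), stop_words='english')`. Note that the regex drops single-character tokens, so "don't" becomes just "don" before spaCy ever sees it.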
