I have the following code:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
# Load the tokenizer and the token-classification model from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/codeswitch-spaeng-lid-lince")
model = AutoModelForTokenClassification.from_pretrained("sagorsarker/codeswitch-spaeng-lid-lince")
# Use a distinct name so the imported pipeline() factory is not shadowed
ner_pipeline = pipeline('ner', model=model, tokenizer=tokenizer)
sentence = "some example sentence here"
results = ner_pipeline(sentence)

This works fine. But instead of a str, I now want to pass a list of tokens. How can I do that?

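The reason I want to do this is that my sentences are already tokenized, and a simple " ".join() does not reproduce the original sentence correctly. For example, isn't has been tokenized into is and n't, but a simple " ".join() yields is n't.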
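
To make the problem concrete, here is a minimal sketch. The token list is made up for illustration and is not taken from real data:

```python
# Hypothetical pre-tokenized sentence: the contraction "isn't" was split into "is" + "n't".
tokens = ["this", "is", "n't", "right", "."]

# Joining on a single space puts a space inside the contraction and before the period.
joined = " ".join(tokens)
print(joined)  # this is n't right .
```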

1 Answer

I assume the original data was tokenized with NLTK, so try the NLTK detokenizer:

from nltk.tokenize.treebank import TreebankWordDetokenizer
toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!', 'Help', '!', '!']
twd = TreebankWordDetokenizer()
twd.detokenize(toks)
# "hello, i can't feel my feet! Help!!"
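
If you then need the predictions back on your original tokens, one option is to recover each token's character span in the detokenized string. This is a minimal sketch, assuming detokenization only changed whitespace (this is not guaranteed for every input, since some characters such as quotes may be rewritten). `token_spans` is a hypothetical helper, not part of NLTK or transformers:

```python
def token_spans(tokens, text):
    """Locate each token in `text`, scanning left to right.

    Assumption: every token still appears verbatim and in order in `text`.
    Raises ValueError if a token cannot be found.
    """
    spans = []
    cursor = 0
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after position {cursor}")
        end = start + len(tok)
        spans.append((start, end))
        cursor = end  # continue searching after this token
    return spans


toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!', 'Help', '!', '!']
text = "hello, i can't feel my feet! Help!!"
spans = token_spans(toks, text)
print(spans[3], spans[4])  # (9, 11) (11, 14)
```

If the pipeline output carries character offsets (`start` and `end` keys, which as far as I know are only filled in with a fast tokenizer), each prediction can be assigned to the token whose span contains it.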
answered 2022-01-18T21:43:50.833