I am doing NER tagging with the Stanford 3class model through the NLTK wrapper. A UnicodeDecodeError is raised on raw BBC news text written in English.

Here is my code:

from nltk.tag import StanfordNERTagger
import nltk

# Stanford 3class NER model plus the NER jar, with utf-8 passed to the tagger
st1 = StanfordNERTagger('/home/saurabh/saurabh-cair/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
                        '/home/saurabh/saurabh-cair/stanford-ner-2018-10-16/stanford-ner.jar',
                        encoding='utf-8')

# Read the raw BBC news text
file = open('/home/saurabh/saurabh-cair/model_training/bbc/data.txt', 'rt')
text = file.read()
file.close()

# Tokenize and tag -- the error below is raised on the word_tokenize call
words = nltk.word_tokenize(text)
xyz = st1.tag(words)

for i in xyz:
    print(i)

The error is:

Traceback (most recent call last):
  File "model_english.py", line 26, in <module>
    words = nltk.word_tokenize(text)
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 95, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1241, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1291, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1281, in span_tokenize
    for sl in slices:
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1322, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 314, in _pair_iter
    for el in it:
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1297, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1343, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1478, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 313, in _pair_iter
    prev = next(it)
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 584, in _annotate_first_pass
    for aug_tok in tokens:
  File "/home/saurabh/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 550, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)  

I also tried utf-8, ascii, and the default encoding, but none of them solved my problem.
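
For what it's worth, the kind of explicit decode I mean by "tried utf-8" looks like this (a minimal sketch, assuming the data file really is UTF-8; io.open is one way to get unicode at read time in Python 2.7):

import io

# Open in text mode with an explicit codec; read() then returns a
# unicode object instead of a raw byte string (Python 2.7).
with io.open('/home/saurabh/saurabh-cair/model_training/bbc/data.txt',
             'rt', encoding='utf-8') as f:
    text = f.read()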

The text data contains sentences such as the following:

General Motors of the US is to pay Fiat 1.55bn euros ($2bn; £1.1bn) to get out of a deal which could have forced it to buy the Italian car maker outright.  
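
If it helps with diagnosis: 0xc2, the byte named in the traceback, is the lead byte of several two-byte UTF-8 sequences, and the £ sign in that sentence is one of them. A quick check in Python 2.7:

# -*- coding: utf-8 -*-
# The pound sign is two bytes in UTF-8; the first byte is 0xc2,
# which is the byte the ascii codec fails on in the traceback.
print(repr(u'£'.encode('utf-8')))         # '\xc2\xa3'
print(u'£'.encode('utf-8')[0] == '\xc2')  # True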

I am using Anaconda Python 2.7.
