python - nltk does not insert end-of-sentence symbols when generating trigrams

Question

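I am generating text from The Hobbit using Kneser-Ney smoothing. My model does generate sentences, but I believe there is room for improvement.

Currently I am not using symbols to mark the beginning and end of sentences. When I try to insert them with the code below, I see the beginning-of-sentence symbol only at the very start of the text; for the remaining sentences no symbols are inserted. It is almost as if the end of a sentence is not being detected at all.

I tried not converting the text to lowercase, but that did not change anything.

Can you tell me how to insert the end-of-sentence symbols?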

from nltk import ngrams, word_tokenize
from nltk.corpus import stopwords

with open("hobbit.txt") as f:
    hobbit_text = f.read()

# tokenize the whole book into one flat, lowercased token list
hobbit_text = word_tokenize(hobbit_text.lower())

stop_words = stopwords.words('english')
personal_names = ['legolas', 'gimli', 'boromir', 'frodo', 'thorin', 'thror', 'gandalf', 'smeagol', 'gollum', 'balin', 'elrond', 'aragorn', 'bilbo', 'sauron']
signs = ['”', '“', '!', '?', '’', '`', "'", '``', ',', ";", "(", ")"]

use_stop_words = True
use_punctuation = False
# get rid of stop words, punctuation (if necessary)
if not use_stop_words:
    hobbit_text = [x for x in hobbit_text if x not in stop_words]
if not use_punctuation:
    hobbit_text = [x for x in hobbit_text if x not in signs]

vocab = set(hobbit_text)

counter = 0
# build padded trigrams over the full token list
hobbit_trigram = ngrams(hobbit_text, 3, pad_left=True, pad_right=True, left_pad_symbol='BOS', right_pad_symbol='EOS')

# print the first 100 trigrams
for a in hobbit_trigram:
    print(a)
    counter += 1
    if counter == 100:
        break

The output for the first sentence looks like this. I expected an end-of-sentence symbol after the word "gold".

('BOS', 'BOS', 'the')
('BOS', 'the', 'hobbit')
('the', 'hobbit', 'or')
('hobbit', 'or', 'there')
('or', 'there', 'and')
('there', 'and', 'back')
('and', 'back', 'again')
('back', 'again', 'jrr')
('again', 'jrr', '.')
('jrr', '.', 'tolkien')
('.', 'tolkien', 'the')
('tolkien', 'the', 'hobbit')
('the', 'hobbit', 'is')
('hobbit', 'is', 'a')
('is', 'a', 'tale')
('a', 'tale', 'of')
('tale', 'of', 'high')
('of', 'high', 'adventure')
('high', 'adventure', 'undertaken')
('adventure', 'undertaken', 'by')
('undertaken', 'by', 'a')
('by', 'a', 'company')
('a', 'company', 'of')
('company', 'of', 'dwarves')
('of', 'dwarves', 'in')
('dwarves', 'in', 'search')
('in', 'search', 'of')
('search', 'of', 'dragon-guarded')
('of', 'dragon-guarded', 'gold')
('dragon-guarded', 'gold', '.')
('gold', '.', 'a')

score 0 · Accepted Answer

Try doing it this way:

from functools import partial
from nltk import ngrams

padded_ngrams = partial(ngrams, pad_left=True, pad_right=True, left_pad_symbol='BOS', right_pad_symbol='EOS')

padded_hobbit_text = list(padded_ngrams(hobbit_text, 3))

# now print your value to see if it's what you want
print(padded_hobbit_text)

# with an input of "TEXT", it gave me the following output
'''
[('BOS', 'BOS', 'T'),
 ('BOS', 'T', 'E'),
 ('T', 'E', 'X'),
 ('E', 'X', 'T'),
 ('X', 'T', 'EOS'),
 ('T', 'EOS', 'EOS')]
'''

I tried this, and it produced the convenient format you asked for in your question.
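
Note that `ngrams` pads only the two ends of whatever sequence it is given, which is why the question's output shows `BOS` once at the very start of the book and no `EOS` after "gold". To get the symbols around every sentence, the tokens probably need to be grouped into sentences first and each sentence padded on its own. Below is a minimal sketch of that idea in plain Python, so it runs without NLTK or its data files. The helpers `split_sentences` and `padded_ngrams` are hypothetical names, not NLTK functions, and treating every `.`, `!` or `?` token as a sentence end is a simplifying assumption.

```python
from itertools import chain

def split_sentences(tokens, terminators=(".", "!", "?")):
    """Group a flat token list into sentences, closing each one at a terminator token."""
    sentence = []
    for tok in tokens:
        sentence.append(tok)
        if tok in terminators:
            yield sentence
            sentence = []
    if sentence:  # trailing tokens that never reached a terminator
        yield sentence

def padded_ngrams(tokens, n, bos="BOS", eos="EOS"):
    """Pad one sentence with n-1 symbols on each side, then slide a window of size n."""
    padded = [bos] * (n - 1) + list(tokens) + [eos] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

# small stand-in for the tokenized book
tokens = ["of", "dragon-guarded", "gold", ".", "a", "tale"]

# pad sentence by sentence, then flatten into one trigram list
trigrams = list(chain.from_iterable(
    padded_ngrams(sentence, 3) for sentence in split_sentences(tokens)
))

print(trigrams[4:7])
# [('gold', '.', 'EOS'), ('.', 'EOS', 'EOS'), ('BOS', 'BOS', 'a')]
```

The same per-sentence loop should work with the `padded_ngrams` partial from the answer in place of the hand-written helper. With NLTK available, running `sent_tokenize` on the raw text before `word_tokenize` is likely a better splitter than the terminator rule used here, since an abbreviation such as the "jrr ." in the question's output would otherwise end a sentence early.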
