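I am currently reading text from an Excel file and applying bigrams to it. finalList, used in the sample code below, holds the list of input words read from the input Excel file.

Stop words are removed from the input with the help of the following library: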

from nltk.corpus import stopwords

Bigram logic applied to the list of input words:

from nltk.util import ngrams
bigram = ngrams(finalList, 2)

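Input text: I completed my end-to-end process.

Current output: completed end, end end, end process.

Desired output: completed end-to-end, end-to-end process.

This means that a hyphenated group of words such as end-to-end should be treated as a single word.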
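The current output can be reproduced with a minimal sketch, assuming the tokenizer splits on every non-letter character and that "to" is then dropped as a stop word (NLTK's English list contains "i", "my" and "to"). The `STOP_WORDS` set and the `re.split` pattern here are illustrative stand-ins, not the actual pipeline:

```python
import re

# Assumed subset of the stop-word list; stands in for stopwords.words('english')
STOP_WORDS = {"i", "my", "to"}

text = "I completed my end-to-end process."

# Splitting on every non-letter breaks "end-to-end" into 'end', 'to', 'end'
tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]

# Removing stop words then drops the 'to' in the middle
kept = [t for t in tokens if t.lower() not in STOP_WORDS]

bigrams = list(zip(kept, kept[1:]))
print(bigrams)
# [('completed', 'end'), ('end', 'end'), ('end', 'process')]
```

If that is what happens, the hyphenated word is lost at the tokenizing step, before the bigrams are built.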

1 Answer

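To solve your problem, strip the unwanted punctuation with a regular expression before tokenizing, and keep the hyphen out of the pattern so that end-to-end survives as a single token. See this example: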

 import re
 text = 'I completed my end-to-end process..:?'
 # Remove every '.', ':' and '?'. The hyphen is not in the character class, so 'end-to-end' stays intact.
 pattern = re.compile(r"[.:?]+")
 new_text = pattern.sub('', text)
 print(new_text)
 # I completed my end-to-end process


 # Now you can generate bigrams manually.
 # 1. Tokenize the new text
 tok = new_text.split()
 print(tok)  # If the token list is huge, print only the first five: print(tok[:5])
 # ['I', 'completed', 'my', 'end-to-end', 'process']

 # 2. Loop over the list and generate bigrams, store them in a var called bigrams
 bigrams = []
 for i in range(len(tok) - 1):  # -1 to avoid index error
     bigram = tok[i] + ' ' + tok[i + 1]  
     bigrams.append(bigram)


 # 3. Print your bigrams
 for bi in bigrams:
     print(bi, end = ', ')

I completed, completed my, my end-to-end, end-to-end process,
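If the stop words should also be dropped, as in the desired output from the question, the same idea can be sketched in one function. This is only a sketch under assumptions: `STOP_WORDS`, `TOKEN_RE` and `hyphen_aware_bigrams` are hypothetical names, and the small stop-word set stands in for `stopwords.words('english')` so the snippet runs without the NLTK corpus.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words('english'),
# kept tiny so the sketch runs without downloading the corpus.
STOP_WORDS = {"i", "my", "the", "a", "an"}

# A token is a run of letters/digits, optionally joined by single hyphens,
# so "end-to-end" is matched as one token and trailing punctuation is skipped.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def hyphen_aware_bigrams(text, stop_words=STOP_WORDS):
    """Return bigrams as tuples, keeping hyphenated words whole."""
    tokens = [t for t in TOKEN_RE.findall(text) if t.lower() not in stop_words]
    # Pair each token with its successor
    return list(zip(tokens, tokens[1:]))


pairs = hyphen_aware_bigrams("I completed my end-to-end process.")
print(", ".join(" ".join(p) for p in pairs))
# completed end-to-end, end-to-end process
```

Because the stop words are filtered after tokenizing, the "to" inside end-to-end is never seen as a separate word and cannot be removed.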

I hope this helps!

Answered 2017-10-12T22:43:10.407