python-3.x - When comparing two text files, how do I stop tokenization from treating contractions and their expanded forms as the same?

Question

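I'm currently working on a data structure that should compare two text files and list the strings they have in common. My program receives the contents of the two files as two strings, a and b (one file's contents per variable). I then use the tokenize function (nltk.sent_tokenize) in a for loop to split each string into sentences. The sentences are stored in a set to avoid duplicate entries, so all duplicate lines within each variable are removed before the comparison. I then compare the two variables against each other and keep only the strings they have in common. The bug shows up in this last part, when the two are compared. The program treats a contraction and its expanded counterpart as the same when it shouldn't. For example, it reads "shouldn't" and "should not" as identical and produces the wrong answer.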

import nltk

def sentences(a, b):  # a and b hold the contents of the two files as strings
    # Split file a into sentences, keeping only the first occurrence of each
    set_a = set()
    unique_a = []
    for sentence in nltk.sent_tokenize(a):
        if sentence not in set_a:
            set_a.add(sentence)
            unique_a.append(sentence)

    # For b, only a set is needed, since it is used for membership tests
    set_b = set(nltk.sent_tokenize(b))

    # Build a new list instead of removing items from a list while iterating
    # over it (which skips elements), and return it instead of an empty list
    return [sentence for sentence in unique_a if sentence in set_b]
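One way to check whether the comparison step itself conflates the two forms is to run it on a tiny input. The sketch below is not the asker's code. It assumes a hypothetical `split_sentences` helper, built on a simple regex as a crude stand-in for `nltk.sent_tokenize` so that it runs without NLTK data, and `common_sentences` is an illustrative name. Under those assumptions, exact string comparison keeps "shouldn't" and "should not" distinct.

```python
import re

def split_sentences(text):
    # Hypothetical stand-in for nltk.sent_tokenize: split on whitespace that
    # follows '.', '!' or '?'. Much cruder than NLTK's tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def common_sentences(a, b):
    # Deduplicate a's sentences while preserving their order.
    unique_a = list(dict.fromkeys(split_sentences(a)))
    set_b = set(split_sentences(b))
    # Exact string equality: a contraction and its expansion are different strings.
    return [s for s in unique_a if s in set_b]

file_a = "You shouldn't go. It is late. It is late."
file_b = "You should not go. It is late."
print(common_sentences(file_a, file_b))  # ['It is late.']
```

If a run like this keeps the two forms apart, the wrong answer more likely comes from how the result list is built or returned than from the tokenizer.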
