I have a long list of tweets (over 50k) stored in a Python list. I am at the stage of comparing each item against every other item to find similar tweets using difflib, so that I can drop tweets that are more than 75% similar and keep only one from each similar group. I loop over all pairs with itertools.combinations, but it takes a very long time (i.e. days). Here is my code:
import pandas as pd
from difflib import SequenceMatcher
import itertools
import re
import time

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

df1 = pd.read_csv("50k_TweetSheet.csv")
data = df1['text'].tolist()
originalData = data[:]  # keep an untouched copy (slicing copies the list; plain assignment would only alias it)
outList = []

# strip URLs, mentions and retweet markers, then collapse whitespace
data[:] = [re.sub(r"http\S+", "", s) for s in data]
data[:] = [re.sub(r"@\S+", "", s) for s in data]
data[:] = [re.sub(r"RT|rt\S+", "", s) for s in data]
data[:] = [re.sub(r"\r+", " ", s) for s in data]
data[:] = [re.sub(r"\n+", " ", s) for s in data]
data[:] = [re.sub(r" +", " ", s) for s in data]

numOfRows = len(data)
start_time = time.time()

for a, b in itertools.combinations(range(numOfRows), 2):
    if len(data[a].split()) < 4:
        continue
    if a in outList:
        continue
    similarity = similar(data[a], data[b])
    if similarity > 0.75:
        # keep the longer tweet, flag the shorter one for removal
        if len(data[a].split()) > len(data[b].split()):
            outList.append(b)
            print(data[a])
        else:
            outList.append(a)
            print(data[b])
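For reference, this is how the ratio I am relying on behaves on a pair of made-up near-duplicate strings (just an illustration of the 0.75 threshold, not data from my sheet):

from difflib import SequenceMatcher

a = "Check out our new blog post on data cleaning"          # made-up example tweet
b = "Check out our new blog post on data cleaning today!"   # near-duplicate variant
print(SequenceMatcher(None, a, b).ratio())  # well above the 0.75 threshold, so one of the pair would be flagged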
Is there a faster way to do this?