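I have two pandas dataframes in Python, each with millions of rows. I want to remove every row of the first dataframe that contains a word from the second dataframe, under three conditions:
- the word appears at the beginning of the sentence
- the word appears at the end of the sentence
- the word appears in the middle of the sentence (as an exact word, not as a substring)
Example:
First dataframe: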
This is the first sentence
Second this is another sentence
This is the third sentence forth
This is fifth sentence
This is fifth_sentence
Second dataframe:
Second
forth
fifth
Expected output:
This is the first sentence
This is fifth_sentence
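Read literally, the three conditions collapse into one rule: drop a sentence if any of its whitespace-separated tokens is exactly equal to a word in the second dataframe. The sketch below checks that reading against the example data. The helper name `keep_sentence` is made up for illustration, and the code assumes case-sensitive matching with no punctuation attached to the words.

```python
# Example data copied from the question.
sentences = [
    "This is the first sentence",
    "Second this is another sentence",
    "This is the third sentence forth",
    "This is fifth sentence",
    "This is fifth_sentence",
]
bad_words = {"Second", "forth", "fifth"}  # a set gives fast membership tests


def keep_sentence(sentence, words):
    """Return True when no whitespace-separated token is in `words`.

    Hypothetical helper: exact, case-sensitive token comparison, so
    "fifth_sentence" does not match the word "fifth".
    """
    return words.isdisjoint(sentence.split())


kept = [s for s in sentences if keep_sentence(s, bad_words)]
print(kept)
# ['This is the first sentence', 'This is fifth_sentence']
```

Under these assumptions the result matches the expected output, including the `fifth_sentence` row that must survive.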
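Note that both dataframes contain millions of records. How can I process and export them in the most efficient way?
I tried the following, but it takes a very long time: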
import pandas as pd
import re

# Load the bad words and the sentences (first column of each file is used).
bad_words_file_data = pd.read_csv("words.txt", sep=",", header=None)
sentences_file_data = pd.read_csv("setences.txt", sep=".", header=None)

bad_words_index = []
for i in sentences_file_data.index:
    print("Processing Sentence:- ", i, "\n")
    single_sentence = sentences_file_data[0][i]
    # Compare the sentence against every bad word, one by one.
    for j in bad_words_file_data.index:
        word = bad_words_file_data[0][j]
        if single_sentence.endswith(word) or single_sentence.startswith(word) or word in single_sentence.split(" "):
            bad_words_index.append(i)
            break

# Drop the flagged rows and export what is left.
sentences_file_data = sentences_file_data.drop(index=bad_words_index)
sentences_file_data.to_csv("filtered.txt", header=None, index=False)
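The slowness most likely comes from the nested loop, which compares every sentence with every word. One possible direction, sketched here without pandas, is to load the words once into a set and stream the sentences file line by line, so each sentence costs a few set lookups instead of millions of comparisons. The function name `filter_sentences` is hypothetical, and the sketch assumes one word per line, one sentence per line, and exact case-sensitive matching. It also skips the `startswith`/`endswith` checks, since those would match substrings as well.

```python
def filter_sentences(sentences_path, words_path, output_path):
    """Copy sentences to output_path, skipping any that contain a listed word.

    Hypothetical helper; returns the number of sentences kept.
    """
    # Load the word list once into a set (assumes one word per line).
    with open(words_path, encoding="utf-8") as f:
        bad_words = {line.strip() for line in f if line.strip()}

    kept = 0
    with open(sentences_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            sentence = line.rstrip("\n")
            # Exact token match only: substrings such as "fifth_sentence"
            # are not treated as the word "fifth".
            if bad_words.isdisjoint(sentence.split()):
                dst.write(sentence + "\n")
                kept += 1
    return kept
```

A call such as `filter_sentences("setences.txt", "words.txt", "filtered.txt")` would mirror the file names used above. Because nothing is held in memory except the word set, this should scale to millions of rows, though I have not benchmarked it on data of that size.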
Thanks