
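I have two pandas data frames in Python, each with millions of rows. I want to remove rows from the first data frame that contain words from the second data frame, based on three conditions: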

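  1. The word appears at the beginning of the sentence in a row
  2. The word appears at the end of the sentence in a row
  3. The word appears in the middle of the sentence in a row (the exact word, not a substring)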

Example:

First data frame:

This is the first sentence
Second this is another sentence
This is the third sentence forth
This is fifth sentence
This is fifth_sentence 

Second data frame:

Second
forth
fifth

Expected output:

This is the first sentence
This is fifth_sentence 

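Note that both data frames contain millions of records. What is the most efficient way to process them and export the result?

I tried the following, but it takes a very long time: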

import pandas as pd
import re

bad_words_file_data = pd.read_csv("words.txt", sep = ",", header = None)
sentences_file_data = pd.read_csv("setences.txt", sep = ".", header = None)

bad_words_index = []
for i in sentences_file_data.index:
    print("Processing Sentence:- ", i, "\n")
    single_sentence = sentences_file_data[0][i]
    for j in bad_words_file_data.index:
        word = bad_words_file_data[0][j]
        if single_sentence.endswith(word) or single_sentence.startswith(word) or word in single_sentence.split(" "):
            bad_words_index.append(i)
            break
            
sentences_file_data = sentences_file_data.drop(index=bad_words_index)
sentences_file_data.to_csv("filtered.txt",header = None, index = False)

Thanks


1 Answer


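You can use the numpy.where function to create a variable called 'remove' that is set to 1 when any of the conditions you outlined is met. First, create a list of the values in df2.

Condition 1: checks whether the cell value starts with any of the values in the list

Condition 2: same as above, but checks whether the cell value ends with any of the values in the list

Condition 3: splits each cell and checks whether any of the resulting words is in your list

After that, you can create the new data frame by filtering out the rows where 'remove' is 1.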

# Imports
import pandas as pd
import numpy as np

# Get the unique values from df2 in a list
l = list(set(df2['col']))

# Set conditions
c = df['col']

# Split each sentence into words, keeping the original index so the masks align
words = pd.DataFrame(c.str.split(' ').tolist(), index=c.index)

cond = (c.str.startswith(tuple(l))
        | c.str.endswith(tuple(l))
        | words.isin(l).any(axis=1))

# Assign 1 or 0
df['remove'] = np.where(cond, 1, 0)

# Create the output: keep the unflagged rows and drop the helper column
out = df[df['remove'] != 1].drop(['remove'], axis=1)

Printing out gives:

                          col
0  This is the first sentence
4      This is fifth_sentence
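
One thing to keep in mind: `str.startswith` and `str.endswith` match prefixes and suffixes, so they would also flag a sentence that begins with `fifth_sentence`. If only exact words should count, a word at the start or end of a sentence is already covered by condition 3, so a single set lookup over the split words may be enough. The following is a minimal sketch under that assumption, using the same sample data as this answer and assuming words are separated by whitespace; `bad_words` and `has_bad_word` are names made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col': ['This is the first sentence',
                           'Second this is another sentence',
                           'This is the third sentence forth',
                           'This is fifth sentence',
                           'This is fifth_sentence']})
df2 = pd.DataFrame({'col': ['Second', 'forth', 'fifth']})

# A set gives constant-time membership checks per word
bad_words = set(df2['col'])

# True when at least one whitespace-separated word of the sentence is in bad_words
has_bad_word = df['col'].map(lambda s: not bad_words.isdisjoint(s.split()))

# Keep only the sentences without any listed word
out = df[~has_bad_word]
print(out)
```

This avoids building a wide intermediate data frame with one column per word position, which can be costly when some sentences are long.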

References:

pandas: select rows where a string starts with any item in a list

Check if a column contains any str from a list

Data frames used:

>>> df.to_dict()

{'col': {0: 'This is the first sentence',
  1: 'Second this is another sentence',
  2: 'This is the third sentence forth',
  3: 'This is fifth sentence',
  4: 'This is fifth_sentence'}}

>>> df2.to_dict()

{'col': {0: 'Second', 1: 'forth', 2: 'fifth'}}
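
Since the question mentions millions of rows and exporting the result, it may also be worth streaming the files line by line rather than loading everything into memory. The sketch below is only one possible approach and rests on assumptions not stated in the question: one sentence per line in the input file, one word per line in the word file, and whitespace-separated words. The function name `filter_sentences` and its parameters are hypothetical.

```python
def filter_sentences(sentences_path, words_path, out_path):
    """Copy to out_path every line of sentences_path that contains no listed word.

    Assumes one sentence per line and one word per line (hypothetical layout).
    Returns the number of lines written.
    """
    # Load the words once into a set for constant-time lookups
    with open(words_path, encoding="utf-8") as f:
        bad_words = {line.strip() for line in f if line.strip()}

    kept = 0
    with open(sentences_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Exact-word match: compare whole whitespace-separated tokens only
            if bad_words.isdisjoint(line.split()):
                dst.write(line)
                kept += 1
    return kept
```

With the file names from the question, a call could look like `filter_sentences("setences.txt", "words.txt", "filtered.txt")`. Each sentence is read, checked, and written once, so memory use depends on the word list and not on the number of sentences.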
answered 2021-06-11T09:12:27.177