pandas - Filter rows by a list of keywords

Question

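I have a list of keywords (promote, want to, always). A keyword is not always a single word, e.g. "want to". The dataset I use is training.1600000.processed.noemoticon.csv, which can be found here: https://www.kaggle.com/kazanova/sentiment140

I need to know which keywords from the list appear in each row of the 'Text' column, whether as a whole string (e.g. "always") or as a substring (e.g. "alwaysfurst"), case-insensitively. For example, one row may contain only "promote", while another contains both "want to" and "always". So I have to create a new column that contains every keyword found in the row, each listed only once. I keep only the rows that contain at least one keyword.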
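To make the matching rule concrete, here is a minimal plain-Python sketch of what I expect for a single text. The helper name find_keywords and the sample strings are made up for illustration; they are not part of my actual code or of the dataset.

```python
import re

# Keywords from the question; a keyword may contain a space ("want to").
KEYWORDS = ["promote", "want to", "always"]

# re.escape guards against regex metacharacters in the keywords;
# re.IGNORECASE makes the search case-insensitive.
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def find_keywords(text):
    """Return the sorted, de-duplicated keywords found anywhere in text."""
    return sorted({match.lower() for match in PATTERN.findall(text)})


# Made-up sample rows, not taken from the dataset.
samples = [
    "I ALWAYS want to sleep, always",  # two keywords, one of them repeated
    "alwaysfurst in line",             # keyword as a substring
    "nothing relevant here",           # no keyword, so the row would be dropped
]
for text in samples:
    print(find_keywords(text))
```

A row whose result is an empty list is one I would discard.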

Here is my code:

%%time
import pandas as pd
header_list = ['A','No','Date','Query','User','Text']
df_p = pd.read_csv(r'C:\Users\User\Desktop\PYTHON\training.1600000.processed.noemoticon.csv', encoding='latin-1', names=header_list)
# create new column
df_p['long'] = df_p['Text'].str.lower().str.findall('promote|want to|always').apply(set).astype(str)
# delete rows without any keywords present in the new column
df_p.drop(df_p[df_p['long'] == 'set()'].index, inplace = True)

To check that the new df contains only unique combinations of the keywords from the list, I call .value_counts() on the 'long' column:

df_p['long'].value_counts()

That works fine.

Wall time: 12 s

{'want to'}               21888
{'always'}                14642
{'promote'}                 325
{'always', 'want to'}       197
{'promote', 'want to'}       11
{'promote', 'always'}         7
Name: long, dtype: int64

I tried to use Modin by changing the first line to "import modin.pandas as pd", but the run took longer (almost twice as long) and printed some warnings:

Dask execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

from distributed import Client
client = Client()

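I gave up on Modin and tried to insert "swifter" before "apply", but I got "AttributeError: 'Series' object has no attribute '_is_builtin_func'".

Is there a way to improve the code to get better performance? Or is there another way to do this? (Maybe Vaex?)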
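One variant that might be worth timing is a single pass over the column with a precompiled, case-insensitive regex, instead of chaining .str.lower(), .str.findall(), .apply(set) and .astype(str). The following is only a sketch under assumptions: the small DataFrame is made-up stand-in data, the keywords are treated as plain text, and whether it is actually faster on the full 1.6M-row file has not been measured here.

```python
import re

import pandas as pd

keywords = ["promote", "want to", "always"]
# Compile once; re.IGNORECASE replaces the separate .str.lower() pass.
pattern = re.compile("|".join(map(re.escape, keywords)), flags=re.IGNORECASE)

# Tiny made-up frame standing in for the 'Text' column of the real dataset.
df = pd.DataFrame(
    {"Text": ["I ALWAYS want to sleep", "Promote this", "hello", "alwaysfurst"]}
)

# One Python-level pass: collect the unique keywords per row and store them
# as a sorted, comma-joined string, so equal combinations always compare
# equal in value_counts().
df["long"] = [
    ", ".join(sorted({m.lower() for m in pattern.findall(text)}))
    for text in df["Text"]
]

# Keep only rows with at least one keyword (boolean mask instead of drop()).
hits = df[df["long"] != ""]
print(hits["long"].value_counts())
```

Storing a sorted, joined string rather than str(set) avoids depending on how a set happens to be printed.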
