
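I have a pandas DataFrame in which every row of a column holds a long string (see the variable "dframe"). In a separate list I store all the keywords, which I have to compare with every word of every string in the DataFrame. If a keyword is found, I have to record it as a hit and mark the sentence in which it was found. I am using a complex for loop with a few "if" statements, which gives me the correct output but is not very efficient. Running it on my whole set, with 130 keywords and thousands of rows to iterate over, takes almost 4 hours.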

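I would like to apply some lambda function for optimisation, and this is what I am struggling with. Below I give you an idea of my data set and my current code.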

import pandas as pd
from fuzzywuzzy import fuzz


dframe = pd.DataFrame({ 'Email' : ['this is a first very long e-mail about fraud and money',
                           'this is a second e-mail about money',
                           'this would be a next message where people talk about secret information',
                           'this is a sentence where someone misspelled word frad',
                           'this sentence has no keyword']})

keywords = ['fraud','money','secret']


keyword_set = set(keywords)

dframe['Flag'] = False
dframe['part_word'] = 0
output = []


for k in range(0, len(keywords)):
    count_ = 0
    dframe['Flag'] = False
    for j in range(0, len(dframe['Email'])):
        row_list = []
        print(str(k) + '  /  ' + str(len(keywords)) + '  ||  ' +  str(j) + '  /  ' + str(len(dframe['Email'])))
        for i in dframe['Email'][j].split():
            if dframe['part_word'][j] != 0 :
                row_list = dframe['part_word'][j]


            fuz_part = fuzz.partial_ratio(keywords[k].lower(),i.lower())
            fuz_set = fuzz.token_set_ratio(keywords[k],i)

            if ((fuz_part > 90) | (fuz_set > 85)) & (len(i) > 3):
                if keywords[k] not in row_list:
                    row_list.append(keywords[k])
                    print(keywords[k] + '  found as :  ' + i)
                dframe['Flag'][j] = True
                dframe['part_word'][j] = row_list


    count_ = dframe['Flag'].values.sum()
    if count_ > 0:

        y = keywords[k] + ' ' + str(count_)
        output.append(y)
    else:
        y = keywords[k] + ' ' + '0'
        output.append(y)          

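Maybe someone with experience in lambda functions could give me a hint on how I could apply one to my DataFrame to perform a similar operation? After splitting the whole sentence of each row into separate words, it would need to apply fuzzy matching inside the lambda and pick the word with the highest match score, on the condition that the score is greater than 85 or 90. This is the part I am confused about. Thanks in advance for your help.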


1 Answer


I don't have a lambda function for you, but you can apply a regular function to dframe.Email:

import pandas as pd
from fuzzywuzzy import fuzz

First, create the same example DataFrame as yours:

dframe = pd.DataFrame({ 'Email' : ['this is a first very long e-mail about fraud and money',
                       'this is a second e-mail about money',
                       'this would be a next message where people talk about secret information',
                       'this is a sentence where someone misspelled word frad',
                       'this sentence has no keyword']})

keywords = ['fraud','money','secret']

This is the function to apply:

def fct(sntnc, kwds):
    mtch = []
    for kwd in kwds:
        fuz_part = [fuzz.partial_ratio(kwd.lower(), w.lower()) > 90 for w in sntnc.split()]
        fuz_set = [fuzz.token_set_ratio(kwd, w) > 85 for w in sntnc.split()]
        bL = [len(w) > 3 for w in sntnc.split()]
        mtch.append(any([(p | s) & l for p, s, l in zip(fuz_part, fuz_set, bL)]))
    return mtch

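For each keyword, it evaluates fuz_part > 90 for every word in the sentence, and does the same for fuz_set > 85 and for wordlength > 3. Finally, for each keyword, it stores in a list whether ((fuz_part > 90) | (fuz_set > 85)) & (wordlength > 3) holds for any word of the sentence.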
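The same condition can also be checked in a single pass over the words, which avoids splitting the sentence three times and lets any() stop at the first hit. What follows is only a minimal sketch, not the code from this answer: match_flags and similarity are hypothetical names, and similarity uses difflib.SequenceMatcher from the standard library as a stand-in for the fuzzywuzzy scorers so that the example runs without extra packages. The threshold of 85 and the length filter are taken from the question.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer (assumption): 0-100 similarity based on difflib."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_flags(sentence, kwds, threshold=85, min_len=3):
    """Return one bool per keyword: does any sufficiently long word match it?"""
    # Split once and drop short words up front instead of once per scorer.
    words = [w for w in sentence.split() if len(w) > min_len]
    # any() short-circuits, so scoring stops at the first matching word.
    return [any(similarity(kwd, w) > threshold for w in words) for kwd in kwds]


keywords = ['fraud', 'money', 'secret']
flags = match_flags('this is a sentence where someone misspelled word frad', keywords)
print(flags)
```

With the DataFrame from the question, such a function could be used the same way as fct, for example dframe.Email.apply(match_flags, kwds=keywords). To keep the original scoring, the body of similarity would be replaced by the fuzz.partial_ratio and fuzz.token_set_ratio checks.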

This is how it is applied and how the result is built:

s = dframe.Email.apply(fct, kwds=keywords)
s = s.apply(pd.Series).set_axis(keywords, axis=1)  # the inplace argument was removed in pandas 2.0
dframe = pd.concat([dframe, s], axis=1)

Result:

result = dframe.drop(columns='Email')  # a positional axis argument is no longer accepted in pandas 2.0
#    fraud  money  secret
# 0   True   True   False
# 1  False   True   False
# 2  False  False    True
# 3   True  False   False
# 4  False  False   False

result.sum()
# fraud     2
# money     2
# secret    1
# dtype: int64
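
Since the same words recur across thousands of e-mails, a large share of the runtime probably goes into re-scoring identical (keyword, word) pairs. One possible way to reduce that is to score each distinct word only once and reuse the result. The sketch below rests on several assumptions: keywords_for_word and keywords_in are hypothetical names, the keyword list is fixed at module level, and difflib.SequenceMatcher stands in for the fuzzywuzzy scorers so that the example runs without extra packages.

```python
from difflib import SequenceMatcher
from functools import lru_cache

KEYWORDS = ('fraud', 'money', 'secret')


@lru_cache(maxsize=None)
def keywords_for_word(word):
    """Return the keywords matched by one word; the result is cached per word."""
    if len(word) <= 3:
        return ()
    return tuple(
        kwd for kwd in KEYWORDS
        # Stand-in scorer (assumption): difflib ratio scaled to 0-100.
        if 100 * SequenceMatcher(None, kwd, word).ratio() > 85
    )


def keywords_in(sentence):
    """Collect the keywords matched anywhere in a sentence, in keyword order."""
    found = set()
    for word in sentence.lower().split():
        found.update(keywords_for_word(word))
    return [kwd for kwd in KEYWORDS if kwd in found]


emails = [
    'this is a first very long e-mail about fraud and money',
    'this is a second e-mail about money',
    'this would be a next message where people talk about secret information',
    'this is a sentence where someone misspelled word frad',
    'this sentence has no keyword',
]
matches = [keywords_in(e) for e in emails]
# Number of e-mails in which each keyword was found.
counts = {kwd: sum(kwd in m for m in matches) for kwd in KEYWORDS}
print(counts)
```

On a DataFrame this could be applied as dframe['Email'].apply(keywords_in), which yields the list of matched keywords per row, similar to the part_word column in the question. How much time the cache saves depends on how often words repeat in the real data.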
Answered 2019-05-24T20:12:53.053