dataframe - Python alternative to fuzz.token_set_ratio to reduce execution time

Question

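I am working on a name-matching problem in which given customer names must be compared against 2.5 million existing customer records stored in a csv file. Below is the code I have tried; matching a single name takes 5-12 minutes. Since this will be integrated with an RPA process as an API, I have been advised to achieve the same result some other way within a minute or two.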

from fuzzywuzzy import fuzz
import pandas as pd
import time

# names is the list passed to the program as parameter
names_with_sno = [[sno, name] for sno, name in enumerate(names, 1)]

# dataframe created for the given customer names
df1 = pd.DataFrame(names_with_sno, columns=['s_no','SDN_NAME_SERACH'])

# dataframe for customer database via csv
cust_2 = pd.read_csv(r'...\customer-database-extract\extract.CSV')

# .... preprocessing of both the dataframes
# .... which are not time consuming ones

### CROSS JOIN
# cross join between the given names and the customer database
# create a common key in the dataframe holding the given names
df1["key"] = 1

# create the same common key in the customer db dataset
cust_2["key"] = 1

# cross join on the common key, then drop the key column
final_df = pd.merge(df1, cust_2, on="key").drop(columns="key")

def get_ratio(df):
    # df is a single row of the cross-joined dataframe
    cust_name = df["FIRST_NAME"]
    hit_name = df["SDN_NAME_SERACH"]
    return fuzz.token_set_ratio(cust_name, hit_name)

st = time.mktime(time.localtime())

# apply the name-matching function row by row and store the scores in a series
final_series = final_df.apply(get_ratio, axis=1)

# secondsToText is a helper defined elsewhere (not shown) that formats elapsed seconds
print('\n\nt23 - df.apply(get_ratio) - ', secondsToText(time.mktime(time.localtime()) - st))

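Here, df1 is the dataframe of the given names and cust_2 is the DB extract read from the csv file. The print statement reports the following time:

t23 - df.apply(get_ratio) - 5.0 minutes 42.0 seconds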
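
One direction that might be worth trying, sketched here under assumptions rather than measured: skip the cross join and the row-wise `apply`, and score each given name directly against the distinct customer names, keeping only hits above a cutoff. In the sketch below, `simple_token_set_ratio` is a hypothetical, simplified stand-in for `fuzz.token_set_ratio` built on the standard library's `difflib`; it does not reproduce fuzzywuzzy's preprocessing, and it is there to show the structure, not to be fast. `best_matches` and the threshold of 80 are likewise assumptions. In practice the scorer could be swapped for a compiled one, for example the `token_set_ratio` scorer in the RapidFuzz library, which is not shown here.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale (difflib stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_token_set_ratio(s1: str, s2: str) -> int:
    """Simplified token-set comparison; NOT fuzzywuzzy's exact implementation."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 or not t2:
        return 0
    # Shared tokens, then shared tokens plus each side's leftover tokens.
    common = " ".join(sorted(t1 & t2))
    combined1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


def best_matches(queries, customer_names, threshold=80):
    """Score each query against each distinct customer name once.

    Returns {query: [(name, score), ...]} with scores >= threshold,
    best score first. No cross-joined dataframe is materialized.
    """
    unique_names = set(customer_names)  # duplicate names are scored only once
    hits = {}
    for query in queries:
        scored = (
            (name, simple_token_set_ratio(query, name)) for name in unique_names
        )
        kept = [(name, score) for name, score in scored if score >= threshold]
        hits[query] = sorted(kept, key=lambda pair: (-pair[1], pair[0]))
    return hits


if __name__ == "__main__":
    customers = ["Smith John", "John Smith Jr", "Alice Brown", "Smith John"]
    print(best_matches(["John Smith"], customers))
```

With the dataframes from the question, the inputs would presumably be `df1["SDN_NAME_SERACH"]` and `cust_2["FIRST_NAME"]`. Deduplicating the 2.5 million names first only helps if many of them repeat, which is an assumption about the data.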
