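I am working on a name-matching problem in which customer names have to be compared against 2.5 million existing customer records stored in a CSV file. Below is the code I have tried; matching a single name takes 5-12 minutes. Since this will be integrated as an API with an RPA process, please suggest any other way to achieve the same result within one or two minutes.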
import pandas as pd
from fuzzywuzzy import fuzz
import time
# names is the list passed to the program as parameter
names_with_sno = [[sno, name] for sno, name in enumerate(names, 1)]
# dataframe created for the given customer names
df1 = pd.DataFrame(names_with_sno, columns=['s_no','SDN_NAME_SERACH'])
# dataframe for customer database via csv
cust_2 = pd.read_csv(r'...\customer-database-extract\extract.CSV')
# .... preprocessing of both the dataframes
# .... which are not time consuming ones
### CROSS JOIN
#doing the cross join between the given names and customer database
#creating common key in the dataframe having the given names
df1["key"]=1
#creating common key in customer db dataset
cust_2["key"]=1
#dropping the common column key after creating the cross join
final_df = pd.merge(df1,cust_2,on="key").drop(columns="key")
def get_ratio(df):
    # df is one row of the cross-joined frame
    cust_name = df["FIRST_NAME"]
    hit_name = df["SDN_NAME_SERACH"]
    return fuzz.token_set_ratio(cust_name, hit_name)
st = time.mktime(time.localtime())
#applying the function for name matching and storing the result in a series
final_series = final_df.apply(get_ratio,axis=1)
print('\n\nt23 - df.apply(get_ratio) - ',secondsToText(time.mktime(time.localtime()) - st))
Here, df1 is the dataframe of the given names and cust_2 is the DB extract read from the CSV file. The print statement reports the time as:
t23 - df.apply(get_ratio) - 5.0 minutes 42.0 seconds
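One direction that may be worth exploring is to skip the cross join and the row-wise `apply` altogether, and instead score each given name directly against a plain Python list of customer names (for example `cust_2["FIRST_NAME"].tolist()`), keeping only the best few hits. The following is a minimal sketch of that idea, not a measured solution. The helper names `token_set_score` and `best_matches` and the sample data are hypothetical, and the scorer is a standard-library stand-in built on `difflib` so the sketch runs on its own. It does not reproduce `fuzz.token_set_ratio`; that function can be passed in as `scorer` to keep the original scoring.

```python
import heapq
from difflib import SequenceMatcher


def token_set_score(a: str, b: str) -> int:
    """Stand-in scorer (assumption): 0-100 similarity of the sorted, de-duplicated tokens.

    This is NOT fuzzywuzzy's token_set_ratio; pass fuzz.token_set_ratio as
    `scorer` below to keep the original scoring.
    """
    norm_a = " ".join(sorted(set(a.lower().split())))
    norm_b = " ".join(sorted(set(b.lower().split())))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


def best_matches(query, customer_names, scorer=token_set_score, limit=3, min_score=0):
    """Return up to `limit` (score, customer_name) pairs, best score first.

    The customer names are streamed through a generator, so no cross-joined
    frame with one row per (query, customer) pair is ever materialized.
    """
    scored = ((scorer(query, cust), cust) for cust in customer_names)
    kept = (pair for pair in scored if pair[0] >= min_score)
    return heapq.nlargest(limit, kept, key=lambda pair: pair[0])


# Hypothetical sample data standing in for cust_2["FIRST_NAME"].tolist() and `names`.
customers = ["John Smith", "Smith John", "Jane Doe", "Jon Smyth", "Alice Brown"]
names = ["john smith"]

results = {name: best_matches(name, customers, limit=2) for name in names}
print(results)
```

Whether this alone gets 2.5 million comparisons under the one-to-two-minute target is untested here. The sketch only removes the cost of building and iterating the cross-joined dataframe; the per-pair scoring cost stays the same.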