I have two data frames, each with a different number of rows. Below are a few rows from each data set:
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
and
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
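I combined them side by side using combined_data = pandas.concat([df1, df2], axis = 1). My next goal is to compare each string under df1['Company'] to each string under df2['FDA Company'] using several different matching commands from the fuzzy wuzzy module, and to return the value of the best match and its name. I want to store that in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL in df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved under a new column in combined_data. The results would look like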
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
I tried doing
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
but got an error because the columns have different lengths.
I am stumped. How can I accomplish this?
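To make the goal concrete, here is a minimal sketch of the "compare each name against every candidate and keep the best one" step. It is not the fuzzy wuzzy API: it uses the standard library's difflib.SequenceMatcher as a stand-in scorer so that it runs without extra installs, and the helper names ratio and best_match are hypothetical. With fuzzy wuzzy installed, process.extractOne(query, choices, scorer=fuzz.token_sort_ratio) should play the same role, returning a (match, score) pair for a list of choices.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in similarity score on a 0-100 scale (an assumption, not fuzz.ratio itself)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(query, choices, scorer=ratio):
    """Score query against every choice and return the (choice, score) pair with the top score."""
    return max(((c, scorer(query, c)) for c in choices), key=lambda pair: pair[1])


# Sample values copied from the df1['Company'] and df2['FDA Company'] rows above.
companies = [
    "FREDDIE LEES AMERICAN GOURMET SAUCE",
    "CITYARCHRIVER 2015 FOUNDATION",
    "GLAXOSMITHKLINE CONSUMER HEALTHCARE",
    "LACKEY SHEET METAL",
]
fda_companies = [
    "LACKEY SHEET METAL",
    "PRIMUS STERILIZER COMPANY LLC",
    "HELGET GAS PRODUCTS INC",
    "ORTHOQUEST LLC",
]

# One (best FDA name, score) pair per company, in the same order as companies.
matches = [best_match(name, fda_companies) for name in companies]

for name, (match, score) in zip(companies, matches):
    print(f"{name!r} -> {match!r} ({score})")
```

Because matches has one entry per row of df1, the names and scores could presumably be assigned as new columns on df1 before concatenating, which would sidestep the row-wise apply over two columns of different lengths.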