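I am working with two dataframes (df1 and df2), and I want to merge df2 into df1 by matching on names. The names do not match exactly between the two (e.g. 'JS Smith' may appear as 'JS Smith ( Jr)'), and the names in df1 sit in a "|"-separated list of name variants.
In addition, df2 has one more column containing slightly different names, which I want to fall back on if there is no match on the original column.
Finally, I only want to bring in data from df2 if there is a unique match in df1, and I don't want to overwrite entries that were brought in earlier.
Here is a sample of the dfs:
df1 (where N1 denotes the first name in the list of name variants)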
   Name Variants
0  N1|N2|N3|N4
1  N1|N2|
2  N1|N2|N3
df2
   Name Type 1  Name Type 2  Data1  Data2  Data3
0  Name 0       Name 0.1     X      Y      Z
1  Name 1       Name 1.1     A      B      C
2  Name 2       Name 2.1     D      E      F
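Because the variants in df1 are packed into one "|"-separated string, one way to make them searchable is to split them into one row per name. The following is only a minimal sketch, assuming pandas 0.25 or later (for `explode`) and the sample df1 above. The names `variants`, `variant` and `df1_row` are made up for illustration.

```python
import pandas as pd

# Sample df1 from above (placeholder names).
df1 = pd.DataFrame({"Name Variants": ["N1|N2|N3|N4", "N1|N2|", "N1|N2|N3"]})

# One row per (df1 row, variant); the original df1 index ends up in "df1_row".
variants = (
    df1["Name Variants"]
    .str.split("|")
    .explode()
    .str.strip()
    .rename("variant")
    .rename_axis("df1_row")
    .reset_index()
)

# A trailing "|" (as in row 1) yields an empty string, so drop those entries.
variants = variants[variants["variant"] != ""].reset_index(drop=True)
print(variants)
```

With the variants in this long form, an exact lookup becomes an ordinary equality test or merge on the `variant` column rather than a substring search.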
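I want to match on "Name Type 2" first. Suppose the matches are:
- Name 0.1 -> one of the names in N1|N2 (row 1 of df1)
- Name 2.1 -> one of the names in N1|N2|N3|N4 (row 0 of df1)
- Name 1.1 -> does not match any name in df1, so I then check Name 1, which matches one of the names in N1|N2|N3 (row 2 of df1)
The resulting new df would look like this: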
   Name Variants  Matched Name  Data1  Data2  Data3  Matched
0  N1|N2|N3|N4    Name 2.1      D      E      F      True
1  N1|N2|         Name 0.1      X      Y      Z      True
2  N1|N2|N3       Name 1        A      B      C      True
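The two-pass behaviour described above (try "Name Type 2", fall back to "Name Type 1", fill a df1 row only once) could be sketched without a row-by-row search roughly as follows. This is a sketch under stated assumptions, not a tested solution for the real data. The concrete variant strings are invented so that the sample matches work out, exact matching after lower-casing stands in for any fuzzier rule, and a name that appears in more than one df1 row is simply dropped from the lookup (so it is treated as unmatched). `norm`, `name_to_row`, `pending` and `result` are hypothetical names.

```python
import pandas as pd

# Invented variants so that the sample matches described above actually occur.
df1 = pd.DataFrame({"Name Variants": ["Name 2.1|Name Two", "Name 0.1|Name Zero|", "Name 1|Name One"]})
df2 = pd.DataFrame({
    "Name Type 1": ["Name 0", "Name 1", "Name 2"],
    "Name Type 2": ["Name 0.1", "Name 1.1", "Name 2.1"],
    "Data1": ["X", "A", "D"],
    "Data2": ["Y", "B", "E"],
    "Data3": ["Z", "C", "F"],
})
data_cols = ["Data1", "Data2", "Data3"]

def norm(s):
    """Normalise a Series of names for comparison (assumption: case and outer spaces do not matter)."""
    return s.str.strip().str.lower()

# Lookup from each normalised variant to the single df1 row that contains it.
long = norm(df1["Name Variants"].str.split("|").explode())
long = long[long != ""]
lookup = long.rename("key").rename_axis("df1_row").reset_index().drop_duplicates()
lookup = lookup.drop_duplicates("key", keep=False)  # discard names found in several df1 rows
name_to_row = lookup.set_index("key")["df1_row"]

result = df1.copy()
for col in ["Matched Name"] + data_cols:
    result[col] = None
result["Matched"] = False

pending = df2.copy()  # df2 rows that have not matched anything yet
for name_col in ["Name Type 2", "Name Type 1"]:
    target = norm(pending[name_col]).map(name_to_row)
    found = target.notna()
    hits = pending[found].assign(df1_row=target[found].astype(int))
    pending = pending[~found]  # only unmatched rows fall back to the next column
    # Never overwrite a df1 row filled earlier; within a pass the first df2 row wins.
    filled = result.index[result["Matched"].to_numpy()]
    hits = hits[~hits["df1_row"].isin(filled)].drop_duplicates("df1_row", keep="first")
    if hits.empty:
        continue
    rows = hits["df1_row"].to_numpy()
    result.loc[rows, "Matched Name"] = hits[name_col].to_numpy()
    result.loc[rows, data_cols] = hits[data_cols].to_numpy()
    result.loc[rows, "Matched"] = True

print(result)
```

The dictionary-style lookup replaces one `str.contains` scan of df1 per df2 row with a single `map` per name column, which is where most of the time saving would be expected to come from.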
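My current approach is:
- Loop through each row in df2 and search df1 using
df1[df1['Name Variants'].str.contains('Name 0.1')]
- If there is a unique match (exactly 1 row found in df1) and "Matched" is not already marked True, I pull in the data
- If there are multiple matches, I do not pull in the data
- If there are no matches, I search for "Name 0" using the same method and run the same logic again (1 match, no data merged in yet, etc.)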
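One detail of the search step is worth illustrating: `str.contains` matches substrings and, by default, treats the search term as a regular expression, so it can return more rows than an exact comparison against each variant would. A hedged sketch of testing for one complete "|"-separated entry in a single pass follows. `whole_variant_mask` is a hypothetical helper and the sample strings are invented.

```python
import re
import pandas as pd

names = pd.Series(["name 0.1|name zero", "name 0x1 extra|other", "name 0.10|name ten"])

def whole_variant_mask(series, name):
    """True where `name` equals one complete '|'-separated entry of the string."""
    # re.escape makes ".", "(", "+" etc. literal; the groups anchor the name to "|" or the string ends.
    pattern = r"(?:^|\|)" + re.escape(name) + r"(?:\||$)"
    return series.str.contains(pattern, regex=True)

# Default behaviour: substring match, and "." matches any character.
substring_hits = names.str.contains("name 0.1")
# Anchored, escaped pattern: only the row with the exact entry matches.
exact_hits = whole_variant_mask(names, "name 0.1")

print(substring_hits.tolist())  # [True, True, True]
print(exact_hits.tolist())      # [True, False, False]
```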
My problems are:
- This is very time consuming
- I am missing matches because of the slight spelling differences I described at the start
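For the spelling differences, one possible direction is to normalise both sides before comparing and then accept only very close strings. The sketch below uses only the standard library (`re` and `difflib.get_close_matches`). The `normalize` and `closest_variant` helpers, the 0.85 cutoff and the sample variants are assumptions for illustration, and a cutoff would need tuning against real data.

```python
import re
from difflib import get_close_matches

def normalize(name):
    """Lowercase, drop parenthesised parts such as '( Jr)', and collapse whitespace."""
    name = re.sub(r"\([^)]*\)", " ", name.lower())
    return re.sub(r"\s+", " ", name).strip()

def closest_variant(name, variants, cutoff=0.85):
    """Return the original variant most similar to `name`, or None if none clears the cutoff."""
    by_norm = {normalize(v): v for v in variants if v.strip()}
    found = get_close_matches(normalize(name), list(by_norm), n=1, cutoff=cutoff)
    return by_norm[found[0]] if found else None

variants = "JS Smith ( Jr)|John S. Smith|Smith, J".split("|")
print(closest_variant("JS Smith", variants))  # JS Smith ( Jr)
print(closest_variant("Jane Doe", variants))  # None
```

Normalising first means the common case (a suffix in parentheses) becomes an exact match, so the similarity cutoff only has to handle genuine typos.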
Here is the code for my current approach:
global_brands = set(ep["Global Brand"].dropna().str.replace("&", "").str.lower())
products = set(ep["Product"].dropna().str.replace("&", "").str.lower())
gx_name = set(ep["Generic Name"].dropna().str.replace(";","").str.lower())
#%%
print(len(global_brands))
print(len(products))
print(len(gx_name))
#%%
"""
add transformed names to ep and db
"""
ep["alt_global_brands"] = ep["Global Brand"].fillna("").str.replace("&", "").str.lower()
ep["alt_product"] = ep["Product"].fillna("").str.replace("&", "").str.lower()
ep["alt_gx_name"] = ep["Generic Name"].fillna("").str.replace(";","").str.lower()
db["alt_drug_names"] = db["Trans Drug Name"].str.lower()
#%%
print(db.loc[1805,"alt_drug_names"].split("|")[0] == "buprenorphine naloxone")
#%%
print(ep.loc[166,"alt_product"] == "vx-661 ivacaftor")
#%%
ep['Match in db'] = ""
db['EP match'] = ""
num_product_nonmatches = 0
num_product_exact_matches = 0
double_matches = 0
for product in products:
    # Literal substring search (.loc replaces the removed .ix indexer); regex=False keeps
    # characters such as "(" or "+" in a name from being read as regex syntax.
    candidates = db.loc[db["alt_drug_names"].str.contains(product, regex=False, na=False), "alt_drug_names"]
    product_matches = len(candidates)
    if product_matches == 1:
        matched_row = candidates.index[0]
    if product_matches > 1:
        #print(db.ix[db["alt_drug_names"].str.contains(global_brand)]["alt_drug_names"].str.split("|"))
        # Several rows contain the substring: look for an exact match among each row's name variants.
        num_matched_rows = 0
        for row, value in candidates.items():
            names = value.split("|")
            for name in names:
                if product == name:
                    matched_row = row
                    num_matched_rows += 1
        if num_matched_rows == 1:
            product_matches = 1
        #elif num_matched_rows > 1: - At no point was there still a double match after looping through each rows name variants and looking for an exact match
        if num_matched_rows == 0:
            """
            Here after looping through the name variants there was no exact match
            This seems to be for assets that are too generic (ex: clonidine hydrochloride, rotavirus vaccine, etc.)
            Approach:
            1. Check if name has / to split and create combo
            2. If no / or still no match => leverage generic name
            """
            product_copy = product
            if "(" in product:
                product = product.split("(")[0].strip()
            if "/" in product:
                # Join the stripped fragments with single spaces, e.g. "a / b" -> "a b".
                product = " ".join(fragment.strip() for fragment in product.split("/"))
            candidates = db.loc[db["alt_drug_names"].str.contains(product, regex=False, na=False), "alt_drug_names"]
            if len(candidates) == 1:  # this instance does not occur
                product_matches = 1
                matched_row = candidates.index[0]
            elif len(candidates) > 1:
                num_matched_rows = 0
                for row, value in candidates.items():
                    names = value.split("|")
                    for name in names:
                        if product == name:
                            matched_row = row
                            num_matched_rows += 1
                if num_matched_rows == 1:
                    product_matches = 1
            # Restore the original name so the lookup in ep below still works.
            product = product_copy
    if product_matches == 0:
        num_product_nonmatches += 1
        """
        Check if name has / to split and create combo
        LEVERAGE GENERIC NAME
        """
        #product_name = ep[ep["Global Brand"].str.replace("&", "+")]
        #product_matches = len(db.ix[db["Drug Name"].str.contains(global_brand) and db.ix[db["Drug Name"].str.contains(global_brand)])
    if product_matches == 1:
        num_product_exact_matches += 1
        # print(product)
        # print(matched_row)
        #print(product)
        ep_row = ep[ep['alt_product'] == product].index[0]
        if ep.loc[ep_row,'Match in db'] == "":
            ep.loc[ep_row,'Match in db'] = "TRUE"
        # Only fill db rows that have not been matched before, so earlier entries are not overwritten.
        if db.loc[matched_row,'EP match'] == "":
            db.loc[matched_row, 'EP match'] = "TRUE"
            db.loc[matched_row, 'EP Global Name'] = ep.loc[ep_row, 'Global Brand']
            db.loc[matched_row, 'EP Product'] = ep.loc[ep_row, 'Product']
            db.loc[matched_row, 'EP Generic Name'] = ep.loc[ep_row, 'Generic Name']
            db.loc[matched_row, 'EP Company'] = ep.loc[ep_row, 'Company']
            db.loc[matched_row, 'EP Rx or OTC'] = ep.loc[ep_row, 'Prescription']
            db.loc[matched_row, 'EP markets'] = ep.loc[ep_row, 'Markets']
            columns = ['2015 Actual/ Est. (Sales)','WW sales - 2008','WW sales - 2009','WW sales - 2010','WW sales - 2011','WW sales - 2012','WW sales - 2013','WW sales - 2014','WW sales - 2015',
                       'WW sales - 2016','WW sales - 2017','WW sales - 2018','WW sales - 2019','WW sales - 2020','WW sales - 2021','WW sales - 2022','WW sales - 2023','WW sales - 2024','WW sales - 2025',
                       'WW CAGR (2018 or Launch - 2025)','WW Est. Launch','U.S. sales - 2008','U.S. sales - 2009','U.S. sales - 2010','U.S. sales - 2011','U.S. sales - 2012','U.S. sales - 2013',
                       'U.S. sales - 2014','U.S. sales - 2015','U.S. sales - 2016','U.S. sales - 2017','U.S. sales - 2018','U.S. sales - 2019','U.S. sales - 2020','U.S. sales - 2021','U.S. sales - 2022',
                       'U.S. sales - 2023','U.S. sales - 2024','U.S. sales - 2025','U.S. CAGR (2018 or Launch - 2025)','Forecasters','Forecast Statistics']
            for col in columns:
                db.loc[matched_row, col] = ep.loc[ep_row, col]
            db.loc[matched_row, 'U.S. Est. Launch'] = ep.loc[ep_row,'U.S. Est. Lauch']
#%%
print("EP non matches: " + str(num_product_nonmatches))
print("EP matches: " + str(num_product_exact_matches))
print("EP total: " + str(num_product_nonmatches + num_product_exact_matches))
print("EP total products: " + str(len(ep)))
print("EP length of product set: " + str(len(products)))
print("EP double_matches: " + str(double_matches))