python - Conditional merge of CSV files using Python (pandas)

Question

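I am trying to merge two or more files that share the same schema.
The files will contain duplicate entries, but the duplicated rows won't be identical. For example: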

file1:
store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111

file2:
store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282

Expected output:
9191,9827 Park st Apt82,999999999
8181,543 Hello st,1111111111
7171,912 John st,87282728282

Notice that 9191,9827 Park st,999999999 and 9191,9827 Park st Apt82,999999999 match on store_id and phone, but I chose the row from file2 because its address is more descriptive.

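store_id + phone is my composite primary key for looking up a location and finding duplicates (store_id alone would be enough in the example above, but I need a key based on multiple column values).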

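The problem:
- I need to merge multiple CSV files that have the same schema but contain duplicate rows.
- The row-level merge should have logic to pick specific values for a row based on their contents, for example taking the phone from file1 and the address from file2.
- A combination of one or more column values defines whether rows are duplicates.

Can this be done with pandas?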

score 1 · Accepted Answer

One way to smash them together is to use merge (on store_id and phone; if these were the index, it would be a join rather than a merge):

In [11]: res = df1.merge(df2, on=['store_id', 'phone'], how='outer')

In [12]: res
Out[12]:
   store_id     address_x        phone           address_y
0      9191  9827 Park st    999999999  9827 Park st Apt82
1      8181  543 Hello st   1111111111                 NaN
2      7171           NaN  87282728282         912 John st

You can then use where to select address_y if it exists, and address_x otherwise:

In [13]: res['address'] = res.address_y.where(res.address_y.notnull(), res.address_x)

In [14]: del res['address_x'], res['address_y']

In [15]: res
Out[15]: 
   store_id        phone             address
0      9191    999999999  9827 Park st Apt82
1      8181   1111111111        543 Hello st
2      7171  87282728282         912 John st
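
Note that the where step above prefers the file2 address whenever one exists. If the goal is to keep the more descriptive value regardless of which file it came from, one possible variation is to compare lengths after the merge. This is only a sketch: treating "more descriptive" as "longer string" is my assumption, not something the question defines.

```python
import io
import pandas as pd

t1 = """store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111"""
t2 = """store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282"""

df1 = pd.read_csv(io.StringIO(t1))
df2 = pd.read_csv(io.StringIO(t2))

res = df1.merge(df2, on=['store_id', 'phone'], how='outer')

# Treat a missing address as length 0 so the side that has a value always wins.
len_x = res['address_x'].fillna('').str.len()
len_y = res['address_y'].fillna('').str.len()

# Keep address_y where it is at least as long, otherwise fall back to address_x.
res['address'] = res['address_y'].where(len_y >= len_x, res['address_x'])
res = res.drop(columns=['address_x', 'address_y'])
```

On ties this keeps the file2 value; flip the comparison to `>` if file1 should win instead.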

score 1 · Accepted Answer

How about using concat, groupby, and agg? Then you can write an agg function to choose the right value:

import pandas as pd
import io

t1 = """store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111"""

t2 = """store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282"""

# The sample data are str, so wrap them in StringIO (BytesIO needs bytes on Python 3)
df1 = pd.read_csv(io.StringIO(t1))
df2 = pd.read_csv(io.StringIO(t2))

df = pd.concat([df1, df2]).reset_index(drop=True)

def f(s):
    # Return the longest string in the group, i.e. the most descriptive value
    loc = s.str.len().idxmax()
    return s[loc]

df.groupby(["store_id", "phone"]).agg(f)
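
The same idea extends to any number of files and to a separate rule per column. Below is a minimal sketch under a few assumptions: the `sources` list stands in for real CSV files, and `longest`, `KEY`, and `rules` are hypothetical names introduced here for illustration.

```python
import io
import pandas as pd

# Stand-ins for real CSV files; with files on disk, pass the paths to pd.read_csv instead.
sources = [
    """store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111""",
    """store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282""",
]

KEY = ["store_id", "phone"]  # composite key that defines a duplicate


def longest(s):
    """Return the longest non-null string in the group (hypothetical rule)."""
    s = s.dropna().astype(str)
    return s.loc[s.str.len().idxmax()] if len(s) else None


frames = [pd.read_csv(io.StringIO(text)) for text in sources]
combined = pd.concat(frames, ignore_index=True)

# One rule per non-key column; other columns could use "first", "last", "max", etc.
rules = {"address": longest}

# sort=False keeps the groups in order of first appearance.
merged = combined.groupby(KEY, sort=False).agg(rules).reset_index()
merged = merged[["store_id", "address", "phone"]]  # restore the original column order
```

For the sample data this should give the three rows from the expected output in the question. Writing the result back out is then a matter of `merged.to_csv(...)` with `index=False`.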
