python - Pandas: Find matching rows in two dataframes (without using `merge`)

Question

Suppose I have these two dataframes, with the same number of columns but possibly a different number of rows:

import numpy as np
import pandas as pd

tmp = np.arange(0,12).reshape((4,3))
df = pd.DataFrame(data=tmp)

tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)

print(df)
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

print(df2)
     a    b    c
0    3    4    5
1  100    4  100
2  101  100    3

I want to check whether each row of df2 matches any row of df, i.e. I want to obtain a Series (or array) of booleans that gives this result:

0     True
1    False
2    False
dtype: bool

I thought something like the isin method should work, but I get this result, which is a dataframe and is wrong:

print(df2.isin(df))
       a      b      c
0  False  False  False
1  False  False  False
2  False  False  False

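As a constraint, I would prefer not to use the merge method, since what I am doing is in fact checking the data before applying a merge. Thanks for your help!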
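A note on why the attempt above returns all False: when `DataFrame.isin` is given another DataFrame, pandas compares cells that share the same index label and column label, rather than searching for each row anywhere in the other frame. Here the column labels differ (0, 1, 2 versus a, b, c), so no cell has a counterpart. The following is a minimal sketch that rebuilds the frames from the question; the name `df_same_cols` is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))
df2 = pd.DataFrame({'a': [3, 100, 101], 'b': [4, 4, 100], 'c': [5, 100, 3]})

# Column labels differ (0, 1, 2 vs a, b, c), so no cell has a counterpart.
no_overlap = df2.isin(df)

# Give df the same column labels as df2 (illustrative copy).
df_same_cols = df.copy()
df_same_cols.columns = df2.columns

# Now each cell of df2 is compared only with the cell of df_same_cols at the
# same index label and column label, not with every row.
aligned = df2.isin(df_same_cols)
print(aligned)
```

Even with matching labels, the only True is at row 1, column b (4 equals 4 at that position), which is why this form of `isin` does not answer the row-matching question.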

score 5 · Accepted Answer

You can use numpy.isin, which checks every element of the first array against the second array and returns True or False for each element.

Then call all() on each row; it returns True only when every element in that row is True, which gives the desired output:

>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])

0     True
1    False
2    False
dtype: bool

A breakdown of what is happening:

# np.isin
>>> np.isin(df2.values,df.values)

Out[139]: 
array([[ True,  True,  True],
       [False,  True, False],
       [False, False,  True]])

# all()
>>> [m.all() for m in np.isin(df2.values,df.values)]

Out[140]: [True, False, False]

# pd.Series()
>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])

Out[141]: 
0     True
1    False
2    False
dtype: bool
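One thing to keep in mind with this approach: `np.isin` tests whether each value occurs anywhere in `df.values`, so a row whose values come from different rows of `df` would also be reported as a match. The sketch below illustrates that and shows one possible row-wise comparison using NumPy broadcasting. The frame `probe` and the helper `rows_in` are hypothetical names introduced only for this example, and the helper assumes both frames have the same number of columns in the same order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))
df2 = pd.DataFrame({'a': [3, 100, 101], 'b': [4, 4, 100], 'c': [5, 100, 3]})

# Hypothetical extra frame: 0, 4 and 8 all occur in df, but in three different rows.
probe = pd.DataFrame({'a': [0], 'b': [4], 'c': [8]})

# Element-wise membership reports a match even though no row of df is [0, 4, 8].
elementwise = np.isin(probe.values, df.values).all(axis=1)

def rows_in(left, right):
    """Return a boolean Series: does each row of `left` equal some row of `right`?"""
    # Shape (n_left, n_right, n_cols): compare every left row with every right row.
    equal = left.values[:, None, :] == right.values[None, :, :]
    # A row matches if all of its columns agree with at least one right row.
    return pd.Series(equal.all(axis=2).any(axis=1), index=left.index)

print(elementwise)         # element-wise check on probe
print(rows_in(probe, df))  # row-wise check on probe
print(rows_in(df2, df))    # row-wise check on the question's data
```

The broadcast comparison builds an intermediate array of size n_left × n_right × n_cols, so it is only practical for frames of moderate size.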

score 1 · Accepted Answer

There may be more efficient solutions, but you can append the two dataframes and then call duplicated on the result, e.g.:

df.append(df2).duplicated().iloc[df.shape[0]:]

This assumes that all rows within each DataFrame are distinct. Here are some benchmarks:

tmp1 = np.arange(0,12).reshape((4,3))
df1 = pd.DataFrame(data=tmp1,  columns=["a", "b", "c"]) 

tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)

df1 = pd.concat([df1] * 10_000).reset_index()
df2 = pd.concat([df2] * 10_000).reset_index()

%timeit df1.append(df2).duplicated().iloc[df1.shape[0]:]
# 100 loops, best of 5: 4.16 ms per loop
%timeit pd.Series([m.all() for m in np.isin(df2.values,df1.values)])
# 10 loops, best of 5: 74.9 ms per loop
%timeit df2.apply(frozenset, axis=1).isin(df1.apply(frozenset, axis=1))
# 1 loop, best of 5: 443 ms per loop
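Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same idea can be written with `pd.concat`. The following is a minimal sketch on the small example frames, not a re-run of the benchmark; it assumes both frames use the same column labels, as `df1` and `df2` do here, and the names `stacked` and `mask` are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])
df2 = pd.DataFrame({'a': [3, 100, 101], 'b': [4, 4, 100], 'c': [5, 100, 3]})

# Stack df1 on top of df2. duplicated() marks a row True when an identical
# row appeared earlier in the stacked frame.
stacked = pd.concat([df1, df2], ignore_index=True)

# Keep only the part of the mask that belongs to df2 and renumber it from 0.
mask = stacked.duplicated().iloc[len(df1):].reset_index(drop=True)
print(mask)
```

As with the original one-liner, a row that is repeated inside `df2` itself would also be flagged, which is why the approach relies on the rows within each frame being distinct.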

score 1 · Accepted Answer

Using np.in1d:

>>> df2.apply(lambda x: all(np.in1d(x, df)), axis=1)
0     True
1    False
2    False
dtype: bool

Another way, using frozenset:

>>> df2.apply(frozenset, axis=1).isin(df1.apply(frozenset, axis=1))
0     True
1    False
2    False
dtype: bool
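A frozenset ignores the order of the values, so a row with the same values in different columns would also count as a match. If column order matters, an order-preserving key such as a tuple is one option. The sketch below compares the two on a small hypothetical frame (`other` is not part of the question's data).

```python
import pandas as pd

df = pd.DataFrame({'a': [3], 'b': [4], 'c': [5]})

# Hypothetical rows: the first is df's row reversed, the second is identical to it.
other = pd.DataFrame({'a': [5, 3], 'b': [4, 4], 'c': [3, 5]})

# Set-based keys: order is ignored, so the reversed row also matches.
as_sets = other.apply(frozenset, axis=1).isin(df.apply(frozenset, axis=1))

# Tuple-based keys: order is preserved, so only the identical row matches.
as_tuples = other.apply(tuple, axis=1).isin(df.apply(tuple, axis=1))

print(as_sets)
print(as_tuples)
```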

score 1 · Accepted Answer

You can use a MultiIndex (expensive, IMO):

pd.MultiIndex.from_frame(df2).isin(pd.MultiIndex.from_frame(df))
Out[32]: array([ True, False, False])

Another option is to create a dictionary and then run isin:

df2.isin({key : array.array for key, (_, array) in zip(df2, df.items())}).all(1)
Out[45]: 
0     True
1    False
2    False
dtype: bool

score 0 · Accepted Answer


Try:

df[~df.apply(tuple,1).isin(df2.apply(tuple,1))]

Here is my result:

Answered on 2021-12-23T11:02:28.553
