python - Find the row where a column's value is maximal in a pandas DataFrame

Question

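How can I find the row for which the value of a specific column is maximal?

df.max() will give me the maximal value for each column, but I don't know how to get the corresponding row.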

score 314 · Accepted Answer

Use the pandas idxmax function. It's straightforward:

>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
          A         B         C
0  1.232853 -1.979459 -0.573626
1  0.140767  0.394940  1.068890
2  0.742023  1.343977 -0.579745
3  2.125299 -0.649328 -0.211692
4 -0.187253  1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1

Alternatively, you could also use numpy.argmax, as in numpy.argmax(df['A']). With the default integer index shown here it gives the same result, and in cursory observations it appears at least as fast as idxmax.
idxmax() returns index labels, not integers.
Example: if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not row 'd').
If you want the integer position of that label within the Index, you have to get it manually (which can be tricky now that duplicate row labels are allowed).
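
To make the label-versus-position distinction concrete, here is a minimal sketch. The Series `s` and its values are made up for illustration; `to_numpy().argmax()` is one way to get the position, not the only one.

```python
import pandas as pd

# Hypothetical data: string labels 'a'..'e', with the maximum in the 4th row.
s = pd.Series([0.1, 0.7, 0.3, 2.1, -0.2], index=list("abcde"))

label = s.idxmax()                # index label of the maximum
position = s.to_numpy().argmax()  # 0-based integer position of the maximum

print(label)     # d
print(position)  # 3 (the 4th row when counting from 1)
```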

HISTORICAL NOTES:

idxmax() used to be called argmax() prior to 0.11.
argmax was deprecated prior to 1.0.0 and removed entirely in 1.0.0.
Back as of pandas 0.16, argmax used to exist and perform the same function (though it appeared to run more slowly than idxmax).
The argmax function returned the integer position, within the index, of the row holding the maximum element.
pandas then moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.

For example, consider this toy DataFrame with a duplicate row label:

In [19]: dfrm
Out[19]: 
          A         B         C
a  0.143693  0.653810  0.586007
b  0.623582  0.312903  0.919076
c  0.165438  0.889809  0.000967
d  0.308245  0.787776  0.571195
e  0.870068  0.935626  0.606911
f  0.037602  0.855193  0.728495
g  0.605366  0.338105  0.696460
h  0.000000  0.090814  0.963927
i  0.688343  0.188468  0.352213
i  0.879000  0.105039  0.900260

In [20]: dfrm['A'].idxmax()
Out[20]: 'i'

In [21]: dfrm.loc[dfrm['A'].idxmax()]  # .ix instead of .loc in older versions of pandas
Out[21]: 
          A         B         C
i  0.688343  0.188468  0.352213
i  0.879000  0.105039  0.900260

So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
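
One way to sidestep the ambiguity is to work with positions rather than labels. The sketch below uses a small made-up frame with a duplicated label 'i' (not the dfrm above) and assumes that taking the argmax of the underlying NumPy array is acceptable for your data:

```python
import pandas as pd

# Hypothetical frame with a duplicated row label 'i'.
dfrm = pd.DataFrame({"A": [0.14, 0.69, 0.88], "B": [0.65, 0.19, 0.11]},
                    index=["a", "i", "i"])

label = dfrm["A"].idxmax()           # 'i', which matches two rows
by_label = dfrm.loc[label]           # DataFrame with both 'i' rows

pos = dfrm["A"].to_numpy().argmax()  # integer position of the max
by_position = dfrm.iloc[pos]         # exactly one row, as a Series

print(len(by_label))     # 2
print(by_position["A"])  # 0.88
```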

This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.

So you're left with hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests) -- otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours' worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figure out that it's because idxmax can only report the label of the max row, be disappointed that no standard function automatically gets the positions of the max row for you, write a buggy implementation yourself, edit the code, and pray you don't run into the problem again.

score 92 · Accepted Answer

You might also try idxmax:

In [5]: df = pandas.DataFrame(np.random.randn(10,3),columns=['A','B','C'])

In [6]: df
Out[6]: 
          A         B         C
0  2.001289  0.482561  1.579985
1 -0.991646 -0.387835  1.320236
2  0.143826 -1.096889  1.486508
3 -0.193056 -0.499020  1.536540
4 -2.083647 -3.074591  0.175772
5 -0.186138 -1.949731  0.287432
6 -0.480790 -1.771560 -0.930234
7  0.227383 -0.278253  2.102004
8 -0.002592  1.434192 -1.624915
9  0.404911 -2.167599 -0.452900

In [7]: df.idxmax()
Out[7]: 
A    0
B    8
C    7

For example:

In [8]: df.loc[df['A'].idxmax()]
Out[8]: 
A    2.001289
B    0.482561
C    1.579985

score 31 · Accepted Answer

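Both of the answers above return only one index if several rows share the maximum value. If you want all of those rows, there does not seem to be a built-in function for it, but it is not hard to do. Below is an example for a Series; the same can be done for a DataFrame: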

In [1]: from pandas import Series, DataFrame

In [2]: s=Series([2,4,4,3],index=['a','b','c','d'])

In [3]: s.idxmax()
Out[3]: 'b'

In [4]: s[s==s.max()]
Out[4]: 
b    4
c    4
dtype: int64
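
For a DataFrame the same boolean-mask idea can be sketched as follows. The column names and values here are hypothetical, picked so that column A has a tie:

```python
import pandas as pd

# Hypothetical frame: column A reaches its maximum (4) in rows 'b' and 'c'.
df = pd.DataFrame({"A": [2, 4, 4, 3], "B": [10, 20, 30, 40]},
                  index=["a", "b", "c", "d"])

# Keep every row whose A equals the column maximum.
all_max_rows = df[df["A"] == df["A"].max()]
print(all_max_rows)
```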

score 21 · Accepted Answer

df.iloc[df['columnX'].argmax()]

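argmax() gives the position of the maximum value of columnX (in recent pandas versions this is an integer position, not a label). iloc can then be used to get the row of the DataFrame df at that position.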

score 7 · Accepted Answer

Very simple: we have the df below and want to print the row with the max value in C:

In:

df.loc[df['C'] == df['C'].max()]   # condition check

Out:

A B C
y 2 10

score 7 · Accepted Answer

A more compact and readable solution using query() looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
print(df)

# find row with maximum A
df.query('A == A.max()')

It also returns a DataFrame instead of a Series, which is handy for some use cases.
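
The difference in return type can be checked with a small sketch. The data is hypothetical and fixed so that column A has a single maximum:

```python
import pandas as pd

# Hypothetical data, chosen so that column A has a single maximum.
df = pd.DataFrame({"A": [1.2, 0.1, 2.1], "B": [-2.0, 0.4, -0.6]})

as_frame = df.query("A == A.max()")   # one-row DataFrame
as_series = df.loc[df["A"].idxmax()]  # the same row, as a Series

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
```

If column A had ties, the query() form would return all tied rows, while the idxmax form would return only the first.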

score 4 · Accepted Answer

The direct ".argmax()" solution does not work for me.

The previous example provided by @ely

>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
          A         B         C
0  1.232853 -1.979459 -0.573626
1  0.140767  0.394940  1.068890
2  0.742023  1.343977 -0.579745
3  2.125299 -0.649328 -0.211692
4 -0.187253  1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1

returns the following message:

FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax' 
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.

So my solution is:

df['A'].values.argmax()

score 3 · Accepted Answer

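If you want the entire row instead of just the id, you can use df.nlargest and pass in how many "top" rows you want; you can also pass in the column or columns to rank by.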

df.nlargest(2,['A'])

will give you the rows corresponding to the top 2 values of A.

Use df.nsmallest for min values.
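
As a brief sketch of how nlargest behaves when the maximum is tied, the hypothetical frame below has two rows sharing the top value of A. The `keep="all"` argument is an nlargest option that is not mentioned in the answer; it asks pandas to return every tied row, even if that means more than n rows:

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 tie for the maximum of A.
df = pd.DataFrame({"A": [3, 9, 9, 1], "B": list("wxyz")})

top1 = df.nlargest(1, "A")                   # only the first row with the max
top1_ties = df.nlargest(1, "A", keep="all")  # every row tied for the max

print(top1)
print(top1_ties)
```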

score 2 · Accepted Answer

mx.iloc[0].idxmax()

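This line of code gives the column label of the maximum value in one row of a DataFrame; here mx is the DataFrame and iloc[0] selects the row at position 0.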
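
A small sketch of this row-wise use, with a made-up `mx`. The `idxmax(axis=1)` call is an addition here; it applies the same idea to every row at once:

```python
import pandas as pd

# Hypothetical frame standing in for mx.
mx = pd.DataFrame({"A": [1, 8], "B": [5, 2], "C": [3, 4]})

first_row_col = mx.iloc[0].idxmax()  # column label of the max in row 0
per_row = mx.idxmax(axis=1)          # column label of the max for each row

print(first_row_col)     # B
print(per_row.tolist())  # ['B', 'A']
```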

score 1 · Accepted Answer

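The idxmax of a DataFrame returns the label index of the row with the maximal value, and the behavior of argmax depends on the version of pandas (right now it returns a warning). If you want to use the positional index, you can do the following:

max_row = df['A'].values.argmax()

or

import numpy as np
max_row = np.argmax(df['A'].values)

Note that if you use np.argmax(df['A']), it behaves the same as df['A'].argmax().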

score 1 · Accepted Answer

Consider this DataFrame:

[In]: df = pd.DataFrame(np.random.randn(4,3),columns=['A','B','C'])
[Out]:
          A         B         C
0 -0.253233  0.226313  1.223688
1  0.472606  1.017674  1.520032
2  1.454875  1.066637  0.381890
3 -0.054181  0.234305 -0.557915

Assuming one wants to know the rows where column 'C' is at its max, the following will do the work:

[In]: df[df['C']==df['C'].max()]
[Out]:
          A         B         C
1  0.472606  1.017674  1.520032

score 0 · Accepted Answer

If there are ties in the maximum values, then idxmax returns the index of only the first max value. For example, in the following DataFrame:

idxmax returns

A    0
B    3
C    0
dtype: int64

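Now, if we want all the indices corresponding to the max values, we can use max + eq to create a boolean DataFrame, then use it on df.index to filter out the indices: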

out = df.eq(df.max()).apply(lambda x: df.index[x].tolist())

Output:

A       [0, 4]
B          [3]
C    [0, 1, 3]
dtype: object
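
The DataFrame this answer refers to is not shown, so the sketch below uses hypothetical data built so that the ties match the output shown above. It also swaps the apply call for a plain dict comprehension, which is an alternative way to collect the same labels rather than the answer's own method:

```python
import pandas as pd

# Hypothetical data, built so that the ties match the output shown above.
df = pd.DataFrame({"A": [5, 1, 2, 3, 5],
                   "B": [1, 2, 3, 9, 4],
                   "C": [7, 7, 1, 7, 2]})

first_only = df.idxmax()  # first max per column: A -> 0, B -> 3, C -> 0

# Collect, per column, every index label where the column hits its maximum.
all_max = {col: df.index[df[col] == df[col].max()].tolist() for col in df}
print(all_max)  # {'A': [0, 4], 'B': [3], 'C': [0, 1, 3]}
```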

score 0 · Accepted Answer

Use:

data.loc[data['A'].idxmax()]

data['A'].idxmax() finds the row label of the max value; data.loc[] returns the row with that label.
