python - How to drop a list of rows from a Pandas DataFrame?

Question

I have a DataFrame df:

>>> df
                  sales  discount  net_sales    cogs
STK_ID RPT_Date                                     
600141 20060331   2.709       NaN      2.709   2.245
       20060630   6.590       NaN      6.590   5.291
       20060930  10.103       NaN     10.103   7.981
       20061231  15.915       NaN     15.915  12.686
       20070331   3.196       NaN      3.196   2.710
       20070630   7.907       NaN      7.907   6.459

I then want to drop the rows at certain positions given in a list, say [1,2,4], which should leave:

                  sales  discount  net_sales    cogs
STK_ID RPT_Date                                     
600141 20060331   2.709       NaN      2.709   2.245
       20061231  15.915       NaN     15.915  12.686
       20070630   7.907       NaN      7.907   6.459

How can I do this, and which function does it?

score 477 · Accepted Answer

Use DataFrame.drop and pass it a Series of index labels:

In [65]: df
Out[65]: 
       one  two
one      1    4
two      2    3
three    3    2
four     4    1


In [66]: df.drop(df.index[[1,3]])
Out[66]: 
       one  two
one      1    4
three    3    2

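Applied to the DataFrame from the question, the same call might look like the sketch below. The frame is rebuilt by hand here, and the names `positions` and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame with a (STK_ID, RPT_Date) MultiIndex.
index = pd.MultiIndex.from_product(
    [[600141], [20060331, 20060630, 20060930, 20061231, 20070331, 20070630]],
    names=["STK_ID", "RPT_Date"],
)
df = pd.DataFrame(
    {
        "sales": [2.709, 6.590, 10.103, 15.915, 3.196, 7.907],
        "discount": [np.nan] * 6,
        "net_sales": [2.709, 6.590, 10.103, 15.915, 3.196, 7.907],
        "cogs": [2.245, 5.291, 7.981, 12.686, 2.710, 6.459],
    },
    index=index,
)

# Positions to remove, as in the question.
positions = [1, 2, 4]

# df.index[positions] turns positions into labels, which drop expects.
result = df.drop(df.index[positions])
print(result)
```

Note that drop works on labels, which is why the positions are first translated through df.index.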
score 144 · Accepted Answer

Note that it may be important to use the inplace argument when you want the drop to modify the DataFrame in place.

df.drop(df.index[[1,3]], inplace=True)

Because the code in your original question does not assign the returned result to anything, this form should be used. http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.drop.html
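A small sketch of the difference, using a made-up frame shaped like the one in the first answer; `smaller` and `returned` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2, 3, 4], "two": [4, 3, 2, 1]},
    index=["one", "two", "three", "four"],
)

# Without inplace, drop returns a new frame and leaves df untouched.
smaller = df.drop(df.index[[1, 3]])
rows_before = len(df)

# With inplace=True, df itself is modified and the call returns None.
returned = df.drop(df.index[[1, 3]], inplace=True)
```

Assigning the result back (df = df.drop(...)) has the same visible effect as inplace=True.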

score 69 · Accepted Answer

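If the DataFrame is huge and the number of rows to drop is large as well, then a simple drop by index, df.drop(df.index[]), takes too much time.

In my case, I have a multi-indexed DataFrame of floats with 100M rows x 3 cols, and I need to remove 10k rows from it. The fastest method I found is, quite counterintuitively, to take the remaining rows.

Let indexes_to_drop be an array of positional indexes to drop ([1, 2, 4] in the question).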

# sorted() preserves the original row order; iterating a set does not guarantee it
indexes_to_keep = sorted(set(range(df.shape[0])) - set(indexes_to_drop))
df_sliced = df.take(indexes_to_keep)

In my case this took 20.5s, while the simple df.drop took 5min 27s and consumed a lot of memory. The resulting DataFrame is the same.
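One possible variant of the same idea, not taken from the answer itself, computes the positions to keep with np.setdiff1d, which returns them already sorted. The tiny frame below is made up for illustration and says nothing about timing.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the answer's real data had 100M rows.
df = pd.DataFrame({"value": np.arange(10.0)})
indexes_to_drop = [1, 2, 4]

# setdiff1d returns the sorted positions that are not in indexes_to_drop.
indexes_to_keep = np.setdiff1d(np.arange(len(df)), indexes_to_drop)

# take selects rows by position, so the index labels do not matter here.
df_sliced = df.take(indexes_to_keep)
```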

score 48 · Accepted Answer

You can also pass the label itself (instead of a Series of index labels) to DataFrame.drop:

In[17]: df
Out[17]: 
            a         b         c         d         e
one  0.456558 -2.536432  0.216279 -1.305855 -0.121635
two -1.015127 -0.445133  1.867681  2.179392  0.518801

In[18]: df.drop('one')
Out[18]: 
            a         b         c         d         e
two -1.015127 -0.445133  1.867681  2.179392  0.518801

Which is equivalent to:

In[19]: df.drop(df.index[[0]])
Out[19]: 
            a         b         c         d         e
two -1.015127 -0.445133  1.867681  2.179392  0.518801

score 42 · Accepted Answer

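I solved this in a simpler way, in just 2 steps.

1. Make a DataFrame containing the unwanted rows/data.
2. Use the index of this unwanted DataFrame to drop the rows from the original DataFrame.

Example:
Suppose you have a DataFrame df with an integer column 'Age'. Now say you want to drop all rows where 'Age' is negative.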

df_age_negative = df[ df['Age'] < 0 ] # Step 1
df = df.drop(df_age_negative.index, axis=0) # Step 2

Hope this is simpler and helps you.

score 17 · Accepted Answer

If I want to drop a row that has, say, index x, I would do the following:

df = df[df.index != x]

If I want to drop multiple indices (say they are in the list unwanted_indices), I would do this:

desired_indices = [i for i in range(len(df.index)) if i not in unwanted_indices]
desired_df = df.iloc[desired_indices]

score 11 · Accepted Answer

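Here is a rather specific example I would like to show. Say you have many duplicate entries in some of your rows. If you have string entries, you can easily use string methods to find all the indexes to drop.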

ind_drop = df[df['column_of_strings'].apply(lambda x: x.startswith('Keyword'))].index

And now drop those rows using their indexes:

new_df = df.drop(ind_drop)
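As a possible alternative to apply with a lambda, the vectorized .str accessor can build the same boolean condition. This is only a sketch: the sample data is invented, and passing na=False to treat missing values as non-matches is an assumption about the desired behavior.

```python
import pandas as pd

df = pd.DataFrame(
    {"column_of_strings": ["Keyword A", "other", "Keyword B", "plain"]}
)

# True for rows whose string starts with "Keyword".
starts = df["column_of_strings"].str.startswith("Keyword", na=False)

# Either drop by the matching index labels...
ind_drop = df[starts].index
new_df = df.drop(ind_drop)

# ...or keep the complement of the mask directly.
new_df_alt = df[~starts]
```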

score 4 · Accepted Answer

Use only the index argument to drop a row:

df.drop(index = 2, inplace = True)

For multiple rows:

df.drop(index=[1,3], inplace = True)

score 3 · Accepted Answer

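In a comment on @theodros-zelleke's answer, @j-jones asked what to do if the index is not unique. I had to deal with such a situation. What I did was rename the duplicates in the index before calling drop(), like this: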

dropped_indexes = <determine-indexes-to-drop>
df.index = rename_duplicates(df.index)
df.drop(df.index[dropped_indexes], inplace=True)

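where rename_duplicates() is a function I defined that goes through the elements of the index and renames the duplicates. I used the same renaming pattern that pd.read_csv() uses on columns, i.e. "%s.%d" % (name, count), where name is the name of the row and count is how many times it has occurred previously.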
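The answer does not show rename_duplicates(), so the version below is a hypothetical sketch of what such a helper might look like. It assumes the renamed labels (for example "a.1") do not already exist in the index, and the sample frame is invented.

```python
import pandas as pd

def rename_duplicates(index):
    """Hypothetical helper: suffix repeated labels as 'name.count'."""
    seen = {}
    new_labels = []
    for name in index:
        count = seen.get(name, 0)
        # The first occurrence keeps its name; later ones get ".1", ".2", ...
        new_labels.append(name if count == 0 else "%s.%d" % (name, count))
        seen[name] = count + 1
    return pd.Index(new_labels)

df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "a", "a"])
dropped_indexes = [2]  # positions of the rows to drop

df.index = rename_duplicates(df.index)
renamed_labels = df.index.tolist()
df.drop(df.index[dropped_indexes], inplace=True)
```

Without the renaming step, dropping the label "a" would remove all three rows that share it.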

score 3 · Accepted Answer

As explained above, determining the index from a boolean condition, e.g.

df[df['column'].isin(values)].index

can be more memory intensive than determining the index with this method:

pd.Index(np.where(df['column'].isin(values))[0])

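applied like so (np.where returns positions, so this assumes the default integer index, where positions and labels coincide):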

df.drop(pd.Index(np.where(df['column'].isin(values))[0]), inplace = True)

This method is useful when dealing with large DataFrames and limited memory.
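When the index is not the default 0..n-1 range, the positions from np.where need to be translated into labels before they are passed to drop. A minimal sketch with invented data:

```python
import numpy as np
import pandas as pd

# Labels deliberately differ from positions.
df = pd.DataFrame({"column": ["x", "y", "z", "x"]}, index=[10, 20, 30, 40])
values = ["x"]

# np.where yields 0-based positions, not index labels.
positions = np.where(df["column"].isin(values))[0]

# Translate positions to labels before calling drop.
result = df.drop(df.index[positions])
```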

score 2 · Accepted Answer

Look at the following DataFrame df

df

   column1  column2  column3
0        1       11       21
1        2       12       22
2        3       13       23
3        4       14       24
4        5       15       25
5        6       16       26
6        7       17       27
7        8       18       28
8        9       19       29
9       10       20       30

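Let's drop all the rows that have an odd number in column1.

Create a list of all the elements in column1 and keep only those that are even (the elements that you don't want to drop):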

keep_elements = [x for x in df.column1 if x%2==0]

All rows whose column1 value is in [2, 4, 6, 8, 10] will be retained.

df.set_index('column1',inplace = True)
df.drop(df.index.difference(keep_elements),axis=0,inplace=True)
df.reset_index(inplace=True)

We make column1 the index and drop all the rows that are not required, then reset the index back. The resulting df:

   column1  column2  column3
0        2       12       22
1        4       14       24
2        6       16       26
3        8       18       28
4       10       20       30
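For this particular task, a boolean mask is a possible shortcut that avoids moving column1 into the index and back. This is a sketch of an alternative, not the method the answer uses; `even_rows` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "column1": range(1, 11),
        "column2": range(11, 21),
        "column3": range(21, 31),
    }
)

# Keep rows whose column1 is even, then renumber the index from 0.
even_rows = df[df["column1"] % 2 == 0].reset_index(drop=True)
```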

score 2 · Accepted Answer

To drop the rows with indices 1, 2 and 4, you can use:

df[~df.index.isin([1, 2, 4])]

The tilde operator ~ negates the result of the isin method. Another option is to drop the labels from the index:

df.loc[df.index.drop([1, 2, 4])]
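A short sketch showing that both forms give the same result on a made-up frame with a default integer index; `kept` and `same` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"column1": [0, 10, 20, 30, 40, 50]})

# Boolean mask: True for labels that are NOT in the list.
kept = df[~df.index.isin([1, 2, 4])]

# Index.drop removes the labels, and .loc selects what remains.
same = df.loc[df.index.drop([1, 2, 4])]
```

Both expressions work on index labels, which happen to match the positions here because the index is the default one.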

score 0 · Accepted Answer

Consider an example DataFrame

df =     
index    column1
0           00
1           10
2           20
3           30

We want to drop the 2nd and 3rd rows (positions 1 and 2).

Approach 1:

df = df.drop(df.index[[1, 2]])
 or 
df.drop(df.index[[1, 2]], inplace=True)
print(df)

df =     
index    column1
0           00
3           30

 #This approach removes the rows as we wanted, but the remaining index labels are not renumbered

Approach 2:

# DataFrame.drop has no ignore_index parameter; reset_index(drop=True) renumbers the rows
df = df.drop(df.index[[1, 2]]).reset_index(drop=True)
print(df)
df =     
index    column1
0           00
1           30
#This approach removes the rows as we wanted and resets the index.

score 0 · Accepted Answer

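As Dennis Golomazov's answer suggests, instead of using drop to remove rows you can select the rows to keep. Say you have a list of row positions to drop, called indices_to_drop. You can convert it to a mask as follows: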

mask = np.ones(len(df), bool)
mask[indices_to_drop] = False

You can use this mask directly for indexing:

df_new = df.iloc[mask]

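The nice thing about this method is that mask can come from any source: it can be a condition involving many columns, or something else.

The really nice thing is that you do not need the index of the original DataFrame at all, so it does not matter whether the index is unique.

The disadvantage, of course, is that you cannot do the drop in place with this method.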
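A complete sketch of the mask approach on an invented frame whose index contains duplicate labels, to illustrate that only positions are used:

```python
import numpy as np
import pandas as pd

# The index has repeated labels on purpose.
df = pd.DataFrame({"v": [0, 1, 2, 3, 4]}, index=["a", "a", "b", "b", "c"])
indices_to_drop = [1, 2, 4]

# Start with "keep everything", then switch off the positions to drop.
mask = np.ones(len(df), dtype=bool)
mask[indices_to_drop] = False

# iloc accepts a boolean array and selects rows purely by position.
df_new = df.iloc[mask]
```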
