python - Custom sorting of a pandas DataFrame

Question

I have a (very large) table stored as a pandas.DataFrame. It contains word counts from texts; the index is the word list:

             one.txt  third.txt  two.txt
a               1          1        0
i               0          0        1
is              1          1        1
no              0          0        1
not             0          1        0
really          1          0        0
sentence        1          1        1
short           2          0        0
think           0          0        1

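I want to sort the word list by each word's frequency across all texts. I can easily create a Series containing the summed frequency of each word (with the words as the index). But how can I sort the DataFrame by this Series?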

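A simple approach would be to add the Series to the DataFrame as a column, sort by it, and then drop it. For performance reasons I would like to avoid that.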
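For comparison, the helper-column approach just described might look roughly like the sketch below. The column name _total and the three-word sample are made up for illustration; they are not part of the original data.

```python
import pandas as pd

# Tiny made-up sample with distinct row totals (a=1, i=3, short=2)
df = pd.DataFrame(
    {"one.txt": [1, 0, 2], "two.txt": [0, 3, 0]},
    index=["a", "i", "short"],
)

# Add a temporary column holding each word's total count
# ("_total" is a hypothetical name chosen for this sketch)
df["_total"] = df.sum(axis=1)

# Sort by the temporary column, then drop it again
df = df.sort_values("_total", ascending=False).drop(columns="_total")

print(df)
```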

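Two other approaches are described here, but one copies the DataFrame, which is a problem because of its size, and the other creates a new index, whereas I still need the words themselves for later processing.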

score 2 · Accepted Answer

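You can compute the frequencies and use the sort_values method to find the desired index order. Then use df.loc[order.index] to reorder the original DataFrame (old pandas releases spelled this Series.sort(inplace=False); that method has since been removed):

order = df.sum(axis=1).sort_values()
result = df.loc[order.index]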

For example,

import pandas as pd

df = pd.DataFrame({
    'one.txt': [1, 0, 1, 0, 0, 1, 1, 2, 0],
    'third.txt': [1, 0, 1, 0, 1, 0, 1, 0, 0],
    'two.txt': [0, 1, 1, 1, 0, 0, 1, 0, 1]}, 
    index=['a', 'i', 'is', 'no', 'not', 'really', 'sentence', 'short', 'think'])

order = df.sum(axis=1).sort_values(ascending=False)
print(df.loc[order.index])

yields (rows with equal totals may come out in a different order depending on the pandas version)

          one.txt  third.txt  two.txt
sentence        1          1        1
is              1          1        1
short           2          0        0
a               1          1        0
think           0          0        1
really          1          0        0
not             0          1        0
no              0          0        1
i               0          0        1
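
If this reordering is needed in several places, the same idea can be wrapped in a small helper. This is only a sketch: the function name sort_by_row_total is hypothetical, and kind="mergesort" is an added assumption, chosen because mergesort is stable and so handles words with equal totals reproducibly.

```python
import pandas as pd


def sort_by_row_total(df, ascending=False):
    """Return df with rows reordered by their row sums (hypothetical helper)."""
    totals = df.sum(axis=1)
    # mergesort is a stable algorithm, so tied words are ordered reproducibly
    order = totals.sort_values(ascending=ascending, kind="mergesort")
    # Label-based lookup: the words stay as the index, no helper column is added
    return df.loc[order.index]


df = pd.DataFrame({
    'one.txt': [1, 0, 1, 0, 0, 1, 1, 2, 0],
    'third.txt': [1, 0, 1, 0, 1, 0, 1, 0, 0],
    'two.txt': [0, 1, 1, 1, 0, 0, 1, 0, 1]},
    index=['a', 'i', 'is', 'no', 'not', 'really', 'sentence', 'short', 'think'])

result = sort_by_row_total(df)
print(result.sum(axis=1))
```

Note that df.loc[...] returns a new, reordered DataFrame; the original df is left untouched.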
