python - A faster alternative to the Series.add function in pandas

Question

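I am trying to add two pandas Series together. The first Series is very large and has a MultiIndex. The index of the second Series is a small subset of the index of the first.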

    import time
    import numpy as np
    import pandas as pd
    df1 = pd.DataFrame(np.ones((1000,5000)),dtype=int).stack()
    df1 = pd.DataFrame(df1, columns = ['total'])
    df2 = pd.concat([df1.iloc[50:55],df1.iloc[2000:2005]])  # df2 is a tiny subset of df1

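Using the regular Series.add function takes about 9 seconds the first time and 2 seconds on subsequent tries (maybe because pandas optimizes how the df is stored in memory?).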

    starttime = time.time()
    df1.total.add(df2.total,fill_value=0).sum()
    print("Method 1 took %f seconds" % (time.time() - starttime))

Manually iterating over the rows takes about 2/3 as long as the first Series.add call, and about 1/100 as long as subsequent Series.add calls.

    starttime = time.time()
    result = df1.total.copy()
    for row_index, row in df2.iterrows():
        result[row_index] += row
    print("Method 2 took %f seconds" % (time.time() - starttime))

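The speed difference is particularly noticeable when (as here) the index is a MultiIndex.

Why does Series.add not work well here? Any suggestions for speeding it up? Is there a more efficient alternative to iterating over each element of the Series?

Also, how can I sort or structure the data frame to improve the performance of either method? The second time either of these methods is run, it is appreciably faster. How do I get that performance the first time? Sorting with sort_index helps only marginally.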

score 4 · Accepted Answer


You don't need a for loop:

df1.total[df2.index] += df2.total
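
The one-liner above writes through a chained lookup (`df1.total[...]`). Depending on the pandas version, chained assignment of this kind may not write back to `df1`, so here is a minimal sketch of the same idea using a single `.loc` step. The names `big`, `small` and `result` are made up for illustration, and the sketch assumes that the small index is unique and fully contained in the large one.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's data: a Series with a two-level MultiIndex.
idx = pd.MultiIndex.from_product([range(4), range(3)])
big = pd.Series(np.ones(len(idx), dtype=int), index=idx)
small = pd.concat([big.iloc[1:3], big.iloc[7:9]])  # tiny subset of big

# Touch only the rows named in small's index, in one explicit .loc step.
result = big.copy()
result.loc[small.index] = result.loc[small.index] + small

print(result.sum())  # 12 original ones plus the 4 added ones
```

Only the four selected rows are read and written; the remaining rows of `big` are not aligned or recomputed.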

answered 2013-11-08T00:45:26.517

score 3 · Accepted Answer

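As HYRY answered, in this situation it is more efficient to look only at the small part of df1 that df2's index covers. You can do this with the slightly more robust add function (which can fill NaNs):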

df1.total[df2.index] = (df1.total[df2.index]).add(df2.total, fill_value=0)

Although the syntax here is not very DRY...

To compare, here is some timeit information. We can see that add is not drastically slower, and both are an enormous improvement on your naive for loop:

In [11]: %%timeit
result = df1.total.copy()
for row_index, row in df2.iterrows():
    result[row_index] += row
100 loops, best of 3: 17.9 ms per loop

In [12]: %timeit df1.total[df2.index] = (df1.total[df2.index]).add(df2.total, fill_value=0)
1000 loops, best of 3: 325 µs per loop

In [13]: %timeit df1.total[df2.index] += df2.total
1000 loops, best of 3: 283 µs per loop

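It's an interesting question (which I may fill in later) at what relative size each approach becomes faster, but in this extreme case there is certainly a huge win...

The thing to take away from this:

If you are writing a for loop (in Python) to speed something up, you're doing it wrong! :)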

score 1 · Accepted Answer

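I think your second method may be faster in this specific case because you are iterating over the smaller dataset (a small amount of work) and then accessing only a handful of elements of the larger dataset (an efficient operation, thanks to the pandas developers).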

With the .add method, however, pandas has to look at the entirety of both indexes.
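
To make that concrete, here is a small sketch (with made-up names `big`, `small` and `full`) showing that `add` aligns the two indexes first, so its output covers every label of the large Series even when only two values change:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([range(4), range(3)])
big = pd.Series(np.ones(len(idx), dtype=int), index=idx)
small = big.iloc[[1, 7]]  # labels (0, 1) and (2, 1)

# add() aligns both indexes, then fills the labels missing from small with 0.
full = big.add(small, fill_value=0)

print(len(full) == len(big))               # output spans the whole large index
print(full.loc[(0, 1)], full.loc[(0, 0)])  # a changed value and an untouched one
```

The cost of that alignment grows with the size of the large index, not with the number of values that actually change.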

If df1 and df2 are the same length, your first method takes 54 ms, but the second takes >2 minutes (on my machine, obviously, YMMV).
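
Another way to keep the work proportional to the small Series is to translate its labels into integer positions once and then add on the underlying NumPy array. This is not from any of the answers above; it is just a sketch, it assumes the large index is unique, and the names `big`, `small`, `pos` and `summed` are hypothetical.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([range(4), range(3)])
big = pd.Series(np.ones(len(idx), dtype=int), index=idx)
small = pd.concat([big.iloc[1:3], big.iloc[7:9]])

# Translate small's labels into integer positions in big (-1 means "not found").
pos = big.index.get_indexer(small.index)
assert (pos >= 0).all(), "small's index must be a subset of big's index"

# Add on the raw NumPy values; np.add.at also accumulates repeated positions.
values = big.to_numpy().copy()
np.add.at(values, pos, small.to_numpy())
summed = pd.Series(values, index=big.index)

print(summed.sum())
```

Whether this beats the `.loc` form will depend on the sizes involved, so it is worth timing on your own data.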
