python - Performance problem in Python when looping over a large table

Question

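I am using Python with several libraries (such as pandas and scipy) to prepare data so that I can start a more in-depth analysis. As part of that preparation I am, for example, creating a new column that holds the difference between two dates.
My code produces the expected result, but it is so slow that I cannot use it on a table with 80K rows. The run time is roughly 80 minutes for that table, just for this simple operation.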

The problem is definitely related to my write operation:

tableContent[6]['p_test_Duration'].iloc[x] = difference

In addition, Python emits a warning:

Complete code example for the date difference:

import time
from datetime import date, datetime

tableContent[6]['p_test_Duration'] = 0

#for x in range (0,len(tableContent[6]['p_test_Duration'])):
for x in range (0,1000):
    p_test_ZEIT_ANFANG = datetime.strptime(tableContent[6]['p_test_ZEIT_ANFANG'].iloc[x], '%Y-%m-%d %H:%M:%S')
    p_test_ZEIT_ENDE = datetime.strptime(tableContent[6]['p_test_ZEIT_ENDE'].iloc[x], '%Y-%m-%d %H:%M:%S')
    difference = p_test_ZEIT_ENDE - p_test_ZEIT_ANFANG

    tableContent[6]['p_test_Duration'].iloc[x] = difference

The correct result table:

score 4 · Accepted Answer

Get rid of the loop and apply the function to the whole Series.

ZEIT_ANFANG = tableContent[6]['p_test_ZEIT_ANFANG'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
ZEIT_ENDE = tableContent[6]['p_test_ZEIT_ENDE'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
tableContent[6]['p_test_Duration'] = ZEIT_ENDE - ZEIT_ANFANG

score 2 · Accepted Answer

You can vectorize the date conversion by using pd.to_datetime, which also avoids the unnecessary use of apply.

tableContent[6]['p_test_Duration'] = (
    pd.to_datetime(tableContent[6]['p_test_ZEIT_ENDE']) -
    pd.to_datetime(tableContent[6]['p_test_ZEIT_ANFANG'])
)

Also, you received the SettingWithCopy warning because of the chained indexing assignment:

tableContent[6]['p_test_Duration'].iloc[x] = difference

If you do it the way I suggested, you don't have to worry about that.
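To make the vectorized approach easy to try in isolation, here is a minimal sketch on a small stand-in table. The DataFrame `df` and its two sample rows are made up for illustration (the real data lives in `tableContent[6]`), and passing `format=` explicitly is an optional choice that assumes every row matches the format used in the question.

```python
import pandas as pd

# Hypothetical stand-in for tableContent[6]; the sample values are invented.
df = pd.DataFrame({
    "p_test_ZEIT_ANFANG": ["2020-01-01 08:00:00", "2020-01-02 23:30:00"],
    "p_test_ZEIT_ENDE": ["2020-01-01 09:30:00", "2020-01-03 00:15:00"],
})

# Same format string as in the question; assumes all rows follow it.
fmt = "%Y-%m-%d %H:%M:%S"

# Parse each column once as a whole, instead of row by row.
start = pd.to_datetime(df["p_test_ZEIT_ANFANG"], format=fmt)
end = pd.to_datetime(df["p_test_ZEIT_ENDE"], format=fmt)

# Subtracting two datetime Series gives a timedelta Series in one step.
df["p_test_Duration"] = end - start

# Optional: the same duration as a plain number of seconds.
df["p_test_Duration_s"] = df["p_test_Duration"].dt.total_seconds()
```

Because the new column is assigned in a single statement on the frame itself, there is no per-row write and no chained assignment involved.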

score 0 · Accepted Answer

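The other answers are good, but I would recommend avoiding chained indexing in general. The pandas documentation explicitly discourages chained indexing because it either produces unreliable results or is slow (due to multiple calls to __getitem__). Assuming your DataFrame is multi-indexed, you can replace: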

tableContent[6]['p_test_Duration'].iloc[x] = difference

with:

tableContent.loc[x, (6, 'p_test_Duration')] = difference

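You can sometimes get away with chained indexing, but why not learn the approach that is least likely to cause problems in the future?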
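As a rough illustration of the single-call assignment, the sketch below builds a tiny frame with two-level column labels and writes one cell through `.loc`. The frame `df` and its values are invented for the example, and it assumes, as this answer does, that the columns form a MultiIndex whose first level is 6.

```python
import pandas as pd

# Hypothetical frame with two-level column labels (6, <name>); values are invented.
columns = pd.MultiIndex.from_product([[6], ["p_test_Value", "p_test_Duration"]])
df = pd.DataFrame([[10, 0], [20, 0]], columns=columns)

# One .loc call selects the row label and the full column key together,
# so the write goes to df itself rather than to a temporary object.
df.loc[0, (6, "p_test_Duration")] = 90
```

If `tableContent` is instead a list or dict of ordinary DataFrames, which the question does not say, the corresponding single-call form would be `tableContent[6].loc[x, 'p_test_Duration'] = difference`, provided `x` is a row label of that frame.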
