performance - Pandas performance issue with DataFrame column "rename" and "drop"

Question

Below is the line_profiler output for one function:

Wrote profile results to FM_CORE.py.lprof
Timer unit: 2.79365e-07 s

File: F:\FM_CORE.py
Function: _rpt_join at line 1068
Total time: 1.87766 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1068                                           @profile
  1069                                           def _rpt_join(dfa, dfb, join_type='inner'):
  1070                                               ''' join two dataframe together by ('STK_ID','RPT_Date') multilevel index.
  1071                                                   'join_type' can be 'inner' or 'outer'
  1072                                               '''
  1073                                           
  1074         2           56     28.0      0.0      try:    # ('STK_ID','RPT_Date') are normal column
  1075         2      2936668 1468334.0     43.7          rst = pd.merge(dfa, dfb, how=join_type, on=['STK_ID','RPT_Date'], left_index=True, right_index=True)
  1076                                               except: # ('STK_ID','RPT_Date') are index
  1077                                                   rst = pd.merge(dfa, dfb, how=join_type, left_index=True, right_index=True)
  1078                                                   
  1079                                           
  1080         2           81     40.5      0.0      try: # handle 'STK_Name
  1081         2       426472 213236.0      6.3          name_combine = pd.concat([dfa.STK_Name, dfb.STK_Name])
  1082                                                   
  1083                                                   
  1084         2       900584 450292.0     13.4          nameseries = name_combine[-Series(name_combine.index.values, name_combine.index).duplicated()]
  1085                                                   
  1086         2      1138140 569070.0     16.9          rst.STK_Name_x = nameseries
  1087         2       596768 298384.0      8.9          rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
  1088         2       722293 361146.5     10.7          rst = rst.drop(['STK_Name_y'], axis=1)
  1089                                               except:
  1090                                                   pass
  1091                                           
  1092         2           94     47.0      0.0      return rst

What surprises me are these two lines:

  1087         2       596768 298384.0      8.9          rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
  1088         2       722293 361146.5     10.7          rst = rst.drop(['STK_Name_y'], axis=1)

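Why do simple DataFrame column "rename" and "drop" operations take so much time (8.9% + 10.7%)? After all, the "merge" operation accounts for only 43.7%, and "rename"/"drop" do not look computationally intensive. How can this be improved?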
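For reference, here is a minimal sketch of the two lines in question rewritten to modify the frame in place. By default `rename` and `drop` return a new DataFrame, which in many pandas versions means copying the underlying data. Whether that copy is what the profile is measuring is an assumption, not something the output above proves. The small `rst` frame, its `sales` column, and the index values are hypothetical stand-ins for the real merged result.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame `rst` from the question.
index = pd.MultiIndex.from_tuples(
    [("000001", "2012-12-31"), ("000002", "2012-12-31")],
    names=["STK_ID", "RPT_Date"],
)
rst = pd.DataFrame(
    {
        "STK_Name_x": ["A", "B"],
        "sales": [1.0, 2.0],        # hypothetical data column
        "STK_Name_y": ["A", "B"],
    },
    index=index,
)

# Relabel the column on the existing object instead of
# rebinding rst to the new frame returned by rename().
rst.rename(columns={"STK_Name_x": "STK_Name"}, inplace=True)

# Remove the unwanted column from the existing object instead of
# rebinding rst to the new frame returned by drop().
del rst["STK_Name_y"]

result_columns = list(rst.columns)
print(result_columns)  # ['STK_Name', 'sales']
```

Whether this is actually faster depends on the pandas version and the frame size, so it would need to be profiled again on the real data.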
