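I was interested in performance when adding a large number of rows to a DataFrame, so I tried the four most popular methods and measured their speed.
- Using .append (NPE's answer)
- Using .loc (Fred's answer)
- Using .loc with preallocation (FooBar's answer)
- Using a dict and creating the DataFrame at the end (ShikharDua's answer)
Runtime results (in seconds):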
| Method | 1000 rows | 5000 rows | 10 000 rows |
|---|---|---|---|
| .append | 0.69 | 3.39 | 6.78 |
| .loc without prealloc | 0.74 | 3.90 | 8.35 |
| .loc with prealloc | 0.24 | 2.58 | 8.70 |
| dict | 0.012 | 0.046 | 0.084 |
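The dict approach is faster than the others by roughly one to two orders of magnitude, so that is the one I use myself.
Code: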
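Stripped of the timing code, the dict approach might look like the minimal sketch below. The helper name `build_frame_from_rows`, the `COLUMNS` constant and the seeded random generator are assumptions made for illustration; they are not part of the benchmark.

```python
import numpy as np
import pandas as pd

COLUMNS = ['A', 'B', 'C', 'D', 'E']  # same column names as in the benchmark


def build_frame_from_rows(num_rows, seed=0):
    """Hypothetical helper: gather rows as dicts, then build the DataFrame once."""
    rng = np.random.default_rng(seed)
    row_list = []
    for _ in range(num_rows):
        # Appending to a plain Python list; no DataFrame is touched here.
        row_list.append({col: int(rng.integers(100)) for col in COLUMNS})
    # A single constructor call at the very end.
    return pd.DataFrame(row_list, columns=COLUMNS)


df = build_frame_from_rows(1000)
print(df.shape)  # (1000, 5)
```

The point of the pattern is that the DataFrame is constructed exactly once, instead of being grown row by row.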
```python
import time

import numpy as np
import pandas as pd

numOfRows = 1000

# append
# Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0,
# so this section only runs on pandas < 2.0.
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows-4):
    df1 = df1.append(dict((a, np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)

# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range(1, numOfRows):
    # Labels 1..4 overwrite existing rows; labels >= 5 enlarge the frame.
    df2.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)

# .loc with prealloc
# The empty frame is allocated before the timer starts.
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'])
startTime = time.perf_counter()
for i in range(1, numOfRows):
    # The loop starts at 1, so row 0 is left as NaN.
    df3.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)

# dict
startTime = time.perf_counter()
row_list = []
for i in range(0, 5):
    row_list.append(dict((a, np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range(1, numOfRows-4):
    dict1 = dict((a, np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)
# Build the DataFrame once, from the full list of row dicts.
df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)
```
PS: I'm sure my implementation isn't perfect, and there is probably room for optimization.
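For readers on pandas 2.0 or later, where `DataFrame.append` is no longer available, the row-by-row pattern from the first method can be approximated with `pd.concat`. This is only a sketch: the helper name `append_row_with_concat` is hypothetical, and I have not timed this variant. Since each call still builds a new frame, I would not expect it to compete with the dict approach.

```python
import numpy as np
import pandas as pd

COLUMNS = ['A', 'B', 'C', 'D', 'E']


def append_row_with_concat(df, row):
    """Hypothetical stand-in for df.append(row, ignore_index=True)."""
    # Wrap the dict in a one-row DataFrame and concatenate it below df.
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(100, size=(5, 5)), columns=COLUMNS)
for _ in range(10):
    row = {col: int(rng.integers(100)) for col in COLUMNS}
    df = append_row_with_concat(df, row)
print(df.shape)  # (15, 5)
```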