python - 通过一次追加一行来创建 Pandas 数据框

Question

我知道 Pandas 旨在加载一个完全填充的DataFrame，但我需要创建一个空的 DataFrame 然后逐一添加行。做这个的最好方式是什么？

我成功地创建了一个空的DataFrame：

res = DataFrame(columns=('lib', 'qty1', 'qty2'))

然后我可以添加一个新行并用以下内容填充一个字段：

res = res.set_value(len(res), 'qty1', 10.0)

它有效，但看起来很奇怪：-/（添加字符串值失败。）

如何向我的 DataFrame 添加新行（具有不同的列类型）？

score 820 · Accepted Answer

您可以使用df.loc[i]，其中带有索引的i行将是您在数据框中指定的行。

>>> import pandas as pd
>>> from numpy.random import randint

>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
>>>     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))

>>> df
     lib qty1 qty2
0  name0    3    3
1  name1    2    4
2  name2    2    8
3  name3    2    1
4  name4    9    6

score 687 · Accepted Answer

如果您可以预先获取数据框的所有数据，则有一种比附加到数据框更快的方法：

创建一个字典列表，其中每个字典对应一个输入数据行。
从此列表创建一个数据框。

我有一个类似的任务，逐行附加到数据帧需要 30 分钟，并从几秒钟内完成的字典列表创建一个数据帧。

rows_list = []
for row in input_rows:

        dict1 = {}
        # get input row in dictionary format
        # key = col_name
        dict1.update(blah..) 

        rows_list.append(dict1)

df = pd.DataFrame(rows_list)

score 348 · Accepted Answer

在向数据框添加大量行的情况下，我对性能感兴趣。所以我尝试了四种最流行的方法并检查了它们的速度。

表现

使用 .append （NPE 的回答）
使用 .loc（弗雷德的回答）
使用 .loc 进行预分配（FooBar 的回答）
最后使用 dict 并创建 DataFrame（ShikharDua 的回答）

运行时结果（以秒为单位）：

方法	1000 行	5000 行	10 000 行
。附加	0.69	3.39	6.78
.loc 没有 prealloc	0.74	3.90	8.35
.loc 与 prealloc	0.24	2.58	8.70
听写	0.012	0.046	0.084

所以我自己通过字典使用加法。

代码：

import pandas as pd
import numpy as np
import time

del df1, df2, df3, df4
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
    df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)

# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
    df2.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)

# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
    df3.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)

# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
    row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
    dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)

df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)

PS：我相信我的实现并不完美，也许可以做一些优化。

score 327 · Accepted Answer

您可以使用pandas.concat()或DataFrame.append()。有关详细信息和示例，请参阅合并、连接和连接。

score 208 · Accepted Answer

NEVER grow a DataFrame!

Yes, people have already explained that you should NEVER grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?

Here are the most important reasons, taken from my post here.

It is always cheaper/faster to append to a list and create a DataFrame in one go.
Lists take up less memory and are a much lighter data structure to work with, append, and remove.
dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make them object, which is bad.
An index is automatically created for you, instead of you having to take care to assign the correct index to the row you are appending.

This is The Right Way™ to accumulate your data

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

These options are horrible

append or concat inside a loop

append and concat aren't inherently bad in isolation. The problem starts when you iteratively call them inside a loop - this results in quadratic memory usage.

# Creates empty DataFrame and appends
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': i, 'B': b, 'C': c}, ignore_index=True)  
    # This is equally bad:
    # df = pd.concat(
    #       [df, pd.Series({'A': i, 'B': b, 'C': c})], 
    #       ignore_index=True)

Empty DataFrame of NaNs

Never create a DataFrame of NaNs as the columns are initialized with object (slow, un-vectorizable dtype).

# Creates DataFrame of NaNs and overwrites values.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]

The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.

Benchmarking code for reference.

It's posts like this that remind me why I'm a part of this community. People understand the importance of teaching folks getting the right answer with the right code, not the right answer with wrong code. Now you might argue that it is not an issue to use loc or append if you're only adding a single row to your DataFrame. However, people often look to this question to add more than just one row - often the requirement is to iteratively add a row inside a loop using data that comes from a function (see related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.

score 124 · Accepted Answer

如果您事先知道条目数，则应通过提供索引来预先分配空间（以不同答案中的数据示例为例）：

import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )

# now fill it up row by row
for x in np.arange(0, numberOfRows):
    #loc or iloc both work here since the index is natural numbers
    df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
Out[23]: 
   lib  qty1  qty2
0   -1    -1    -1
1    0     0     0
2   -1     0    -1
3    0    -1     0
4   -1     0     0

速度比较

In[30]: %timeit tryThis() # function wrapper for this answer
In[31]: %timeit tryOther() # function wrapper without index (see, for example, @fred)
1000 loops, best of 3: 1.23 ms per loop
100 loops, best of 3: 2.31 ms per loop

并且 - 从评论中 - 大小为 6000，速度差异变得更大：

增加数组的大小（12）和行数（500）使速度差异更加显着：313ms vs 2.29s

score 92 · Accepted Answer

mycolumns = ['A', 'B']
df = pd.DataFrame(columns=mycolumns)
rows = [[1,2],[3,4],[5,6]]
for row in rows:
    df.loc[len(df)] = row

score 79 · Accepted Answer

ignore_index您可以使用该选项将单行附加为字典。

>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
  Animal Color
0    cow  blue
1  horse   red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
  Animal  Color
0    cow   blue
1  horse    red
2  mouse  black

score 76 · Accepted Answer

为了有效地追加，请参阅如何向 pandas 数据框添加额外的行和设置放大。

loc/ix通过不存在的键索引数据添加行。例如：

In [1]: se = pd.Series([1,2,3])

In [2]: se
Out[2]:
0    1
1    2
2    3
dtype: int64

In [3]: se[5] = 5.

In [4]: se
Out[4]:
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

或者：

In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
   .....:                 columns=['A','B'])
   .....:

In [2]: dfi
Out[2]:
   A  B
0  0  1
1  2  3
2  4  5

In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']

In [4]: dfi
Out[4]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
In [5]: dfi.loc[3] = 5

In [6]: dfi
Out[6]:
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5

score 48 · Accepted Answer

为了 Pythonic 方式：

res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())

   lib  qty1  qty2
0  NaN  10.0   NaN

score 34 · Accepted Answer

您还可以建立列表列表并将其转换为数据框 -

import pandas as pd

columns = ['i','double','square']
rows = []

for i in range(6):
    row = [i, i*2, i*i]
    rows.append(row)

df = pd.DataFrame(rows, columns=columns)

给予

score 18 · Accepted Answer

我想出了一个简单而好方法：

>>> df
     A  B  C
one  1  2  3
>>> df.loc["two"] = [4,5,6]
>>> df
     A  B  C
one  1  2  3
two  4  5  6

请注意评论中提到的性能警告。

score 15 · Accepted Answer

这不是对 OP 问题的答案，而是一个玩具示例来说明ShikharDua 的答案，我发现它非常有用。

虽然这个片段是微不足道的，但在实际数据中，我有 1,000 行和许多列，我希望能够按不同的列进行分组，然后对多个目标列执行下面的统计信息。因此，有一种可靠的方法来一次构建一行数据框是非常方便的。谢谢ShikharDua！

import pandas as pd

BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
                          'Territory'  : ['West','East','South','West','East','South'],
                          'Product'  : ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData

columns = ['Customer','Num Unique Products', 'List Unique Products']

rows_list=[]
for name, group in BaseData.groupby('Customer'):
    RecordtoAdd={} #initialise an empty dict
    RecordtoAdd.update({'Customer' : name}) #
    RecordtoAdd.update({'Num Unique Products' : len(pd.unique(group['Product']))})
    RecordtoAdd.update({'List Unique Products' : pd.unique(group['Product'])})

    rows_list.append(RecordtoAdd)

AnalysedData = pd.DataFrame(rows_list)

print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)

score 13 · Accepted Answer

13

If you always want to add a new row at the end, use this:

df.loc[len(df)] = ['name5', 9, 0]

于 2021-03-06T13:53:27.903 回答

score 11 · Accepted Answer

You can use a generator object to create a Dataframe, which will be more memory efficient over the list.

num = 10

# Generator function to generate generator object
def numgen_func(num):
    for i in range(num):
        yield ('name_{}'.format(i), (i*i), (i*i*i))

# Generator expression to generate generator object (Only once data get populated, can not be re used)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num) )

df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))

To add raw to existing DataFrame you can use append method.

df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400  }])

score 9 · Accepted Answer

创建一个新记录（数据框）并添加到old_data_frame。

传递值列表和相应的列名以创建new_record (data_frame)：

new_record = pd.DataFrame([[0, 'abcd', 0, 1, 123]], columns=['a', 'b', 'c', 'd', 'e'])

old_data_frame = pd.concat([old_data_frame, new_record])

score 8 · Accepted Answer

这是在 Pandas 中添加/追加一行的方法DataFrame：

def add_row(df, row):
    df.loc[-1] = row
    df.index = df.index + 1
    return df.sort_index()

add_row(df, [1,2,3])

它可用于在空的或填充的 Pandas DataFrame 中插入/追加一行。

score 8 · Accepted Answer

Instead of a list of dictionaries as in ShikharDua's answer, we can also represent our table as a dictionary of lists, where each list stores one column in row-order, given we know our columns beforehand. At the end we construct our DataFrame once.

For c columns and n rows, this uses one dictionary and c lists, versus one list and n dictionaries. The list-of-dictionaries method has each dictionary storing all keys and requires creating a new dictionary for every row. Here we only append to lists, which is constant time and theoretically very fast.

# Current data
data = {"Animal":["cow", "horse"], "Color":["blue", "red"]}

# Adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")

# At the end, construct our DataFrame
df = pd.DataFrame(data)
#   Animal  Color
# 0    cow   blue
# 1  horse    red
# 2  mouse  black

score 5 · Accepted Answer

If you want to add a row at the end, append it as a list:

valuestoappend = [va1, val2, val3]
res = res.append(pd.Series(valuestoappend, index = ['lib', 'qty1', 'qty2']), ignore_index = True)

score 4 · Accepted Answer

另一种方法（可能不是很好）：

# add a row
def add_row(df, row):
    colnames = list(df.columns)
    ncol = len(colnames)
    assert ncol == len(row), "Length of row must be the same as width of DataFrame: %s" % row
    return df.append(pd.DataFrame([row], columns=colnames))

您还可以像这样增强 DataFrame 类：

import pandas as pd
def add_row(self, row):
    self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row

score 4 · Accepted Answer

All you need is loc[df.shape[0]] or loc[len(df)]

# Assuming your df has 4 columns (str, int, str, bool)
df.loc[df.shape[0]] = ['col1Value', 100, 'col3Value', False]

or

df.loc[len(df)] = ['col1Value', 100, 'col3Value', False]

score 4 · Accepted Answer

initial_data = {'lib': np.array([1,2,3,4]), 'qty1': [1,2,3,4], 'qty2': [1,2,3,4]}

df = pd.DataFrame(initial_data)

df

lib    qty1    qty2
0    1    1    1
1    2    2    2
2    3    3    3
3    4    4    4

val_1 = [10]
val_2 = [14]
val_3 = [20]

df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))

lib    qty1    qty2
0    1    1    1
1    2    2    2
2    3    3    3
3    4    4    4
0    10    14    20

You can use a for loop to iterate through values or can add arrays of values.

val_1 = [10, 11, 12, 13]
val_2 = [14, 15, 16, 17]
val_3 = [20, 21, 22, 43]

df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))

lib    qty1    qty2
0    1    1    1
1    2    2    2
2    3    3    3
3    4    4    4
0    10    14    20
1    11    15    21
2    12    16    22
3    13    17    43

score 2 · Accepted Answer

让它变得简单。通过将列表作为输入，该列表将作为一行附加到数据框中：

import pandas as pd
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    res_list = list(map(int, input().split()))
    res = res.append(pd.Series(res_list, index=['lib', 'qty1', 'qty2']), ignore_index=True)

score 2 · Accepted Answer

You can concatenate two DataFrames for this. I basically came across this problem to add a new row to an existing DataFrame with a character index (not numeric).

So, I input the data for a new row in a duct() and index in a list.

new_dict = {put input for new row here}
new_list = [put your index here]

new_df = pd.DataFrame(data=new_dict, index=new_list)

df = pd.concat([existing_df, new_df])

score 1 · Accepted Answer

1

于 2019-08-22T12:39:32.367 回答

score 1 · Accepted Answer

If all data in your Dataframe has the same dtype you might use a NumPy array. You can write rows directly into the predefined array and convert it to a dataframe at the end. It seems to be even faster than converting a list of dicts.

import pandas as pd
import numpy as np
from string import ascii_uppercase

startTime = time.perf_counter()
numcols, numrows = 5, 10000
npdf = np.ones((numrows, numcols))
for row in range(numrows):
    npdf[row, 0:] = np.random.randint(0, 100, (1, numcols))
df5 = pd.DataFrame(npdf, columns=list(ascii_uppercase[:numcols]))
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df5.shape)

score 1 · Accepted Answer

If you have a data frame df and want to add a list new_list as a new row to df, you can simply do:

df.loc[len(df)] = new_list

If you want to add a new data frame new_df under data frame df, then you can use:

df.append(new_df)

score 0 · Accepted Answer

pandas.DataFrame.append

DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=False) → 'DataFrame'

Code

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)

With ignore_index set to True:

df.append(df2, ignore_index=True)

score 0 · Accepted Answer

Before going to add a row, we have to convert the dataframe to a dictionary. There you can see the keys as columns in the dataframe and the values of the columns are again stored in the dictionary, but there the key for every column is the index number in the dataframe.

That idea makes me to write the below code.

df2 = df.to_dict()
values = ["s_101", "hyderabad", 10, 20, 16, 13, 15, 12, 12, 13, 25, 26, 25, 27, "good", "bad"] # This is the total row that we are going to add
i = 0
for x in df.columns:   # Here df.columns gives us the main dictionary key
    df2[x][101] = values[i]   # Here the 101 is our index number. It is also the key of the sub dictionary
    i += 1

score 0 · Accepted Answer

This code snippet uses a list of dictionaries to update the data frame. It adds on to ShikharDua's and Mikhail_Sam's answers.

import pandas as pd
colour = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
dict1={}
feat_list=[]
for x in colour:
    for y in fruits:
#         print(x, y)
        dict1 = dict([('x',x),('y',y)])
#         print(f'dict 1 {dict1}')
        feat_list.append(dict1)
#         print(f'feat_list {feat_list}')
feat_df=pd.DataFrame(feat_list)
feat_df.to_csv('feat1.csv')

score -3 · Accepted Answer

这将负责将项目添加到空 DataFrame。问题是df.index.max() == nan对于第一个索引：

df = pd.DataFrame(columns=['timeMS', 'accelX', 'accelY', 'accelZ', 'gyroX', 'gyroY', 'gyroZ'])

df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = [x for x in range(7)]

python - 通过一次追加一行来创建 Pandas 数据框

31 回答 31

表现

NEVER grow a DataFrame!

This is The Right Way™ to accumulate your data

These options are horrible

The Proof is in the Pudding

Code

Related

Reference