162

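Suppose I have a dictionary with 10 key-value pairs. Each entry holds a numpy array. However, the arrays do not all have the same length.

How can I create a dataframe where each column holds a different entry?

When I try: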

pd.DataFrame(my_dict)

I get:

ValueError: arrays must all be the same length

Is there any way to overcome this? I am happy for Pandas to pad the columns of the shorter entries with NaN.

4

8 Answers

185

In Python 3.x:

import pandas as pd
import numpy as np

d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
    
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))

Out[7]: 
    A  B
0   1  1
1   2  2
2 NaN  3
3 NaN  4

In Python 2.x:

Replace d.items() with d.iteritems().

answered 2013-11-01T22:27:02.037
105

Here is a simple way to do it:

In[20]: my_dict = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
In[21]: df = pd.DataFrame.from_dict(my_dict, orient='index')
In[22]: df
Out[22]: 
   0  1   2   3
A  1  2 NaN NaN
B  1  2   3   4
In[23]: df.transpose()
Out[23]: 
    A  B
0   1  1
1   2  2
2 NaN  3
3 NaN  4
answered 2014-08-09T10:06:16.477
21

A way to tidy up the syntax, though it is still essentially the same as the other answers, is as follows:

>>> mydict = {'one': [1,2,3], 2: [4,5,6,7], 3: 8}

>>> dict_df = pd.DataFrame({ key:pd.Series(value) for key, value in mydict.items() })

>>> dict_df

   one  2    3
0  1.0  4  8.0
1  2.0  5  NaN
2  3.0  6  NaN
3  NaN  7  NaN

A similar syntax exists for lists, too:

>>> mylist = [ [1,2,3], [4,5], 6 ]

>>> list_df = pd.DataFrame([ pd.Series(value) for value in mylist ])

>>> list_df

     0    1    2
0  1.0  2.0  3.0
1  4.0  5.0  NaN
2  6.0  NaN  NaN

An alternative syntax for lists is:

>>> mylist = [ [1,2,3], [4,5], 6 ]

>>> list_df = pd.DataFrame({ i:pd.Series(value) for i, value in enumerate(mylist) })

>>> list_df

   0    1    2
0  1  4.0  6.0
1  2  5.0  NaN
2  3  NaN  NaN

You may additionally have to transpose the result and/or change the column data types (float, integer, etc.).

answered 2018-05-03T23:00:58.967
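
As a minimal sketch of the data type change mentioned above: NaN padding turns integer columns into float64, and one option (an assumption here, not something the answer prescribes) is to cast to pandas' nullable "Int64" dtype, which keeps whole numbers and marks gaps as <NA>. The names int_df and transposed are illustrative only.

```python
import pandas as pd

mydict = {'one': [1, 2, 3], 2: [4, 5, 6, 7], 3: 8}

# Wrapping each value in a Series lets pandas align on the index and pad with NaN.
dict_df = pd.DataFrame({key: pd.Series(value) for key, value in mydict.items()})

# NaN padding forces the shorter integer columns to float64. The nullable
# "Int64" dtype stores integers and represents missing entries as <NA>.
int_df = dict_df.astype('Int64')

# Transposing swaps rows and columns, if the other orientation is wanted.
transposed = int_df.T

print(int_df)
```

The cast only succeeds when every non-missing value is a whole number, so it suits padded integer data like this but not arbitrary floats.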
5

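Use pandas.DataFrame and pandas.concat

  • The following code creates a list of DataFrames with pandas.DataFrame from a dict of uneven arrays, and then concats them together, all within a list comprehension.
    • This is a way to create a DataFrame from arrays that are not equal in length.
    • For arrays of equal length, use df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})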
import pandas as pd
import numpy as np


# create the uneven arrays
mu, sigma = 200, 25
np.random.seed(365)
x1 = mu + sigma * np.random.randn(10, 1)
x2 = mu + sigma * np.random.randn(15, 1)
x3 = mu + sigma * np.random.randn(20, 1)

data = {'x1': x1, 'x2': x2, 'x3': x3}

# create the dataframe
df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in data.items()], axis=1)

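Use pandas.DataFrame and itertools.zip_longest

  • For iterables of uneven length, zip_longest fills the missing values with fillvalue.
  • The zip generator needs to be unpacked, because the DataFrame constructor won't unpack it.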
from itertools import zip_longest

# zip all the values together
zl = list(zip_longest(*data.values()))

# create dataframe
df = pd.DataFrame(zl, columns=data.keys())
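
To make the fillvalue behavior concrete, here is a small sketch on 1-D inputs. The dict uneven and the choice of np.nan as the fill value are assumptions for illustration; zip_longest pads with None when fillvalue is not given.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

# Hypothetical 1-D inputs of unequal length.
uneven = {'a': [1, 2], 'b': [1, 2, 3, 4]}

# zip_longest pads the shorter iterables with fillvalue (None by default).
# list(...) materialises the rows before they are handed to DataFrame.
rows = list(zip_longest(*uneven.values(), fillvalue=np.nan))

df_padded = pd.DataFrame(rows, columns=list(uneven.keys()))
print(df_padded)
```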

Plot

df.plot(marker='o', figsize=[10, 5])


DataFrame

           x1         x2         x3
0   232.06900  235.92577  173.19476
1   176.94349  209.26802  186.09590
2   194.18474  168.36006  194.36712
3   196.55705  238.79899  218.33316
4   249.25695  167.91326  191.62559
5   215.25377  214.85430  230.95119
6   232.68784  240.30358  196.72593
7   212.43409  201.15896  187.96484
8   188.97014  187.59007  164.78436
9   196.82937  252.67682  196.47132
10        NaN  223.32571  208.43823
11        NaN  209.50658  209.83761
12        NaN  215.27461  249.06087
13        NaN  210.52486  158.65781
14        NaN  193.53504  199.10456
15        NaN        NaN  186.19700
16        NaN        NaN  223.02479
17        NaN        NaN  185.68525
18        NaN        NaN  213.41414
19        NaN        NaN  271.75376
answered 2020-09-10T00:07:26.867
5

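This does not directly answer the OP's question, but I found it to be an excellent solution for my case when I had unequal arrays, and I'd like to share it:

From the pandas documentation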

In [31]: d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']),
   ....:      'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
   ....: 

In [32]: df = DataFrame(d)

In [33]: df
Out[33]: 
   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4
answered 2015-09-03T18:35:27.213
3

Both of the following lines work perfectly:

pd.DataFrame.from_dict(df, orient='index').transpose() #A

pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in df.items() ])) #B (Better)

But with %timeit in Jupyter, B ran about 4x faster than A for me, which is quite impressive, especially when working with a huge data set (mainly one with a large number of columns/features).

answered 2019-03-19T09:26:18.870
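
For anyone who wants to repeat the comparison outside Jupyter, a rough sketch with the standard timeit module could look like this. The test data and the function names via_from_dict and via_series are made up for illustration, and the timings will vary with the pandas version and the shape of the data.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 200 columns with lengths 1..200.
arrays = {f'col{i}': np.arange(i + 1) for i in range(200)}


def via_from_dict(d):
    # Approach A: build the frame row-wise, then transpose.
    return pd.DataFrame.from_dict(d, orient='index').transpose()


def via_series(d):
    # Approach B: wrap each value in a Series and let pandas align them.
    return pd.DataFrame({k: pd.Series(v) for k, v in d.items()})


# Run each approach a few times and report the total elapsed seconds.
time_a = timeit.timeit(lambda: via_from_dict(arrays), number=5)
time_b = timeit.timeit(lambda: via_series(arrays), number=5)
print(f'A: {time_a:.4f}s  B: {time_b:.4f}s')
```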
3

You can also use pd.concat along axis=1 with a list of pd.Series objects:

import pandas as pd, numpy as np

d = {'A': np.array([1,2]), 'B': np.array([1,2,3,4])}

res = pd.concat([pd.Series(v, name=k) for k, v in d.items()], axis=1)

print(res)

     A  B
0  1.0  1
1  2.0  2
2  NaN  3
3  NaN  4
answered 2018-09-12T20:16:07.330
0

If you don't want it to show NaN and you have two particular lengths, adding a "space" to each remaining cell also works.

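import pandas as pd

long = [6, 4, 7, 3]
short = [5, 6]

# Pad the shorter list with spaces until both lists have the same length
for n in range(len(long) - len(short)):
    short.append(' ')

df = pd.DataFrame({'A': long, 'B': short})
# Write the result to an Excel file in the working directory (needs the xlsxwriter package)
datatoexcel = pd.ExcelWriter('example1.xlsx', engine='xlsxwriter')
df.to_excel(datatoexcel, sheet_name='Sheet1')
datatoexcel.close()  # close() writes the file; ExcelWriter.save() was removed in pandas 2.0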

   A  B
0  6  5
1  4  6
2  7   
3  3   

If the entries have more than two different lengths, it is advisable to write a function that uses a similar method.

answered 2019-08-08T16:19:46.450
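
One possible shape for such a function is sketched below. The name pad_columns and its fill parameter are hypothetical, and it assumes every value in the dict is a list-like with a length.

```python
import pandas as pd


def pad_columns(columns, fill=' '):
    """Return a DataFrame whose shorter columns are padded with `fill`.

    `pad_columns` and `fill` are hypothetical names used for illustration.
    """
    # Length of the longest column decides how much padding the others need.
    longest = max(len(values) for values in columns.values())
    padded = {
        name: list(values) + [fill] * (longest - len(values))
        for name, values in columns.items()
    }
    return pd.DataFrame(padded)


padded_df = pad_columns({'A': [6, 4, 7, 3], 'B': [5, 6], 'C': [1]})
print(padded_df)
```

Note that padding numeric columns with a string gives them the object dtype, so this is better suited to display or export than to further numeric work.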