python - Efficiently constructing a Pandas DataFrame from a large list of tuples/rows

Question

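I inherited a data file saved in Stata .dta format. I can load it with the scikits.statsmodels genfromdta() function. This puts my data into a 1-dimensional NumPy array, where each entry is one row of data stored as a 24-tuple.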

In [2]: st_time = time.time(); initialload = sm.iolib.genfromdta("/home/myfile.dta"); ed_time = time.time(); print (ed_time - st_time)
666.523324013

In [3]: type(initialload)
Out[3]: numpy.ndarray

In [4]: initialload.shape
Out[4]: (4809584,)

In [5]: initialload[0]
Out[5]: (19901130.0, 289.0, 1990.0, 12.0, 19901231.0, 18.0, 40301000.0, 'GB', 18242.0, -2.368063, 1.0, 1.7783716290878204, 4379.355, 66.17669677734375, -999.0, -999.0, -0.60000002, -999.0, -999.0, -999.0, -999.0, -999.0, 0.2, 371.0)

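I'm curious whether there is an efficient way to arrange this into a Pandas DataFrame. From what I've read, building a DataFrame row by row seems quite inefficient... but what are my options?

I wrote a very slow first pass that just reads each tuple as a single-row DataFrame and appends it. I'm wondering whether there is a better way.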

score 21 · Accepted Answer

pandas.DataFrame(initialload, columns=list_of_column_names)
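
To make the one-liner above concrete, here is a minimal sketch of the same idea on toy data. The array `rows`, its field names, and its values are made up for illustration and only stand in for `initialload`; the point is that the whole frame is built in one call rather than row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `initialload`: a 1-D structured array in which
# each entry is one row. Field names and values are invented for this sketch.
rows = np.array(
    [(19901130.0, 289.0, 'GB'), (19901231.0, 18.0, 'US')],
    dtype=[('date', 'f8'), ('id', 'f8'), ('country', 'U2')],
)

# A single constructor call builds the frame; when the array has named
# fields, those names become the column labels.
df_from_struct = pd.DataFrame(rows)

# A plain list of tuples works the same way, with names passed explicitly.
tuples = [(19901130.0, 289.0, 'GB'), (19901231.0, 18.0, 'US')]
df_from_tuples = pd.DataFrame.from_records(
    tuples, columns=['date', 'id', 'country']
)
```

If the loaded array already carries field names, passing `columns=` may be unnecessary; if it does not, a list such as `list_of_column_names` has to be supplied by hand.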

Answered 2012-07-10T14:44:48.687

score 3 · Accepted Answer

pandas 0.12 and later should support loading Stata format files directly (Reference).

From the documentation:

The top-level function read_stata will read a dta format file and return a DataFrame: The class StataReader will read the header of the given dta file at initialization. Its method data() will read the observations, converting them to a DataFrame which is returned:

 pd.read_stata('stata.dta')
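
As a rough, self-contained illustration of that call, the sketch below writes a tiny frame to a temporary .dta file and reads it back. The file name, column names, and values are hypothetical; with a real file, only the `pd.read_stata(path)` line would be needed.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the real file's contents (invented values).
df = pd.DataFrame({'year': [1990.0, 1991.0], 'value': [18242.0, 371.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.dta')  # hypothetical file name
    # Write a small Stata file so the example does not depend on outside data.
    df.to_stata(path, write_index=False)
    # Read the .dta file straight into a DataFrame, skipping the
    # intermediate array of tuples entirely.
    loaded = pd.read_stata(path)
```

For a file with millions of rows, reading directly this way avoids the separate array-to-DataFrame conversion step, though load time and memory use will still depend on the file.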

