python - Efficient matricization of a pandas DataFrame

Question

This is my first StackOverflow question.

I have a Pandas DataFrame that looks like this:

String1 String2 String3 value
word1 word2 word3 5.6
word4 word5 word6 123.4
...

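This DataFrame comes out of a very long processing chain over a large amount of text. (As a side note, I am approaching memory limits and am now looking into HDFStores.)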

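Now I want to do linear algebra on this table after converting it into a (sparse?) Panel or some other efficient data structure that fills the blanks with 0. That is, I want to create a table whose rows are the String3 values and whose columns are String1 x String2 pairs, and then do linear algebra on the rows. However, I also want to be able to do the same with any other column: for example, take String1 as the rows and build the columns from String2 x String3.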

I have been experimenting with Panels and pivot tables, but they don't seem quite right, and they frequently run out of memory.

In general, what is the right way to do this with Pandas or Python (2.7)?

Edited to add this example:

The output table would look like this:

String1String2 (word1,word2) (word1,word5) (word4,word2) (word4,word5) ...
String3
word3 5.6 0 0 0 ...
word6 0 0 0 123.4 ...

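The number of columns is essentially |String1| x |String2|. Alternatively, String3 as the columns and String1String2 as the rows would also be fine, since I can do the operations on the column Series.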

Further edit to add the memory problem:

In [1]: import pandas as pd

In [2]: A = pd.load("file.df")

In [3]: A 
Out[3]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18506532 entries, 0 to 18506531
Columns: 4 entries, 0 to value
dtypes: float64(1), object(3)

In [4]: B = A[A[1] == 'xyz']

In [5]: C = B.pivot_table('value', [1,2], 0)

It crashes with a MemoryError at line 160 of reshape.pyc. This is pandas version 0.11.0.

score 1 · Accepted Answer

You can do this with pivot_table:

In [11]: res = df.pivot_table('value', 'String3', ['String1', 'String2'])

In [12]: res
Out[12]: 
String1  word1  word4
String2  word2  word5
String3              
word3      5.6    NaN
word6      NaN  123.4
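For reference, here is a self-contained sketch of the same pivot with keyword arguments, assuming a recent pandas on Python 3 rather than the 0.11.0 / Python 2.7 setup from the question. The toy frame only mirrors the two sample rows shown there; `fill_value=0` is an optional extra that zero-fills missing cells in the same call. Note that `pivot_table` averages duplicate (row, column) entries by default.

```python
import pandas as pd

# Toy data mirroring the two sample rows from the question (illustrative only).
df = pd.DataFrame({
    "String1": ["word1", "word4"],
    "String2": ["word2", "word5"],
    "String3": ["word3", "word6"],
    "value": [5.6, 123.4],
})

# Rows are String3, columns are the observed (String1, String2) pairs.
# fill_value=0 replaces the NaN cells left for combinations that never occur.
by_string3 = df.pivot_table(values="value", index="String3",
                            columns=["String1", "String2"], fill_value=0)

# The same call with the roles swapped: String1 as rows,
# (String2, String3) pairs as columns.
by_string1 = df.pivot_table(values="value", index="String1",
                            columns=["String2", "String3"], fill_value=0)

print(by_string3)
print(by_string1)
```

Only pairs that actually occur in the data become columns here, so this stays much smaller than the full |String1| x |String2| grid.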

This result may be enough, but if you want the blank columns as well, you can use itertools.product.

In [13]: from itertools import product

In [14]: res = res.reindex(columns=list(product(df['String1'], df['String2'])))

In [15]: res.columns.names = ['String1', 'String2']

In [16]: res
Out[16]: 
String1  word1         word4       
String2  word2  word5  word2  word5
String3                            
word3      5.6    NaN    NaN    NaN
word6      NaN    NaN    NaN  123.4
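One caveat for longer tables: `product(df['String1'], df['String2'])` iterates over every row value, so repeated words would yield repeated column labels. A possible variant, assuming a recent pandas, builds the column grid from the unique values with `pd.MultiIndex.from_product`. The toy frame again just mirrors the question's sample rows.

```python
import pandas as pd

# Toy data mirroring the two sample rows from the question (illustrative only).
df = pd.DataFrame({
    "String1": ["word1", "word4"],
    "String2": ["word2", "word5"],
    "String3": ["word3", "word6"],
    "value": [5.6, 123.4],
})

res = df.pivot_table(values="value", index="String3",
                     columns=["String1", "String2"])

# Every (String1, String2) combination, built from the unique values only,
# with the level names set up front.
full_cols = pd.MultiIndex.from_product(
    [df["String1"].unique(), df["String2"].unique()],
    names=["String1", "String2"],
)

# Columns that were absent from the pivot come back as all-NaN.
res_full = res.reindex(columns=full_cols)
print(res_full)
```

Keep in mind that the full grid has |String1| x |String2| dense columns, which may be exactly what exhausts memory on a table of the size described in the question.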

And fill the blanks with 0:

In [17]: res.fillna(0)
Out[17]: 
String1  word1         word4       
String2  word2  word5  word2  word5
String3                            
word3      5.6      0      0    0.0
word6      0.0      0      0  123.4
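Since the question mentions memory limits, a dense zero-filled table may not be practical at 18 million rows. The following is only a sketch of one alternative, not something from the original answer: it assumes SciPy is installed and builds a `scipy.sparse` matrix directly from the long table, so the zeros are never stored. The column numbering scheme (`s1_code * n2 + s2_code`) is my own choice for illustration.

```python
import pandas as pd
from scipy import sparse

# Toy data mirroring the two sample rows from the question (illustrative only).
df = pd.DataFrame({
    "String1": ["word1", "word4"],
    "String2": ["word2", "word5"],
    "String3": ["word3", "word6"],
    "value": [5.6, 123.4],
})

# Map each distinct label to an integer code (order of first appearance).
row_codes, row_labels = pd.factorize(df["String3"])
s1_codes, s1_labels = pd.factorize(df["String1"])
s2_codes, s2_labels = pd.factorize(df["String2"])

# Column j stands for the pair (s1_labels[j // n2], s2_labels[j % n2]).
n2 = len(s2_labels)
col_codes = s1_codes * n2 + s2_codes

# COO keeps only the observed cells; converting to CSR sums any duplicates.
mat = sparse.coo_matrix(
    (df["value"].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(s1_labels) * n2),
).tocsr()

# Example of row-wise linear algebra: the Gram matrix of the String3 rows.
gram = mat.dot(mat.T)
print(mat.shape, mat.nnz)
print(gram.toarray())
```

To use a different column as the rows, swap which columns are passed to `pd.factorize`. The label arrays returned by `factorize` let you translate matrix indices back to words.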

Note: in 0.13, cartesian_product will be available in pandas.tools.util.
