pandas - Splitting a Series containing lists of strings into multiple columns

Question

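I'm using pandas to perform some string matching on a Twitter dataset.

I've imported a CSV of tweets and indexed it by date. I then created a new column containing the text matches: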

In [1]:
import pandas as pd
indata = pd.read_csv('tweets.csv')
indata.index = pd.to_datetime(indata["Date"])
indata["matches"] = indata.Tweet.str.findall("rudd|abbott")
only_results = pd.Series(indata["matches"])
only_results.head(10)

Out[1]:
Date
2013-08-06 16:03:17          []
2013-08-06 16:03:12          []
2013-08-06 16:03:10          []
2013-08-06 16:03:09          []
2013-08-06 16:03:08          []
2013-08-06 16:03:07          []
2013-08-06 16:03:07    [abbott]
2013-08-06 16:03:06          []
2013-08-06 16:03:02          []
2013-08-06 16:03:00      [rudd]
Name: matches, dtype: object

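What I ultimately want is a DataFrame grouped by day/month, with the different search terms as columns that I can then plot.

I came across what looked like the perfect solution in another SO answer (https://stackoverflow.com/a/16637607/2034487), but when I tried to apply it to this Series I got an exception: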

In [2]: only_results.apply(lambda x: pd.Series(1,index=x)).fillna(0)
Out [2]: Exception - Traceback (most recent call last)
...
Exception: Reindexing only valid with uniquely valued Index objects

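I'd really like to be able to make these changes within the DataFrame so that I can apply and re-apply groupby conditions and plot efficiently. I'd also love to learn more about how the .apply() method works.

Thanks in advance.

Update after successful answer

The problem was duplicates within the "matches" column that I hadn't noticed. I iterated through that column to remove the duplicates and then used @Jeff's original solution linked above. This worked, and I can now use .groupby() on the result to see daily, hourly, etc. trends. Here is an example of the resulting plot: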

In [3]: successful_run = only_results.apply(lambda x: pd.Series(1,index=x)).fillna(0)
In [4]: successful_run.groupby([successful_run.index.day,successful_run.index.hour]).sum().plot()

Out [4]: <matplotlib.axes.AxesSubplot at 0x110b51650>

[Figure: plot grouped by day and hour]
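
The update does not show the de-duplication code itself. A minimal sketch of one way it could look, using a small made-up Series in place of the real tweet data (the sample values and the name `deduped` are assumptions, not the asker's code):

```python
import pandas as pd

# Made-up stand-in for only_results: lists of matches indexed by timestamp.
only_results = pd.Series(
    [["rudd", "rudd"], [], ["abbott", "rudd"], ["abbott"]],
    index=pd.to_datetime([
        "2013-08-06 16:03:00",
        "2013-08-06 16:03:07",
        "2013-08-06 17:10:00",
        "2013-08-07 09:00:00",
    ]),
    name="matches",
)

# Drop repeated terms within each list so every per-row index is unique.
deduped = only_results.apply(lambda terms: sorted(set(terms)))

# The linked solution: one indicator column per term, 0 where a term is absent.
successful_run = deduped.apply(lambda terms: pd.Series(1, index=terms)).fillna(0)

# Sum the indicators per (day, hour), as in In [4] above.
grouped = successful_run.groupby(
    [successful_run.index.day, successful_run.index.hour]
).sum()
print(grouped)
```

Without the `sorted(set(...))` step, the first row would produce an index of `['rudd', 'rudd']`, which is what triggers the "Reindexing only valid with uniquely valued Index objects" error.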

score 1 · Accepted Answer

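You have some duplicated results (e.g. "rudd" appears more than once in a single tweet), hence the exception (see below).

I think it would be preferable to count occurrences rather than build lists with findall (pandas data structures aren't really designed to hold lists, although str.findall produces them).
I'd suggest something like this: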

In [1]: s = pd.Series(['aa', 'aba', 'b'])

In [2]: pd.DataFrame({key: s.str.count(key) for key in ['a', 'b']})
Out[2]: 
   a  b
0  2  0
1  2  1
2  0  1

Note (the exception is caused by the duplicated 'a's found in the first two rows):

In [3]: s.str.findall('a').apply(lambda x: pd.Series(1,index=x)).fillna(0)
#InvalidIndexError: Reindexing only valid with uniquely valued Index objects
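
Applied to the tweet data, the counting approach might look like the sketch below. The sample tweets, the `terms` list, and the variable names are made up for illustration; only `str.count` and the `groupby` call come from this thread.

```python
import pandas as pd

# Made-up stand-in for the tweets CSV, indexed by timestamp.
tweets = pd.DataFrame(
    {"Tweet": ["rudd and rudd again", "nothing here", "abbott vs rudd", "abbott"]},
    index=pd.to_datetime([
        "2013-08-06 16:03:00",
        "2013-08-06 16:03:07",
        "2013-08-06 17:10:00",
        "2013-08-07 09:00:00",
    ]),
)

terms = ["rudd", "abbott"]

# One integer column per search term; repeated mentions are simply counted,
# so no list-valued column (and no duplicate-index problem) ever appears.
counts = pd.DataFrame({term: tweets["Tweet"].str.count(term) for term in terms})

# Total mentions per (day, hour), ready for .plot().
hourly = counts.groupby([counts.index.day, counts.index.hour]).sum()
print(hourly)
```

Note that `str.count` treats its argument as a regular expression, so search terms containing regex metacharacters would need escaping (for example with `re.escape`).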

score 1 · Accepted Answer

Reset the index first, then use the solution you mentioned:

In [28]: s
Out[28]:
Date
2013-08-06 16:03:17          []
2013-08-06 16:03:12          []
2013-08-06 16:03:10          []
2013-08-06 16:03:09          []
2013-08-06 16:03:08          []
2013-08-06 16:03:07          []
2013-08-06 16:03:07    [abbott]
2013-08-06 16:03:06          []
2013-08-06 16:03:02          []
2013-08-06 16:03:00      [rudd]
Name: matches, dtype: object

In [29]: df = s.reset_index()

In [30]: df.join(df.matches.apply(lambda x: pd.Series(1, index=x)).fillna(0))
Out[30]:
                 Date   matches  abbott  rudd
0 2013-08-06 16:03:17        []       0     0
1 2013-08-06 16:03:12        []       0     0
2 2013-08-06 16:03:10        []       0     0
3 2013-08-06 16:03:09        []       0     0
4 2013-08-06 16:03:08        []       0     0
5 2013-08-06 16:03:07        []       0     0
6 2013-08-06 16:03:07  [abbott]       1     0
7 2013-08-06 16:03:06        []       0     0
8 2013-08-06 16:03:02        []       0     0
9 2013-08-06 16:03:00    [rudd]       0     1

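Unless you have an explicit use case for a DatetimeIndex (usually one involving some kind of resampling, and with no duplicates), it's better to put the dates in a column, since that is more flexible than keeping them as the index, especially when that index contains duplicates.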
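
To illustrate that flexibility: a date column can be grouped at whatever granularity is needed, and repeated timestamps cause no trouble. A small sketch with made-up rows (the data and the names `by_day` and `by_hour` are illustrative assumptions):

```python
import pandas as pd

# Made-up indicator data; note the repeated 16:03:07 timestamp.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2013-08-06 16:03:07",
        "2013-08-06 16:03:07",
        "2013-08-06 17:00:00",
        "2013-08-07 09:00:00",
    ]),
    "abbott": [1, 0, 1, 0],
    "rudd": [0, 1, 1, 1],
})

# Group by calendar day: normalize() sets the time of day to midnight.
by_day = df.groupby(df["Date"].dt.normalize())[["abbott", "rudd"]].sum()

# Group by hour of day, regardless of which day it was.
by_hour = df.groupby(df["Date"].dt.hour)[["abbott", "rudd"]].sum()

print(by_day)
print(by_hour)
```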

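As for the apply method, it does slightly different things for different objects. For example, DataFrame.apply() by default applies the passed-in callable to each column, but you can pass axis=1 to apply it to each row instead.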
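
A tiny sketch of the two axes, using a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Default (axis=0): the callable receives each column as a Series.
per_column = df.apply(lambda col: col.sum())

# axis=1: the callable receives each row as a Series.
per_row = df.apply(lambda row: row.sum(), axis=1)

print(per_column)  # one value per column label
print(per_row)     # one value per row label
```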

Series.apply() applies the passed-in callable to each element of the Series instance. For the very clever solution provided by @Jeff, this is what happens:

In [12]: s
Out[12]:
Date
2013-08-06 16:03:17          []
2013-08-06 16:03:12          []
2013-08-06 16:03:10          []
2013-08-06 16:03:09          []
2013-08-06 16:03:08          []
2013-08-06 16:03:07          []
2013-08-06 16:03:07    [abbott]
2013-08-06 16:03:06          []
2013-08-06 16:03:02          []
2013-08-06 16:03:00      [rudd]
Name: matches, dtype: object

In [13]: pd.lib.map_infer(s.values, lambda x: pd.Series(1, index=x)).tolist()
Out[13]:
[Series([], dtype: int64),
 Series([], dtype: int64),
 Series([], dtype: int64),
 Series([], dtype: int64),
 Series([], dtype: int64),
 Series([], dtype: int64),
 abbott    1
dtype: int64,
 Series([], dtype: int64),
 Series([], dtype: int64),
 rudd    1
dtype: int64]

In [14]: pd.core.frame._to_arrays(_13, columns=None)
Out[14]:
(array([[ nan,  nan,  nan,  nan,  nan,  nan,   1.,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,   1.]]),
 Index([u'abbott', u'rudd'], dtype=object))

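Each empty Series in Out[13] is given the value nan to indicate that it has no value at either of our column labels. In this case, the column index is Index([u'abbott', u'rudd'], dtype=object). Where a Series does have a value at a column label, that value is kept.

Keep in mind that these are low-level details that users normally don't have to worry about. I was curious, so I followed the trail through the code.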
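
The helpers used in In [13] and In [14] are private internals and may not exist under those names in current pandas releases. As a rough sketch, assuming only public API, the same two stages can be reproduced like this (the shortened sample Series and the variable names are made up):

```python
import pandas as pd

# Shortened, made-up version of the Series shown above.
s = pd.Series([[], ["abbott"], [], ["rudd"]], name="matches")

# Stage 1: turn every list into a small Series whose index is the list itself.
per_row = [pd.Series(1, index=terms) for terms in s]

# Stage 2: the DataFrame constructor aligns those Series on the union of their
# indexes, which become the columns; missing labels are filled with NaN.
frame = pd.DataFrame(per_row, index=s.index)

# Replace NaN with 0 to get plain indicator columns.
filled = frame.fillna(0).astype(int)
print(filled)
```

Stage 2 is also where a duplicated label inside one of the lists fails, because alignment requires each per-row index to be unique.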
