python - Plotting counts from a Pandas dataframe using Start_Date and End_Date

Question

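I am trying to plot the daily follower count for various Twitter handles. The result should look similar to what you see below, but be filterable by more than one Twitter handle:

Normally, I would do this by simply appending each new dataset pulled from Twitter to the original table, along with the date the pull was logged. However, that would leave me with a million rows within just a few days. It also would not let me see clearly when a user dropped off.

As an alternative, after pulling my data from Twitter, I structure my pandas dataframe like this: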

Follower_ID          Handles    Start_Date  End_Date
100                  x          30/05/2017  NaN
101                  x          21/04/2017  29/05/2017
201                  y          14/06/2017  NaN
100                  y          16/06/2017  28/06/2017

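Where:

Handles: the account whose followers I am pulling
Follower_ID: the user who follows the handle

So, for example, if I were Follower_ID 100, I could follow both handle x and handle y.

I am wondering what the best way to prepare the data would be (pivot, clean through a function, groupby) so that it can be plotted accordingly. Any ideas?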

score 1 · Accepted Answer

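I ended up using iterrows, which is a naive approach, so there may be a more efficient way that takes advantage of pandas reshaping and the like. But my idea was to write a function that takes in your dataframe and the handle you want to plot, and returns another dataframe with that handle's daily follower counts. To do that, the function

filters the df to the desired handle only,
takes each date range (for example, 21/04/2017 to 29/05/2017),
turns that into a pandas date_range, and
puts all the dates in a single list.

At that point, collections.Counter on the single list is a simple way to tally the results by day.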
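As a small illustration of that counting step, here is a sketch with two made-up, overlapping date ranges. The dates are invented for the example and are not taken from the question's data.

```python
from collections import Counter

import pandas as pd

# Two hypothetical follow periods that overlap on 2 and 3 January 2017.
ranges = [
    ("2017-01-01", "2017-01-03"),
    ("2017-01-02", "2017-01-04"),
]

all_dates = []
for start, end in ranges:
    # pd.date_range includes both endpoints and defaults to daily frequency
    all_dates.extend(pd.date_range(start, end).tolist())

# Counter maps each day to the number of ranges that cover it
daily = Counter(all_dates)

for day in sorted(daily):
    print(day.date(), daily[day])
```

The two days covered by both ranges get a count of 2, and the first and last day get a count of 1.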

One note is that null End_Dates should be filled in with whatever end date you want on your chart. I've called that max_date when wrangling the data. Altogether:

from io import StringIO
from collections import Counter
import pandas as pd

def get_counts(df, handle):
    """Inputs: your dataframe and the handle
    you want to plot.

    Returns a dataframe of daily follower counts.
    """

    # filters the df to the desired handle only
    df_handle = df[df['Handles'] == handle]

    all_dates = []

    for _, row in df_handle.iterrows():
        # Take each date range (for example, 21/04/2017 to 29/05/2017),
        # turn that into a pandas `date_range`, and
        # put all the dates in a single list
        all_dates.extend(pd.date_range(row['Start_Date'],
                                       row['End_Date']) \
                           .tolist())

    counts = pd.DataFrame.from_dict(Counter(all_dates), orient='index') \
                         .rename(columns={0: handle}) \
                         .sort_index()

    return counts

That's the function. Now to read in and wrangle your data...

data = StringIO("""Follower_ID          Handles    Start_Date  End_Date
100                  x          30/05/2017  NaN
101                  x          21/04/2017  29/05/2017
201                  y          14/06/2017  NaN
100                  y          16/06/2017  28/06/2017""")

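# sep=r'\s+' splits on runs of whitespace
# (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(data, sep=r'\s+')

# pandas timestamps (so that we can use pd.date_range);
# the dates are day-first, so state the format explicitly
df['Start_Date'] = pd.to_datetime(df['Start_Date'], format='%d/%m/%Y')
df['End_Date'] = pd.to_datetime(df['End_Date'], format='%d/%m/%Y')

# fill in missing end dates
max_date = pd.Timestamp('2017-06-30')
df['End_Date'] = df['End_Date'].fillna(max_date)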

print(get_counts(df, 'y'))

The last line prints this for handle y:

            y
2017-06-14  1
2017-06-15  1
2017-06-16  2
2017-06-17  2
2017-06-18  2
2017-06-19  2
2017-06-20  2
2017-06-21  2
2017-06-22  2
2017-06-23  2
2017-06-24  2
2017-06-25  2
2017-06-26  2
2017-06-27  2
2017-06-28  2
2017-06-29  1
2017-06-30  1

You can plot this dataframe with your preferred package.
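Since the question asks about more than one handle, and the iterrows approach is admittedly naive, here is one possible vectorized sketch. It is not the method from the answer above: it swaps in a different technique, recording +1 on each start date and -1 on the day after each end date, then taking a cumulative sum per handle. The names events, net, and daily_counts are made up for this example, and max_date is the same assumed chart end date as before.

```python
from io import StringIO

import pandas as pd

data = StringIO("""Follower_ID Handles Start_Date End_Date
100 x 30/05/2017 NaN
101 x 21/04/2017 29/05/2017
201 y 14/06/2017 NaN
100 y 16/06/2017 28/06/2017""")

df = pd.read_csv(data, sep=r"\s+")
for col in ("Start_Date", "End_Date"):
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

max_date = pd.Timestamp("2017-06-30")  # assumed last day of the chart
df["End_Date"] = df["End_Date"].fillna(max_date)

# +1 on the day a follow starts, -1 on the day after it ends
starts = pd.DataFrame({"Handles": df["Handles"],
                       "date": df["Start_Date"],
                       "change": 1})
ends = pd.DataFrame({"Handles": df["Handles"],
                     "date": df["End_Date"] + pd.Timedelta(days=1),
                     "change": -1})
events = pd.concat([starts, ends], ignore_index=True)

# Net change per day, with one column per handle
net = (events.groupby(["date", "Handles"])["change"]
             .sum()
             .unstack(fill_value=0))

# Add the days on which nothing changed, then accumulate running counts
all_days = pd.date_range(df["Start_Date"].min(), max_date)
daily_counts = net.reindex(all_days, fill_value=0).cumsum()

print(daily_counts.tail())

# daily_counts.plot() would draw one line per handle (requires matplotlib)
```

The y column should agree with the counts printed above for 14 to 30 June, and selecting a subset of columns, for example daily_counts[['x', 'y']], is one way to filter by handle before plotting.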
