python - Working with monthly binned data in pandas

Question

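I have a dataset that I'm analyzing in pandas, in which all data are binned monthly. The data originate from a MySQL database where all dates have the format 'YYYY-MM-01', so that, for example, all rows for October 2013 contain '2013-10-01' in the month column.

I'm currently reading the data into pandas (via a .tsv dump of the MySQL table) with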

data = pd.read_table(filename, header=None, names=('uid', 'iid', 'artist', 'tag', 'date'), index_col=indexes, parse_dates=['date'])

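This all works fine, except that any subsequent analysis in which I resample monthly always labels the dates using the month-end convention (i.e. data from October become '2013-10-31' instead of '2013-10-01'). This can cause inconsistencies: the original data have months labeled 'YYYY-MM-01', while any resampled data have months labeled 'YYYY-MM-31' (or '-30' or '-28', as appropriate).

My question is this: what is the easiest and/or fastest way to convert all the dates in my dataframe to the month-end format from the start? Keep in mind that the date is one of several levels in a MultiIndex, not a column. I think my best bet is to use a modified date_parser in my pd.read_table call that always converts the date to the month-end convention, but I'm not sure how to approach it.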

score 3 · Accepted Answer

Read your dates in as you are doing.

Create some test data. I am setting the dates to the start of the month, but it doesn't matter.

In [39]: df = DataFrame(np.random.randn(10,2),columns=list('AB'),
                        index=date_range('20130101',periods=10,freq='MS'))

In [40]: df
Out[40]: 
                   A         B
2013-01-01 -0.553482  0.049128
2013-02-01  0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638  1.903388
2013-05-01 -0.087752  1.551916
2013-06-01  1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643  1.661696
2013-09-01  0.501061 -1.455171
2013-10-01  1.343630 -2.008060

Force them to the end of the month in timestamp space, regardless of the day

In [41]: df.index = df.index.to_period().to_timestamp('M')

In [42]: df
Out[42]: 
                   A         B
2013-01-31 -0.553482  0.049128
2013-02-28  0.337975 -0.035897
2013-03-31 -0.394849 -1.755323
2013-04-30 -0.555638  1.903388
2013-05-31 -0.087752  1.551916
2013-06-30  1.000943 -0.361248
2013-07-31 -1.855171 -2.215276
2013-08-31 -0.582643  1.661696
2013-09-30  0.501061 -1.455171
2013-10-31  1.343630 -2.008060
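
The session above depends on the index carrying a monthly frequency, so that to_period() can infer it. Dates read from a file usually have no frequency attached, so here is a minimal sketch under that assumption. It passes 'M' to to_period explicitly and also shows a date-offset alternative. The sample dates and variable names are made up for illustration, and it avoids the freq argument of to_timestamp, whose accepted aliases I have not checked across pandas versions.

```python
import pandas as pd

# Month-start labels as they might arrive from the database dump (no freq attached).
idx = pd.DatetimeIndex(["2013-01-01", "2013-02-01", "2013-10-01"])

# Option 1: go through monthly periods, take the end of each period,
# then drop the time-of-day part so the labels sit at midnight.
via_period = idx.to_period("M").to_timestamp(how="end").normalize()

# Option 2: roll each date forward to its month end with a date offset.
# MonthEnd(0) leaves dates that already are a month end unchanged.
via_offset = idx + pd.offsets.MonthEnd(0)

print(via_period)
print(via_offset)
```

Both options should yield 2013-01-31, 2013-02-28 and 2013-10-31 for this input.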

Back to the start

In [43]: df.index = df.index.to_period().to_timestamp('MS')

In [44]: df
Out[44]: 
                   A         B
2013-01-01 -0.553482  0.049128
2013-02-01  0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638  1.903388
2013-05-01 -0.087752  1.551916
2013-06-01  1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643  1.661696
2013-09-01  0.501061 -1.455171
2013-10-01  1.343630 -2.008060
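
The question mentions that the date is one level of a MultiIndex rather than the whole index. The same kind of conversion can be applied to a single level by rebuilding the index. This is only a sketch: the level names, values and the "count" column are hypothetical, and pd.offsets.MonthEnd(0) is used here to roll each date to the end of its month.

```python
import pandas as pd

# Hypothetical two-level index: a user id plus a month-start date.
index = pd.MultiIndex.from_arrays(
    [[1, 1, 2], pd.to_datetime(["2013-09-01", "2013-10-01", "2013-10-01"])],
    names=["uid", "date"],
)
df = pd.DataFrame({"count": [5, 7, 3]}, index=index)

# Rebuild only the "date" level, rolling each date forward to its month end.
month_end = index.get_level_values("date") + pd.offsets.MonthEnd(0)
df.index = pd.MultiIndex.from_arrays(
    [index.get_level_values("uid"), month_end],
    names=list(index.names),
)

print(df)
```

Rebuilding the index with from_arrays keeps the row order and the other levels intact, so the data columns are untouched.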

You can also work with (and resample) the data as periods

In [45]: df.index = df.index.to_period()

In [46]: df
Out[46]: 
                A         B
2013-01 -0.553482  0.049128
2013-02  0.337975 -0.035897
2013-03 -0.394849 -1.755323
2013-04 -0.555638  1.903388
2013-05 -0.087752  1.551916
2013-06  1.000943 -0.361248
2013-07 -1.855171 -2.215276
2013-08 -0.582643  1.661696
2013-09  0.501061 -1.455171
2013-10  1.343630 -2.008060
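
For the resampling side of the question, here is a rough sketch of two ways to keep the labels consistent with 'YYYY-MM-01' data. It assumes a hypothetical daily series, and the names are mine. The first resamples with the month-start alias 'MS'. The second groups by monthly period, which avoids the start/end convention altogether.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series spanning September and October 2013.
days = pd.date_range("2013-09-01", "2013-10-31", freq="D")
s = pd.Series(np.ones(len(days)), index=days)

# Resampling with the month-start alias labels each bin "YYYY-MM-01".
by_month_start = s.resample("MS").sum()

# Grouping by monthly period sidesteps the start/end question entirely.
by_period = s.groupby(s.index.to_period("M")).sum()

print(by_month_start)
print(by_period)
```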

score 1 · Answer

Use replace() to change the day value. You can get the last day of a month using

from datetime import date
import calendar

d = date(2000,1,1)
d = d.replace(day=calendar.monthrange(d.year, d.month)[1])
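
The same idea wrapped in a small helper, as a sketch. The function name month_end is hypothetical, not something from the standard library. calendar.monthrange accounts for leap years, so February is handled correctly.

```python
import calendar
from datetime import date


def month_end(d: date) -> date:
    """Return the last day of the month that contains d."""
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.replace(day=last_day)


print(month_end(date(2000, 1, 1)))  # 2000-01-31
print(month_end(date(2000, 2, 1)))  # 2000-02-29 (leap year)
print(month_end(date(2013, 2, 1)))  # 2013-02-28
```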

Update

I have added some examples for pandas.

Example file date.csv

2013-01-01, 1
2013-02-01, 2

IPython shell log.

In [27]: import pandas as pd

In [28]: from datetime import datetime, date

In [29]: import calendar

In [30]: def parse(dt):
   ....:     dt = datetime.strptime(dt, '%Y-%m-%d')
   ....:     dt = dt.replace(day=calendar.monthrange(dt.year, dt.month)[1])
   ....:     return dt.date()
   ....:

In [31]: parse('2013-01-01')
Out[31]: datetime.date(2013, 1, 31)

In [32]: r = pd.read_csv('date.csv', header=None, names=('date', 'value'), parse_dates=['date'], date_parser=parse)

In [33]: r
Out[33]:
         date  value
0  2013-01-31      1
1  2013-02-28      2
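
Note that the date_parser argument of read_csv is deprecated as of pandas 2.0. A possible alternative, sketched below, reads the column as plain strings and converts it afterwards. The in-memory csv_text stands in for date.csv so the snippet is self-contained, and the result holds pandas Timestamps rather than datetime.date objects.

```python
import io

import pandas as pd

# Stand-in for date.csv so the sketch runs without a file on disk.
csv_text = "2013-01-01, 1\n2013-02-01, 2\n"

r = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["date", "value"],
    skipinitialspace=True,
)

# Parse the strings explicitly, then roll each date forward to its month end.
r["date"] = pd.to_datetime(r["date"], format="%Y-%m-%d") + pd.offsets.MonthEnd(0)

print(r)
```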
