python - .json file + timestamps + Pandas + Python

Question

I was sent a file with a .json extension (logs.json) that contains the following data (I'm only showing part of it, since there are more than 2,000 entries):

["2012-03-01T00:05:55+00:00", "2012-03-01T00:06:23+00:00", "2012-03-01T00:06:52+00:00", "2012-03-01T00:11:23+00:00", "2012-03-01T00:12:47+00:00", "2012-03-01T00:12:54+00:00", "2012-03-01T00:16:14+00:00", "2012-03-01T00:17:31+00:00", "2012-03-01T00:21:23+00:00", "2012-03-01T00:21:26+00:00", "2012-03-01T00:22:25+00:00", "2012-03-01T00:28:24+00:00", "2012-03-01T00:31:21+00:00", "2012-03-01T00:32:20+00:00", "2012-03-01T00:33:32+00:00", "2012-03-01T00:35:21+00:00", "2012-03-01T00:38:14+00:00", "2012-03-01T00:39:24+00:00", "2012-03-01T00:43:12+00:00", "2012-03-01T00:46:13+00:00", "2012-03-01T00:46:31+00:00", "2012-03-01T00:48:03+00:00", "2012-03-01T00:49:34+00:00", "2012-03-01T00:49:54+00:00", "2012-03-01T00:55:19+00:00", "2012-03-01T00:56:27+00:00", "2012-03-01T00:56:32+00:00"]

Using pandas, I did:

import pandas as pd
logs = pd.read_json('logs.json')
logs.head()

and I get the following:

                           0
0  2012-03-01T00:05:55+00:00
1  2012-03-01T00:06:23+00:00
2  2012-03-01T00:06:52+00:00
3  2012-03-01T00:11:23+00:00
4  2012-03-01T00:12:47+00:00

[5 rows x 1 columns]

Then, to assign the proper data type, including the UTC zone, I do:

logs = pd.to_datetime(logs[0], utc=True)
logs.head()

and get:

0   2012-03-01 00:05:55
1   2012-03-01 00:06:23
2   2012-03-01 00:06:52
3   2012-03-01 00:11:23
4   2012-03-01 00:12:47
Name: 0, dtype: datetime64[ns]

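Here are my questions:

Is the code above the right way to get my data into the proper format?
Where did my UTC time zone go? And what if I want to create a column with the corresponding PST time and add it to this dataset as a DataFrame?
I seem to remember that to get counts per day, week, or year I need to add .day, .week, or .year somewhere (logs.day?), but I can't figure it out, and I'm guessing that is because of the current shape of my data. How do I get counts per day? Per week? Per year? And once I have them, how would I plot the data?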

These simple questions seem far too hard for someone transitioning from R to Python for data analysis! I hope you guys can help!

score 3 · Accepted Answer

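I think there may be a bug in the tz handling here. It's certainly possible that this should convert by default (I'm surprised it doesn't, and I suspect it's because the input is just a list).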

In [21]: s = pd.read_json(js, convert_dates=[0], typ='Series')  # more honestly this is a Series

In [22]: s.head()
Out[22]:
0   2012-03-01 00:05:55
1   2012-03-01 00:06:23
2   2012-03-01 00:06:52
3   2012-03-01 00:11:23
4   2012-03-01 00:12:47
dtype: datetime64[ns]
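
On the PST part of the question, here is a minimal sketch. It assumes a reasonably recent pandas, where pd.to_datetime(..., utc=True) returns a tz-aware Series and the .dt accessor is available; the column names utc and pacific and the zone name America/Los_Angeles are my own choices, not something from the question.

```python
import pandas as pd

# A few of the timestamps from the question.
raw = [
    "2012-03-01T00:05:55+00:00",
    "2012-03-01T00:06:23+00:00",
    "2012-03-01T00:06:52+00:00",
]

# Parse as tz-aware UTC timestamps and keep them in a DataFrame column.
logs = pd.DataFrame({"utc": pd.to_datetime(pd.Series(raw), utc=True)})

# Add a second column holding the same instants expressed in Pacific time.
# tz_convert changes only the displayed zone, not the underlying instant.
logs["pacific"] = logs["utc"].dt.tz_convert("America/Los_Angeles")

print(logs)
```

Both columns describe the same moments in time, so comparisons and sorting behave identically on either one.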

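To get counts by year, month, etc., I would probably use a DatetimeIndex (at the moment, date-like columns don't have year/month etc. methods, though I think they (c|sh)ould):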

In [23]: dti = pd.DatetimeIndex(s)

In [24]: s.groupby(dti.year).size()
Out[24]:
2012    27
dtype: int64

In [25]: s.groupby(dti.month).size()
Out[25]:
3    27
dtype: int64
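
For what it's worth, later pandas releases added a .dt accessor on datetime Series, which covers the .day/.week/.year idea from the question directly. A rough sketch, assuming pandas 1.1 or newer (needed for dt.isocalendar()) and using five of the question's timestamps as sample data:

```python
import pandas as pd

# Five of the timestamps from the question, parsed as tz-aware UTC.
raw = [
    "2012-03-01T00:05:55+00:00",
    "2012-03-01T00:06:23+00:00",
    "2012-03-01T00:06:52+00:00",
    "2012-03-01T00:11:23+00:00",
    "2012-03-01T00:12:47+00:00",
]
s = pd.to_datetime(pd.Series(raw), utc=True)

# Count entries per calendar year, per ISO week number, and per date.
by_year = s.groupby(s.dt.year).size()
by_week = s.groupby(s.dt.isocalendar().week).size()
by_day = s.groupby(s.dt.date).size()

print(by_year)
print(by_week)
print(by_day)
```

Note that grouping by week number alone merges the same week from different years; for multi-year logs, group by both year and week.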

Perhaps it makes more sense to view the data as a TimeSeries:

In [31]: ts = pd.Series(1, dti)

In [32]: ts.head()
Out[32]:
2012-03-01 00:05:55    1
2012-03-01 00:06:23    1
2012-03-01 00:06:52    1
2012-03-01 00:11:23    1
2012-03-01 00:12:47    1
dtype: int64

That way you can use resample:

In [33]: ts.resample('M', how='sum')
Out[33]:
2012-03-31    27
Freq: M, dtype: int64
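
The how= argument shown above was removed in later pandas versions, where the aggregation is chained as a method call instead. A sketch of daily and weekly counts under that assumption, again using a handful of the question's timestamps; the plotting line is left as a comment because it assumes matplotlib is installed:

```python
import pandas as pd

# Five of the timestamps from the question.
raw = [
    "2012-03-01T00:05:55+00:00",
    "2012-03-01T00:06:23+00:00",
    "2012-03-01T00:06:52+00:00",
    "2012-03-01T00:11:23+00:00",
    "2012-03-01T00:12:47+00:00",
]

# One row per log entry, indexed by its timestamp, each with value 1.
dti = pd.DatetimeIndex(pd.to_datetime(raw, utc=True))
ts = pd.Series(1, index=dti)

# Summing the ones within each bin gives the number of entries per bin.
daily = ts.resample("D").sum()
weekly = ts.resample("W").sum()

print(daily)
print(weekly)

# To plot the counts (assumes matplotlib is installed):
# daily.plot(kind="bar")
```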
