python - datetime64 values corrupted when added to a Pandas DataFrame

Question

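I have a question about Python + NumPy + Pandas.

I have a list of timestamps with millisecond precision, encoded as strings. I then round them to a resolution of 10 ms, which works fine. The error appears when I add the rounded timestamps to a DataFrame as a new column: the values of the datetime64 objects are completely corrupted.

Am I doing something wrong, or is this a Pandas/NumPy bug?

By the way, I suspect this error only occurs on Windows: I did not notice it when I tried the same code on a Mac yesterday (I have not verified this yet).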

import numpy
import pandas as pd

# We create a list of strings. 
time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
# Then we create a time array, rounded to 10ms (actually floored, 
# not rounded), everything seems to be fine here.
rounded_time = numpy.array(time_str_arr, dtype="datetime64[10ms]")
rounded_time 

# Then we create a Pandas DataFrame and assign the time array as a 
# column to it. The datetime64 is destroyed.
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
  'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df = df.assign(wrong_time=rounded_time)
df

The output I get:

    one two wrong_time
a   1.0 1.0 1974-10-01 18:11:07.585
b   2.0 2.0 1974-10-01 18:11:07.625
c   3.0 3.0 1974-10-01 18:11:07.645
d   NaN 4.0 1974-10-01 18:11:07.665

Output of pd.show_versions():

INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

score 1 · Accepted Answer

In my opinion this is a bug, because internally the numpy.datetime64 values are evidently coerced to Timestamps.
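The corrupted values in the question's output are consistent with the raw 10 ms tick count being read as a plain millisecond count, that is, with the unit multiplier of 10 being dropped. The following is a minimal NumPy-only sketch of that arithmetic. It is an inference from the numbers shown in the question, not a confirmed description of what pandas does internally, and the variable names `ticks`, `misread`, and `correct` are only illustrative.

```python
import numpy as np

time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']

# Same array as in the question: each value is stored as a count of
# 10 ms ticks since the Unix epoch.
rounded_time = np.array(time_str_arr, dtype="datetime64[10ms]")

# Raw integer tick counts (number of 10 ms steps since 1970-01-01).
ticks = rounded_time.astype("int64")

# Assumption: reading those counts as if they were 1 ms steps
# reproduces the dates from 1974 shown in the question.
misread = ticks.astype("datetime64[ms]")

# A proper unit conversion multiplies by the 10 ms step instead.
correct = rounded_time.astype("datetime64[ns]")

print(misread)
print(correct)
```

If this reading is right, it also explains why converting to `datetime64[ns]` before handing the array to pandas avoids the problem: the multiplier is applied by NumPy before pandas ever sees the data.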

For me, using to_datetime works:

df = df.assign(wrong_time=pd.to_datetime(rounded_time))
print (df)
   one  two              wrong_time
a  1.0  1.0 2017-06-30 13:51:15.850
b  2.0  2.0 2017-06-30 13:51:16.250
c  3.0  3.0 2017-06-30 13:51:16.450
d  NaN  4.0 2017-06-30 13:51:16.650

Another solution is to cast to ns:

df = df.assign(wrong_time=rounded_time.astype('datetime64[ns]'))
print (df)
   one  two              wrong_time
a  1.0  1.0 2017-06-30 13:51:15.850
b  2.0  2.0 2017-06-30 13:51:16.250
c  3.0  3.0 2017-06-30 13:51:16.450
d  NaN  4.0 2017-06-30 13:51:16.650

score 0 · Accepted Answer

I opened an issue in the pandas Git repository and got a suggested solution from Jeff Reback: instead of creating the unusual 10 ms datetime64 objects, we simply floor the timestamps with the floor() function:

In [16]: # We create a list of strings. 
...: time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
...:                 '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']

In [17]: pd.to_datetime(time_str_arr).floor('10ms')
Out[17]: DatetimeIndex(['2017-06-30 13:51:15.850000', '2017-06-30 13:51:16.250000', '2017-06-30 13:51:16.450000', '2017-06-30 13:51:16.650000'], dtype='datetime64[ns]', freq=None)

Solution from https://github.com/pandas-dev/pandas/issues/17183
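To tie this back to the original question, here is a short sketch that floors the timestamps with pandas and then assigns them to the DataFrame from the question. The column name `floored_time` is made up for illustration, and `round('10ms')` is included only to show how true rounding would differ from flooring for these inputs.

```python
import pandas as pd

time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']

# Parse the strings first, then floor to a 10 ms resolution in pandas.
parsed = pd.to_datetime(time_str_arr)
floored = parsed.floor('10ms')

# For comparison only: rounding to the nearest 10 ms instead of flooring.
# The last timestamp (…16.659) floors to .650 but rounds to .660.
rounded = parsed.round('10ms')

# Same DataFrame as in the question.
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# A DatetimeIndex is assigned by position, so its length must match the
# number of rows in df (four here).
df = df.assign(floored_time=floored)
print(df)
```

Because the values never pass through a multiplied unit such as `datetime64[10ms]`, this route sidesteps the conversion that corrupted the column in the question.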
