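I'm new to Pandas and to programming in general, so any help would be greatly appreciated.
I'm having trouble converting a column of data in a Pandas DataFrame, loaded from an hdf5 file, to datetime objects. The data is too large to work with as a text file, so I converted it to an hdf5 file with the following code: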
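import urllib.request
import zipfile
import pandas as pd

# download the zip file (`file` is the URL and `dir` the local path, both defined elsewhere) and open the text file inside it
urllib.request.urlretrieve(file, dir)
z = zipfile.ZipFile(dir)
data = z.open(z.namelist()[0])
# column names from text file
colnames = ['Patent#', 'App#', 'Small', 'Filing Date', 'Issue Date', 'Event Date', 'Event Code']
# load the data in chunks and concat into single DataFrame
mfees = pd.read_table(data, index_col=0, sep=r'\s+', header=None, names=colnames, chunksize=1000, iterator=True)
df = pd.concat([chunk for chunk in mfees], ignore_index=False)
# close files (inner file handle first, then the archive)
data.close()
z.close()
# convert to hdf5 file (write the DataFrame, not the file handle)
df.to_hdf('mfees.h5', key='raw_data', format='table')
# load the DataFrame back from the hdf5 file
data = pd.read_hdf('mfees.h5', key='raw_data')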
After this, my data is in the following format:
data['Filing Date']
Output:
Patent#
4287053 19801222
4287053 19801222
4289713 19810105
4289713 19810105
4289713 19810105
4289713 19810105
4289713 19810105
4289713 19810105
Name: Filing Date, Length: 11887679, dtype: int64
However, when I use the to_datetime function, I get the following:
data['Filing Date'] = pd.to_datetime(data['Filing Date'])
data['Filing Date']
Output:
Patent#
4287053 1970-01-01 00:00:00.019801222
4287053 1970-01-01 00:00:00.019801222
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4291808 1970-01-01 00:00:00.019801212
4291808 1970-01-01 00:00:00.019801212
4292069 1970-01-01 00:00:00.019810123
4292069 1970-01-01 00:00:00.019810123
4292069 1970-01-01 00:00:00.019810123
4292069 1970-01-01 00:00:00.019810123
Name: Filing Date, Length: 11887679, dtype: datetime64[ns]
I'm not sure why I'm getting the above output for the datetime objects. What can I do to fix this and convert the dates to datetime objects correctly? Thanks!
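For reference, here is a minimal sketch of what seems to be happening, using a small hypothetical Series in place of the real 'Filing Date' column. It assumes the values are int64 in YYYYMMDD form, as in the output above. Without a format, to_datetime appears to read the integers as nanoseconds since the Unix epoch, which would explain the 1970-01-01 timestamps. Casting to str and passing an explicit format is one possible way to get calendar dates instead:

```python
import pandas as pd

# Hypothetical sample standing in for the real column (int64, YYYYMMDD).
filing = pd.Series([19801222, 19810105],
                   index=pd.Index([4287053, 4289713], name='Patent#'),
                   name='Filing Date')

# No format given: the integers are treated as nanoseconds since 1970-01-01.
as_epoch = pd.to_datetime(filing)

# Cast to str and state the layout explicitly so each value is parsed as a date.
as_dates = pd.to_datetime(filing.astype(str), format='%Y%m%d')

print(as_epoch)
print(as_dates)
```

On the full DataFrame, the same idea would be data['Filing Date'] = pd.to_datetime(data['Filing Date'].astype(str), format='%Y%m%d'), assuming every value in the column is a valid eight-digit date.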