python - 当值不尊重模式时，如何将 Pandas 列转换为日期类型？

Question

我有闲置的数据框：

    Timestamp           real time
0   17FEB20:23:59:50    0.003
1   17FEB20:23:59:55    0.003
2   17FEB20:23:59:57    0.012
3   17FEB20:23:59:57    02:54.8
4   17FEB20:24:00:00    0.03
5   18FEB20:00:00:00    0
6   18FEB20:00:00:02    54.211
7   18FEB20:00:00:02    0.051

如何将列转换为datetime64？

有两件事让我变得充满挑战：

列Timestamp, index4具有值: 17FEB20:24:00:00，这似乎不是有效的日期时间（尽管它是由 SAS 程序输出的......）。
该列real time不采用模式，并且似乎无法通过date_parser.

这就是我试图解决的第一列（Timestamp）：

data['Timestamp'] = pd.to_datetime(
    data['Timestamp'],
    format='%d%b%y:%H:%M:%S')

但是由于索引 4 ( 17FEB20:24:00:00) 的值，我得到： ValueError: time data '17FEB20:24:00:00' does not match format '%d%b%y:%H:%M:%S' (match)。如果我删除这条线，它确实有效，但我必须找到一种方法来解决它，因为我的数据集有数千行，我不能简单地忽略它们。也许有办法将其转换为第二天的零小时？

这是一个片段代码，用于创建上面的 dataFrame 示例，以获得一些时间来解决答案（如果需要）：

data = pd.DataFrame({
    'Timestamp':[
        '17FEB20:23:59:50',
        '17FEB20:23:59:55',
        '17FEB20:23:59:57',
        '17FEB20:23:59:57',
        '17FEB20:24:00:00',
        '18FEB20:00:00:00',
        '18FEB20:00:00:02',
        '18FEB20:00:00:02'],
    'real time': [
        '0.003',
        '0.003',
        '0.012',
        '02:54.8',
        '0.03',
        '0',
        '54.211',
        '0.051',
        ]})

感谢你的帮助！

score 1 · Accepted Answer

如果您的数据不是太大，您可能需要考虑遍历数据框。你可以做这样的事情。

for index, row in data.iterrows():
    if row['Timestamp'][8:10] == '24':
        date = (pd.to_datetime(row['Timestamp'][:7]).date() + pd.DateOffset(1)).strftime('%d%b%y').upper()
        data.loc[index, 'Timestamp'] = date + ':00:00:00'

这就是结果。

        Timestamp      real time
0   17FEB20:23:59:50    0.003
1   17FEB20:23:59:55    0.003
2   17FEB20:23:59:57    0.012
3   17FEB20:23:59:57    02:54.8
4   18FEB20:00:00:00    0.03
5   18FEB20:00:00:00    0
6   18FEB20:00:00:02    54.211
7   18FEB20:00:00:02    0.051

score 0 · Accepted Answer

以下是我的解决方法：

对于专栏Timestamp，我使用了这个回复（感谢@merit_2 在第一条评论中分享它）。
对于 column real time，我使用一些条件进行解析。

这是代码：

import os
import pandas as pd
from datetime import timedelta

# Parsing "real time" column:

## Apply mask '.000' to the microseconds
data['real time'] = [sub if len(sub.split('.')) == 1 else sub.split('.')[0]+'.'+'{:<03s}'.format(sub.split('.')[1]) for sub in data['real time'].values]

## apply mask over all '00:00:00.000'
placeholders = {
    1: '00:00:00.00',
    2: '00:00:00.0',
    3: '00:00:00.',
    4: '00:00:00',
    5: '00:00:0',
    6: '00:00:',
    7: '00:00',
    8: '00:0',
    9: '00:',
    10:'00',
    11:'0'}

for cond_len in placeholders:
    condition = data['real time'].str.len() == cond_len
    data.loc[(condition),'real time'] = placeholders[cond_len] + data.loc[(condition),'real time']

# Parsing "Timestamp" column:
selrow = data['Timestamp'].str.contains('24:00')
data['Timestamp'] = data['Timestamp'].str.replace('24:00', '00:00')
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['Timestamp'] = data['Timestamp'] + selrow * timedelta(days=1)

# Convert to columns to datetime type:
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['real time'] = pd.to_datetime(data['real time'], format='%H:%M:%S.%f')

# check results:
display(data)
display(data.dtypes)

这是输出：

    Timestamp           real time
0   2020-02-17 23:59:50 1900-01-01 00:00:00.003
1   2020-02-17 23:59:55 1900-01-01 00:00:00.003
2   2020-02-17 23:59:57 1900-01-01 00:00:00.012
3   2020-02-17 23:59:57 1900-01-01 00:02:54.800
4   2020-02-18 00:00:00 1900-01-01 00:00:00.030
5   2020-02-18 00:00:00 1900-01-01 00:00:00.000
6   2020-02-18 00:00:02 1900-01-01 00:00:54.211
7   2020-02-18 00:00:02 1900-01-01 00:00:00.051

Timestamp    datetime64[ns]
real time    datetime64[ns]

也许有一个聪明的方法可以做到这一点，但现在它适合。

python - 当值不尊重模式时，如何将 Pandas 列转换为日期类型？

2 回答 2

Related

Reference