In Python, my data looks like this, with 500,000 rows:

    Time                 Count
    1-1-1900 10:41:00    1
    3-1-1900 09:54:00    1
    4-1-1900 15:45:00    1
    5-1-1900 18:41:00    1
    4-1-1900 15:45:00    1

I want to create a new column that bins the times into quarter-hour intervals, like this:

    Bin            Count
    9:00-9:15      2
    9:15-9:30      4
    9:30-9:45      4
    10:00-10:15    4

I know how to create bins in general, but the timestamps are giving me trouble. Can anyone help me? Thanks in advance!

3 Answers

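I know this is late, but better late than never. I ran into a similar requirement and solved it with the [pandas][1] library.

  • First, load the data into a pandas DataFrame

  • Second, check that the TIME column is a datetime column and not of object type (a string or something else). You can check this with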

    df.info()

For example, in my case the TIME column was initially of object type, i.e. a string

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17640 entries, 0 to 17639
Data columns (total 3 columns):
TIME           17640 non-null object
value          17640 non-null int64
dtypes: int64(1), object(2)
memory usage: 413.5+ KB
  • If that is the case, convert it to a pandas datetime with this command (skip it if the column is already in datetime format)

    df['TIME'] = pd.to_datetime(df['TIME'])

df.info() now shows the updated dtype

 <class 'pandas.core.frame.DataFrame'>
 RangeIndex: 17640 entries, 0 to 17639
 Data columns (total 3 columns):
 TIME           17640 non-null datetime64[ns]
 value          17640 non-null int64
 dtypes: datetime64[ns](2), int64(1)
 memory usage: 413.5 KB
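
If the timestamps are strings in day-month-year order, as they appear to be in the question, passing an explicit format removes any day/month ambiguity. This is a minimal sketch under that assumption: the format string is a guess based on the question's sample rows, and the column names `TIME` and `value` follow this answer.

```python
import pandas as pd

# Sample rows from the question; column names follow this answer.
df = pd.DataFrame({
    "TIME": ["1-1-1900 10:41:00", "3-1-1900 09:54:00", "4-1-1900 15:45:00"],
    "value": [1, 1, 1],
})

# Convert only when the column is not already a datetime dtype.
if not pd.api.types.is_datetime64_any_dtype(df["TIME"]):
    # Assumed format: day-month-year with a 24-hour clock.
    df["TIME"] = pd.to_datetime(df["TIME"], format="%d-%m-%Y %H:%M:%S")

print(df.dtypes)
```
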
  • Now our DataFrame is ready for the magic :)

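       # Assumes: import numpy as np; import pandas as pd
       # Count the rows per 15-minute bucket ('value' is the data column shown by df.info()).
       counts = pd.Series(index=df.TIME, data=np.array(df['value'])).resample('15T').count()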
       print(counts[:3])
    
     TIME
     2017-07-01 00:00:00    3
     2017-07-01 00:15:00    3
     2017-07-01 00:30:00    3
     Freq: 15T, dtype: int64
    

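    In the command above, 15T means 15-minute buckets. You can replace it with D for daily buckets, 2D for 2-day buckets, M for monthly buckets, 2M for 2-month buckets, and so on. You can read the details of these aliases at this [link][2]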
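
To illustrate how the bucket width changes the result, here is a small sketch that resamples the question's five sample rows at two frequencies. It spells the minute alias as `15min`, which pandas accepts for minutes (newer pandas releases warn about the `T` spelling), and it uses `.sum()` on a series of ones so that empty buckets show up as 0. The variable names are illustrative only.

```python
import pandas as pd

# Sample rows from the question, parsed as day-month-year.
times = pd.to_datetime(
    ["1-1-1900 10:41:00", "3-1-1900 09:54:00", "4-1-1900 15:45:00",
     "5-1-1900 18:41:00", "4-1-1900 15:45:00"],
    format="%d-%m-%Y %H:%M:%S",
)
events = pd.Series(1, index=times).sort_index()

# Same data, two bucket widths; empty buckets get a total of 0.
per_quarter_hour = events.resample("15min").sum()
per_day = events.resample("D").sum()

print(per_quarter_hour[per_quarter_hour > 0])
print(per_day)
```

Note that these buckets are tied to the calendar date as well as the time of day, so 09:45 on two different days lands in two different buckets.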

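  • As shown above, the bucketed counts are now done. For the time ranges, use the command below with the same time span as your data. In my case the data covered 3 months, so I created a 3-month range.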

 r = pd.date_range('2017-07', '2017-09', freq='15T')
 x = np.repeat(np.array(r), 2, axis=0)[1:-1]
 # now reshape data to fit in Dataframe
 x = np.array(x)[:].reshape(-1, 2)
 # now fit in dataframe and print it
 final_df = pd.DataFrame(x, columns=['start', 'end'])
 print(final_df[:3])
                  start                 end
0   2017-07-01 00:00:00 2017-07-01 00:15:00
1   2017-07-01 00:15:00 2017-07-01 00:30:00
2   2017-07-01 00:30:00 2017-07-01 00:45:00

The date ranges are done as well
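
The start/end pairs can also be built by slicing the range itself, without the repeat-and-reshape step. A minimal sketch; the one-hour span and the names `edges` and `intervals` are chosen only for illustration:

```python
import pandas as pd

# Bucket edges, 15 minutes apart; the one-hour span is only for illustration.
edges = pd.date_range("2017-07-01 00:00", "2017-07-01 01:00", freq="15min")

# Each bucket runs from one edge to the next.
intervals = pd.DataFrame({"start": edges[:-1], "end": edges[1:]})
print(intervals)
```
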

  • Now attach the counts to the date ranges to get the final result

     final_df['count'] = np.array(counts)  # 'counts' is the resampled series computed earlier
     print(final_df[:3])
    
                  start                 end count
0   2017-07-01 00:00:00 2017-07-01 00:15:00     3
1   2017-07-01 00:15:00 2017-07-01 00:30:00     3
2   2017-07-01 00:30:00 2017-07-01 00:45:00     3
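
Assigning the counts positionally only works when the resampled series and the date range cover exactly the same buckets. One possible way to make the step robust, sketched here with hypothetical event times, is to look the counts up by each bucket's start time:

```python
import pandas as pd

# Hypothetical event times; the data covers fewer buckets than the range below.
times = pd.to_datetime(["2017-07-01 00:05", "2017-07-01 00:10", "2017-07-01 00:50"])
counts = pd.Series(1, index=times).resample("15min").sum()

edges = pd.date_range("2017-07-01 00:00", "2017-07-01 01:30", freq="15min")
final_df = pd.DataFrame({"start": edges[:-1], "end": edges[1:]})

# Look up each bucket by its start time; buckets outside the data get 0.
final_df["count"] = counts.reindex(final_df["start"], fill_value=0).to_numpy()
print(final_df)
```
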

Hope someone finds it useful.

[1]: https://pypi.org/project/pandas/
[2]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html#pandas.Series.resample

answered 2019-10-11T19:57:24.377


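Well, I'm not sure this is what you asked for. If it isn't, I suggest improving your question, because it is hard to understand what you need. In particular, it would be good to see what you have already tried.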

from collections import namedtuple
from datetime import time
from io import StringIO
from itertools import product


MAX_HOURS = 23
MAX_MINUTES = 59

TimeEntry = namedtuple("TimeEntry", ["time", "count"])


def process_data_file(data_file):
    """
    Parse tab-separated "timestamp<TAB>count" lines from an opened file object.
    Only the hour and minute of each timestamp are kept.
    """
    data_to_bin = []
    for line in data_file:
        t, count = line.rstrip().split("\t")
        hour, minute = map(int, t.split()[-1].split(":")[:2])
        data_to_bin.append(TimeEntry(time(hour, minute), int(count)))
    return data_to_bin


def make_milestones(min_hour=0, max_hour=MAX_HOURS, interval=15):
    """Return the bin boundaries in ascending order, one every `interval` minutes."""
    minutes = [m for m in range(MAX_MINUTES + 1) if not m % interval]
    hours = range(min_hour, max_hour + 1)
    return [time(*milestone) for milestone in product(hours, minutes)]


def bin_time(data_to_bin, milestones):
    """
    Sum the counts per bin, walking entries and milestones from latest to earliest.
    Times after the last milestone (23:45 by default) are not covered.
    """
    milestones = list(milestones)  # work on a copy; pop() would empty the caller's list
    data_to_bin = sorted(data_to_bin, key=lambda entry: entry.time, reverse=True)
    binned_data = []
    current_count = 0
    upper = milestones.pop()
    lower = milestones.pop()
    for entry in data_to_bin:
        while not lower <= entry.time <= upper:
            if current_count:
                binned_data.append(TimeEntry("{}-{}".format(str(lower)[:-3], str(upper)[:-3]), current_count))
                current_count = 0
            upper, lower = lower, milestones.pop()
        current_count += entry.count
    # Flush the last open bin; without this the earliest bin is silently dropped.
    if current_count:
        binned_data.append(TimeEntry("{}-{}".format(str(lower)[:-3], str(upper)[:-3]), current_count))
    return binned_data


data_file = StringIO("""1-1-1900 10:41:00\t1
3-1-1900 09:54:00\t1
4-1-1900 15:45:00\t1
5-1-1900 18:41:00\t1
4-1-1900 15:45:00\t1""")


binned_time = bin_time(process_data_file(data_file), make_milestones())
for entry in binned_time:
    print(entry.time, entry.count, sep="\t")

Output:

18:30-18:45 1
15:45-16:00 2
10:30-10:45 1
09:45-10:00 1
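
The milestone idea can also be sketched with the standard library's `bisect` module, which finds the bin for each entry directly instead of walking the milestone list. This is an alternative sketch, not the approach used in the code above; `STARTS`, `bin_label`, and `totals` are hypothetical names, and the entries are the question's sample rows reduced to time of day.

```python
from bisect import bisect_right
from collections import Counter
from datetime import time

# Bin starts at every quarter hour of the day, in ascending order.
STARTS = [time(h, m) for h in range(24) for m in (0, 15, 30, 45)]


def bin_label(t):
    """Return 'HH:MM-HH:MM' for the quarter-hour bin that contains t."""
    i = bisect_right(STARTS, t) - 1        # index of the last bin start <= t
    start = STARTS[i]
    end = STARTS[(i + 1) % len(STARTS)]    # wraps around to 00:00 after 23:45
    return "{:%H:%M}-{:%H:%M}".format(start, end)


entries = [(time(10, 41), 1), (time(9, 54), 1), (time(15, 45), 1),
           (time(18, 41), 1), (time(15, 45), 1)]

totals = Counter()
for t, count in entries:
    totals[bin_label(t)] += count

for label in sorted(totals):
    print(label, totals[label], sep="\t")
```

Because each lookup is independent, the entries do not need to be sorted first, and times after 23:45 fall into a final 23:45-00:00 bin.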
answered 2015-05-10T22:29:32.627


Just an attempt without pandas:

from collections import defaultdict
import datetime as dt
from pprint import pprint


def bin_ts(dtime, delta):
    # Round down to a multiple of delta, measured from datetime.min (a midnight).
    # This avoids dtime.timestamp(), which depends on the local time zone and
    # can fail for dates before 1970 on some platforms.
    return dtime - (dtime - dt.datetime.min) % delta


src_data = [
    ('1-1-1900 10:41:00', 1),
    ('3-1-1900 09:54:00', 1),
    ('4-1-1900 15:45:00', 1),
    ('5-1-1900 18:41:00', 1),
    ('4-1-1900 15:45:00', 1)
]

# Parse the day-month-year timestamps.
ts_data = [(dt.datetime.strptime(ts, '%d-%m-%Y %H:%M:%S'), count) for ts, count in src_data]

bin_size = dt.timedelta(minutes=15)

# Replace each timestamp with the start of its bin.
binned = [(bin_ts(ts, bin_size), count) for ts, count in ts_data]


def time_fmt(ts):
    # Label a bin by its time of day only, e.g. '09:45 - 10:00'.
    res = "%s - %s" % (ts.strftime('%H:%M'), (ts + bin_size).strftime('%H:%M'))
    return res


binned_time = [(time_fmt(ts), count) for ts, count in binned]

# Sum the counts per label; the date is ignored, so bins merge across days.
cnts = defaultdict(int)
for label, count in binned_time:
    cnts[label] += count

output = list(cnts.items())

output.sort(key=lambda x: x[0])

pprint(output)

Result:

[('09:45 - 10:00', 1),
 ('10:30 - 10:45', 1),
 ('15:45 - 16:00', 2),
 ('18:30 - 18:45', 1)]
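
When the bin width divides an hour evenly, the bin start can also be found by rounding the minute down directly. A minimal sketch of that variation on the question's sample rows; `BIN_MINUTES`, `rows`, and `totals` are illustrative names, and the day-month-year format is assumed as above.

```python
from collections import Counter
from datetime import datetime, timedelta

rows = [
    ("1-1-1900 10:41:00", 1),
    ("3-1-1900 09:54:00", 1),
    ("4-1-1900 15:45:00", 1),
    ("5-1-1900 18:41:00", 1),
    ("4-1-1900 15:45:00", 1),
]

BIN_MINUTES = 15  # assumed to divide 60 evenly

totals = Counter()
for text, count in rows:
    ts = datetime.strptime(text, "%d-%m-%Y %H:%M:%S")
    # Round the minute down to the start of its bin and drop the seconds.
    start = ts.replace(minute=ts.minute - ts.minute % BIN_MINUTES,
                       second=0, microsecond=0)
    end = start + timedelta(minutes=BIN_MINUTES)
    label = "%s - %s" % (start.strftime("%H:%M"), end.strftime("%H:%M"))
    totals[label] += count

for label, total in sorted(totals.items()):
    print(label, total)
```
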
answered 2022-01-30T15:47:37.583