Suppose we have a list like this, showing the count of each object at specific timestamps (mm-dd-yyyy-hour-minute):

A = [
 [
    ['07-07-2012-21-04', 'orange', 1],
    ['08-16-2012-08-57', 'orange', 1],
    ['08-18-2012-03-30', 'orange', 1],
    ['08-18-2012-03-30', 'orange', 1],
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-04-09', 'orange', 1],
    ['08-19-2012-04-09', 'orange', 1],
    ['08-19-2012-05-21', 'orange', 1],
    ['08-19-2012-05-21', 'orange', 1],
    ['08-19-2012-06-03', 'orange', 1],
    ['08-19-2012-07-51', 'orange', 1],
    ['08-19-2012-08-17', 'orange', 1],
    ['08-19-2012-08-17', 'orange', 1]
 ],
 [
    ['07-07-2012-21-04', 'banana', 1]
 ],
 [
    ['07-07-2012-21-04', 'mango', 1],
    ['08-16-2012-08-57', 'mango', 1],
    ['08-18-2012-03-30', 'mango', 1],
    ['08-18-2012-03-30', 'mango', 1],
    ['08-19-2012-03-58', 'mango', 1],
    ['08-19-2012-03-58', 'mango', 1],
    ['08-19-2012-04-09', 'mango', 1],
    ['08-19-2012-04-09', 'mango', 1],
    ['08-19-2012-05-21', 'mango', 1],
    ['08-19-2012-05-21', 'mango', 1],
    ['08-19-2012-06-03', 'mango', 1],
    ['08-19-2012-07-51', 'mango', 1],
    ['08-19-2012-08-17', 'mango', 1],
    ['08-19-2012-08-17', 'mango', 1]
 ]

]

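What I need to do in A is fill in all the missing dates (from the minimum date in A to the maximum date) for each object, with a value of 0. Once the missing dates and their corresponding values (0) are in place, I want to sum up the values for each date so that no date is repeated, for each sublist.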

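What I am currently trying is the following: I split A's dates and values into two separate lists (named u and v), convert each sublist to a pandas Series, and assign the dates as its index. So for each pair in zip(u, v):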

def generate(values, indices):

    indices = flatten(indices)

    date_index = DatetimeIndex(indices)
    ts = Series(values, index=date_index)

    ts.reindex(date_range(min(date_index), max(date_index)))

    return ts

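But here the reindexing raises an exception. What I am looking for is a purely Pythonic way (without pandas), based entirely on list comprehensions or even numpy arrays.

There is also a second problem, about hourly aggregation: if all the dates are the same and only the hours differ, I would like to fill in all the missing hours of the day and then repeat the same aggregation process per hour, with the missing hours filled with 0 values.

Thanks in advance.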

1 Answer

How about this:

from collections import defaultdict, OrderedDict
from datetime import datetime, timedelta
from itertools import chain, groupby

flat = sorted((datetime.strptime(d, '%m-%d-%Y-%H-%M').date(), f, c)
              for (d, f, c) in chain(*A))
counts = [(d, f, sum(e[2] for e in l))
          for (d, f), l
          in groupby(flat, key=lambda t: (t[0], t[1]))]

# let's assume that there is some data
start = counts[0][0]
end = counts[-1][0]
result = OrderedDict((start+timedelta(days=i), defaultdict(int))
                     for i in range((end-start).days+1))
for day, data in groupby(counts, key=lambda d: d[0]):
    result[day].update((f, c) for d, f, c in data)
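
If what you want in the end is one zero-filled series per sublist rather than a dict keyed by date, a minimal sketch could look like the following. The helper name `fill_daily` and the small `sample` list are made up for illustration, and the date range is assumed to span the minimum and maximum day over all sublists.

```python
from collections import Counter
from datetime import datetime, timedelta

def fill_daily(sublists, fmt='%m-%d-%Y-%H-%M'):
    """Return one list of (date, total) pairs per sublist, zero-filled."""
    # Sum the counts per calendar day, separately for each sublist.
    totals = []
    for rows in sublists:
        per_day = Counter()
        for stamp, _name, count in rows:
            per_day[datetime.strptime(stamp, fmt).date()] += count
        totals.append(per_day)

    # The range runs from the earliest to the latest day seen anywhere.
    all_days = [day for per_day in totals for day in per_day]
    if not all_days:
        return [[] for _ in sublists]
    start, end = min(all_days), max(all_days)
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

    # A Counter returns 0 for a missing key, so absent days become 0.
    return [[(day, per_day[day]) for day in days] for per_day in totals]

# Illustrative data only (not the full list A from the question).
sample = [
    [['07-07-2012-21-04', 'orange', 1],
     ['07-09-2012-03-30', 'orange', 1],
     ['07-09-2012-08-57', 'orange', 1]],
    [['07-07-2012-21-04', 'banana', 1]],
]
filled = fill_daily(sample)
```

Here `filled[0]` holds the orange totals for 7, 8 and 9 July (1, 0, 2), and `filled[1]` holds the banana totals for the same three days (1, 0, 0).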

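My question is: do we really need to fill in the non-existent dates? I can easily imagine cases where that would produce a large, even dangerously large, amount of data. I think it is better to use a simple general-purpose function, plus a generator if you want to list the filled-in entries somewhere: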

from collections import defaultdict
from datetime import datetime, timedelta
from itertools import chain, groupby

def aggregate(data, resolution='daily'):
    assert resolution in ['hourly', 'daily']
    if resolution == 'hourly':
        round_dt = lambda dt: dt.replace(minute=0, second=0, microsecond=0)
    else:
        round_dt = lambda dt: dt.date()

    # Flatten the sublists of `data` and round each timestamp.
    flat = sorted((round_dt(datetime.strptime(d, '%m-%d-%Y-%H-%M')), f, c)
                  for (d, f, c) in chain(*data))
    # Sum the counts for each (rounded timestamp, name) pair.
    counts = [(d, f, sum(e[2] for e in l))
              for (d, f), l
              in groupby(flat, key=lambda t: (t[0], t[1]))]
    result = {}
    for day, group in groupby(counts, key=lambda d: d[0]):
        per_name = result[day] = defaultdict(int)
        per_name.update((f, c) for _, f, c in group)
    return result

def xaggregate(data, resolution='daily'):
    aggregated = aggregate(data, resolution)
    curr = min(aggregated.keys())
    end = max(aggregated.keys())
    interval = timedelta(days=1) if resolution == 'daily' else timedelta(seconds=3600)
    while curr <= end:
        # None is a sensible value in case of missing data, I think
        yield curr, aggregated.get(curr)
        curr += interval
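
For the hourly case where every hour of the day should appear, and not only the hours between the first and last observation, a rough sketch along the same lines might be the following. The name `hourly_counts` is hypothetical, and the sketch assumes that all rows fall on the same calendar day and belong to a single sublist.

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_counts(rows, fmt='%m-%d-%Y-%H-%M'):
    """Yield (hour_start, total) for all 24 hours of the day in rows."""
    per_hour = Counter()
    for stamp, _name, count in rows:
        dt = datetime.strptime(stamp, fmt)
        # Round down to the start of the hour before summing.
        per_hour[dt.replace(minute=0, second=0, microsecond=0)] += count
    if not per_hour:
        return
    # Assumption: every row is on the same calendar day.
    midnight = min(per_hour).replace(hour=0)
    for h in range(24):
        hour = midnight + timedelta(hours=h)
        # Missing hours are not stored; Counter reports them as 0.
        yield hour, per_hour[hour]

# Illustrative data only.
rows = [
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-03-58', 'orange', 1],
    ['08-19-2012-04-09', 'orange', 1],
    ['08-19-2012-08-17', 'orange', 1],
]
hourly = list(hourly_counts(rows))
```

Because it is a generator, nothing is materialised until you iterate over it, which keeps the zero-filled hours from taking up memory unless you really need them in a list.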

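In general, my advice is that you should not use lists as fixed-order records (I mean things like ['07-07-2012-21-04', 'mango', 1]). I think a tuple is better suited for this purpose, and collections.namedtuple is better still.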
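
As a small illustration of the namedtuple suggestion, something like this could work. The type name `Entry`, its field names, and the helper `to_entry` are all invented for the example.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record type; the field names are chosen for illustration.
Entry = namedtuple('Entry', ['timestamp', 'name', 'count'])

def to_entry(row, fmt='%m-%d-%Y-%H-%M'):
    """Convert one [stamp, name, count] row into an Entry."""
    stamp, name, count = row
    return Entry(datetime.strptime(stamp, fmt), name, count)

entry = to_entry(['07-07-2012-21-04', 'mango', 1])
print(entry.name, entry.timestamp.date(), entry.count)  # mango 2012-07-07 1
```

An `Entry` still unpacks and sorts like a plain tuple, so it can be dropped into the `sorted` and `groupby` calls without other changes, while `entry.count` reads better than `e[2]`.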

answered 2013-09-01T13:24:24.497