Suppose I have data in this format in a df:

id      sta                   end                   dur
40433   2020-01-08 05:06:01   2020-01-08 05:08:14   133
40433   2020-09-22 12:01:26   2020-09-22 12:31:34   1808
40433   2020-09-22 12:05:00   2020-09-22 13:05:00   3600

Either in the same df or in a new df, I want to add records like the following:

id      sta                  end                   h1  dur
40433   2020-01-08 05:06:01  2020-01-08 05:08:14   05  133
40433   2020-09-22 12:01:26  2020-09-22 12:31:34   12  1808
40433   2020-09-22 12:05:00  2020-09-22 13:05:00   12  3300
40433   2020-09-22 12:05:00  2020-09-22 13:05:00   13  300

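dur is in seconds. For example, the 3600-second record from 12:05:00 to 13:05:00 is split into 3300 seconds in hour 12 and 300 seconds in hour 13.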

I want to group by id, then by day (extracted from sta), and then aggregate dur for each hour (h1, h2, etc.) for each id.

1 Answer

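Revised answer based on your comments. For a faster turnaround, after trying a few other approaches, I did the array math with a few conversions. There may be a more efficient way, and I'm not sure how it performs at scale, but it does work. One caveat: if your duration totals more than 24 hours, every hour column would hold a full 60 minutes, so I leave that condition alone and you can handle it however you need: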

import cudf
import cupy as cp

#If your duration goes over 24 hours total, ALL hour column values will be all 60 minutes.

sta = ['2020-01-08 05:06:01', '2020-09-22 12:01:26', '2020-09-22 12:05:00', '2020-09-22 01:15:00', '2020-09-22 21:05:00']
end = ['2020-01-08 05:08:14', '2020-09-22 12:31:34', '2020-09-22 13:05:00', '2020-09-22 08:05:00', '2020-09-23 01:05:00']

#put it in a dataframe
df = cudf.DataFrame({'sta': sta, 'end':end})
print(df.head())

#the object is a string, so let's convert it to date time
df['sta']= df['sta'].astype('datetime64[s]')
df['end']=df['end'].astype('datetime64[s]')

df['dur']=(df['end']-df['sta']).astype('int64')

#create new df of same type to convert to cupy (to preserve datetime values)
df2=cudf.DataFrame() 
df2['dur']=(df['end']-df['sta']).astype('int64')
df2['min_sta'] =df['sta'].dt.minute.astype('int64')
df2['min_end']= df['end'].dt.minute.astype('int64')
df2['h_sta']= df['sta'].dt.hour.astype('int64')
df2['h_end']= df['end'].dt.hour.astype('int64')
df2['day']=df['sta'].dt.day.astype('int64')
print(df2)

#convert df2's values from df to cupy array (you can use numpy if on pandas)
a = cp.fromDlpack(df2.to_dlpack())
print(a)

#create new temp cupy array b to contain minute duration per hour.  This algo will work with numpy by using numpy instead of cupy
b = cp.zeros((len(a),24))
for j in range(0,len(a)):
    hours = int((a[j][0]/3600)+(a[j][1]/60))
    if(hours==0): # within same hour
        b[j][a[j][3]] = int(a[j][0]/60)
    elif(hours==1): #you could probably delete this condition.
        b[j][a[j][3]] = 60-a[j][1]
        b[j][a[j][4]] = a[j][2]
    else:
        b[j][a[j][3]] = 60-a[j][1]
        if(hours<24): #all array elements will be all 60 minutes if duration is over 24 hours
            if(a[j][3]+hours<24):
                b[j][a[j][3]+1:a[j][3]+hours]=60
                b[j][a[j][4]] = a[j][2]
            else:
                b[j][a[j][3]+1:24]=60
                b[j][0:(a[j][3]+1+hours)%24]=60
                b[j][a[j][4]] = a[j][2]
# bring cupy array b back to a df. 
reshaped_arr = cp.asfortranarray(b)
cpdf = cudf.from_dlpack(reshaped_arr.toDlpack())
print(cpdf.head())

#concat the original and cupy df
df = cudf.concat([df, cpdf], axis=1)
print(df.head())
#you can rename the columns with "h" as you wish
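
As a plain-Python cross-check of the array approach above (not cuDF code, and I have not looked at how it would perform at scale), here is a minimal sketch that splits each interval at clock-hour boundaries in seconds, as in the question's expected output, and then sums per id, day and hour. The names `split_by_hour`, `FMT`, `rows` and `totals` are made up for illustration. Taking the day from each hour bucket rather than from sta, so that hours after midnight count toward the next calendar day, is my assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp layout used in the question


def split_by_hour(sta, end):
    """Yield (hour_start, seconds) for each clock hour touched by [sta, end)."""
    cur = sta
    while cur < end:
        hour_start = cur.replace(minute=0, second=0, microsecond=0)
        # Cut the piece at the next clock-hour boundary or at `end`, whichever is first.
        stop = min(hour_start + timedelta(hours=1), end)
        yield hour_start, int((stop - cur).total_seconds())
        cur = stop


# Rows from the question: (id, sta, end).
rows = [
    (40433, "2020-01-08 05:06:01", "2020-01-08 05:08:14"),
    (40433, "2020-09-22 12:01:26", "2020-09-22 12:31:34"),
    (40433, "2020-09-22 12:05:00", "2020-09-22 13:05:00"),
]

# Sum seconds per (id, day, hour).
# Assumption: the day is that of the hour bucket, not of sta.
totals = defaultdict(int)
for rec_id, sta, end in rows:
    pieces = split_by_hour(datetime.strptime(sta, FMT), datetime.strptime(end, FMT))
    for hour_start, secs in pieces:
        totals[(rec_id, hour_start.date().isoformat(), hour_start.hour)] += secs

for key, secs in sorted(totals.items()):
    print(key, secs)
```

With the question's three rows, hour 12 on 2020-09-22 sums to 5108 seconds (1808 + 3300) and hour 13 gets the remaining 300. Because the loop walks hour by hour, intervals that cross midnight or run longer than 24 hours need no special case.
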
Answered 2020-12-15T17:48:54.810