
I have some data with timestamps and locations, like this:

A  2013-02-05 19:45:00    (39.94, -86.159)
A  2013-02-05 19:55:00    (39.94, -86.159)
A  2013-02-05 20:00:00   (39.777, -85.995)
A  2013-02-05 20:05:00   (39.775, -85.978)
B  2013-02-05 22:20:00   (39.935, -86.159)
B  2013-02-05 22:25:00   (39.935, -86.159)
B  2013-02-05 23:55:00   (39.951, -86.151)
B  2013-02-06 00:00:00   (39.951, -86.151)
B  2013-02-06 00:05:00   (39.906, -86.196)
C  2013-02-06 00:25:00    (39.82, -86.249)
C  2013-02-06 00:30:00    (39.82, -86.249)
C  2013-02-06 02:45:00   (41.498, -81.527)
C  2013-02-06 02:55:00   (41.498, -81.527)
C  2013-02-06 04:35:00    (39.82, -86.249)
C  2013-02-06 04:40:00    (39.82, -86.249)

For each user and each day, I need a histogram of the number of times the user stayed in one location continuously. To get there, I want to mark each continuous period during which the location stays the same, per user and per day.

How would I go about that in Python with pandas?

A location can repeat for a user within one day: for user C, the location (39.82, -86.249) occurs again later. Such cases should be treated as separate continuous periods.


2 Answers


I think you are looking for pd.Series.shift

import pandas as pd

x = pd.Series([1, 3, 3, 2, 3, 3])

x
0    1
1    3
2    3
3    2
4    3
5    3

x.shift(-1)
0     3
1     3
2     2
3     3
4     3
5   NaN

(x != x.shift(-1)).sum()
4
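
As a minimal sketch of how the same comparison can also label the runs (not just count them): comparing against the previous row instead of the next one flags the start of each run, and a cumulative sum turns those flags into a run id. The names `starts`, `run_id` and `run_lengths` are only illustrative.

```python
import pandas as pd

x = pd.Series([1, 3, 3, 2, 3, 3])

# True wherever the value differs from the previous one, i.e. at the
# start of every run (the first row is compared against NaN, so it is True).
starts = x != x.shift()

# The cumulative sum of the run starts gives every row the id of its run.
run_id = starts.cumsum()

# Number of rows in each run, indexed by run id.
run_lengths = x.groupby(run_id).size()

print(run_id.tolist())
print(run_lengths.tolist())
```

Here the four runs get ids 1 to 4, matching the count of 4 above.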

Assuming the data in the question is

df[['COL1', 'COL2', 'COL3']]

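then this should give you the number of continuous same-location runs for each user and day; a location that recurs later in the day is counted again. I'm not sure this is exactly what you want, but it should help you get started.

df['DATE'] = df.COL2.apply(lambda s: pd.to_datetime(s).date())
# Compare each location with the next row's; every mismatch marks the end of a run.
df.groupby(['COL1', 'DATE']).apply(lambda sdf: (sdf.COL3 != sdf.COL3.shift(-1)).sum())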
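
To go from the run count to the histogram the question asks for, something along these lines might work. It is only a sketch under a few assumptions: the sample data is loaded with the COL1/COL2/COL3 names used above, COL3 holds the location as a plain string, and the rows are already sorted by time within each user. The `RUN` and `LENGTH` columns are names made up for this example.

```python
import pandas as pd

# Sample rows from the question; the location is kept as a string (assumption).
rows = [
    ("A", "2013-02-05 19:45:00", "(39.94, -86.159)"),
    ("A", "2013-02-05 19:55:00", "(39.94, -86.159)"),
    ("A", "2013-02-05 20:00:00", "(39.777, -85.995)"),
    ("A", "2013-02-05 20:05:00", "(39.775, -85.978)"),
    ("B", "2013-02-05 22:20:00", "(39.935, -86.159)"),
    ("B", "2013-02-05 22:25:00", "(39.935, -86.159)"),
    ("B", "2013-02-05 23:55:00", "(39.951, -86.151)"),
    ("B", "2013-02-06 00:00:00", "(39.951, -86.151)"),
    ("B", "2013-02-06 00:05:00", "(39.906, -86.196)"),
    ("C", "2013-02-06 00:25:00", "(39.82, -86.249)"),
    ("C", "2013-02-06 00:30:00", "(39.82, -86.249)"),
    ("C", "2013-02-06 02:45:00", "(41.498, -81.527)"),
    ("C", "2013-02-06 02:55:00", "(41.498, -81.527)"),
    ("C", "2013-02-06 04:35:00", "(39.82, -86.249)"),
    ("C", "2013-02-06 04:40:00", "(39.82, -86.249)"),
]
df = pd.DataFrame(rows, columns=["COL1", "COL2", "COL3"])
df["DATE"] = pd.to_datetime(df["COL2"]).dt.date

# Previous location within the same user and day (missing at each group start).
prev = df.groupby(["COL1", "DATE"])["COL3"].shift()

# A run starts at the first row of a user/day or when the location changes.
df["RUN"] = (df["COL3"] != prev).cumsum()

# Rows per run, then how often each run length occurs per user and day.
run_lengths = df.groupby(["COL1", "DATE", "RUN"]).size()
hist = (
    run_lengths.reset_index(name="LENGTH")
    .groupby(["COL1", "DATE", "LENGTH"])
    .size()
)
print(hist)
```

Because the run id comes from a cumulative sum rather than from the location value, user C's second visit to (39.82, -86.249) ends up as its own run. Note that user B's stay at (39.951, -86.151) is split at midnight, since runs are taken per day.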
Answered 2013-03-27T16:01:10.980

Do you mean something like this?

In [5]: df
Out[5]: 
    0                   1       2       3
0   A 2013-02-05 19:45:00  39.940 -86.159
1   A 2013-02-05 19:55:00  39.940 -86.159
2   A 2013-02-05 20:00:00  39.777 -85.995
3   A 2013-02-05 20:05:00  39.775 -85.978
4   B 2013-02-05 22:20:00  39.935 -86.159
5   B 2013-02-05 22:25:00  39.935 -86.159
6   B 2013-02-05 23:55:00  39.951 -86.151
7   B 2013-02-06 00:00:00  39.951 -86.151
8   B 2013-02-06 00:05:00  39.906 -86.196
9   C 2013-02-06 00:25:00  39.820 -86.249
10  C 2013-02-06 00:30:00  39.820 -86.249
11  C 2013-02-06 02:45:00  41.498 -81.527
12  C 2013-02-06 02:55:00  41.498 -81.527
13  C 2013-02-06 04:35:00  39.820 -86.249
14  C 2013-02-06 04:40:00  39.820 -86.249

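In [6]: def gb(df, *args, **kwargs):
   ...:     # Assumes numpy is imported as np and df has the default 0..n-1 index.
   ...:     for k, g in df.groupby(*args, **kwargs):
   ...:         # Positions where the row index jumps, i.e. the run was interrupted
   ...:         breaks = np.where(np.diff(g.index.values) != 1)[0] + 1
   ...:         for pos in np.split(np.arange(len(g)), breaks):
   ...:             if len(pos) >= 2: yield k, g.iloc[pos]
   ...:             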

In [7]: group_args = [0, df[1].apply(lambda x:x.date()), 2 , 3]

In [8]: for key, grp in gb(df, group_args, sort=False):
   ...:     print(key)
   ...:     print(grp)
   ...:     print('-'*10)
   ...:  

Prints:

('A', datetime.date(2013, 2, 5), 39.94, -86.159)
   0                   1      2       3
0  A 2013-02-05 19:45:00  39.94 -86.159
1  A 2013-02-05 19:55:00  39.94 -86.159
----------
('B', datetime.date(2013, 2, 5), 39.935, -86.159)
   0                   1       2       3
4  B 2013-02-05 22:20:00  39.935 -86.159
5  B 2013-02-05 22:25:00  39.935 -86.159
----------
('C', datetime.date(2013, 2, 6), 39.82, -86.249)
    0                   1      2       3
9   C 2013-02-06 00:25:00  39.82 -86.249
10  C 2013-02-06 00:30:00  39.82 -86.249
----------
('C', datetime.date(2013, 2, 6), 39.82, -86.249)
    0                   1      2       3
13  C 2013-02-06 04:35:00  39.82 -86.249
14  C 2013-02-06 04:40:00  39.82 -86.249
----------
('C', datetime.date(2013, 2, 6), 41.498, -81.527)
    0                   1       2       3
11  C 2013-02-06 02:45:00  41.498 -81.527
12  C 2013-02-06 02:55:00  41.498 -81.527
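
If a summary per continuous period is more convenient than printing each sub-frame, a variation like the following might do. It is a rough sketch, not the approach above: it does not rely on the row index, the column names `user`, `time`, `lat`, `lon` are invented for readability (the frame above uses 0 to 3), and the rows are assumed to be sorted by time within each user.

```python
import pandas as pd

# Same sample rows as the frame above, with named columns (assumption).
rows = [
    ("A", "2013-02-05 19:45:00", 39.940, -86.159),
    ("A", "2013-02-05 19:55:00", 39.940, -86.159),
    ("A", "2013-02-05 20:00:00", 39.777, -85.995),
    ("A", "2013-02-05 20:05:00", 39.775, -85.978),
    ("B", "2013-02-05 22:20:00", 39.935, -86.159),
    ("B", "2013-02-05 22:25:00", 39.935, -86.159),
    ("B", "2013-02-05 23:55:00", 39.951, -86.151),
    ("B", "2013-02-06 00:00:00", 39.951, -86.151),
    ("B", "2013-02-06 00:05:00", 39.906, -86.196),
    ("C", "2013-02-06 00:25:00", 39.820, -86.249),
    ("C", "2013-02-06 00:30:00", 39.820, -86.249),
    ("C", "2013-02-06 02:45:00", 41.498, -81.527),
    ("C", "2013-02-06 02:55:00", 41.498, -81.527),
    ("C", "2013-02-06 04:35:00", 39.820, -86.249),
    ("C", "2013-02-06 04:40:00", 39.820, -86.249),
]
df = pd.DataFrame(rows, columns=["user", "time", "lat", "lon"])
df["time"] = pd.to_datetime(df["time"])
df["day"] = df["time"].dt.date

# Previous coordinates within the same user and day (NaN at each group start).
prev = df.groupby(["user", "day"])[["lat", "lon"]].shift()

# A run starts at the first row of a user/day or when either coordinate changes.
new_run = (df["lat"] != prev["lat"]) | (df["lon"] != prev["lon"])
df["run"] = new_run.cumsum()

# One row per continuous period: where it was, when it started and ended,
# and how many observations it contains.
runs = (
    df.groupby(["user", "day", "run"])
    .agg(
        lat=("lat", "first"),
        lon=("lon", "first"),
        start=("time", "min"),
        end=("time", "max"),
        n=("time", "count"),
    )
    .reset_index()
)

# Keep only periods with at least two observations, as gb() does.
long_runs = runs[runs["n"] >= 2]
print(long_runs)
```

On the sample data this keeps the same five periods as the output above, listed in chronological order within each user.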
Answered 2013-03-27T17:50:40.350