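I would like to know how to measure frequency.
By frequency I mean, for example, how often an online user visits an online channel.
In other words, I want an index that indicates whether a given user is a heavy user.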

Here is a sample dataset.

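library(data.table)
library(stringr)  # for str_pad()

# 90 rows: 3 visits per day for 2020-01-01 .. 2020-01-30, users drawn at random
d <- data.table(
  timestamp = paste0('202001', str_pad(rep(1:30, each = 3), width = 2, side = 'left', pad = '0')),
  user = sample(x = LETTERS[1:5], size = 90, replace = TRUE),
  value = rnorm(90)
)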

head(d[user == 'B'], 10)

#     timestamp user       value
# 1: 2020-01-01    B -0.05698572
# 2: 2020-01-01    B -0.16677841
# 3: 2020-01-03    B  0.06953150
# 4: 2020-01-04    B  0.29374589
# 5: 2020-01-05    B  0.59508578
# 6: 2020-01-06    B -0.16237362
# 7: 2020-01-07    B -0.34246076
# 8: 2020-01-07    B -0.04670312
# 9: 2020-01-08    B  1.92830277
# 10: 2020-01-08    B  2.04701468


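Q1. How can I show that user B is a heavy user between 20200101 and 20200108 (ignoring the value column)?
Q2. Is there any metric that describes frequency?
Q3. I used to compute the distribution of the differences between consecutive dates (mean, standard deviation). Is that a good approach? For example: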

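sam <- head(d[user == 'B'], 10)

# Parse the 'YYYYMMDD' strings into Date objects
sam[, timestamp := as.Date(timestamp, format = "%Y%m%d")]
# Previous visit date (requires the dplyr package to be installed)
sam[, lag_timestamp := dplyr::lag(timestamp)]
# Gap in days between each visit and the previous one
sam[, diff_prev_date := timestamp - lag_timestamp]
sam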

#     timestamp user       value lag_timestamp diff_prev_date
# 1: 2020-01-01    B -0.05698572          <NA>        NA days
# 2: 2020-01-01    B -0.16677841    2020-01-01         0 days
# 3: 2020-01-03    B  0.06953150    2020-01-01         2 days
# 4: 2020-01-04    B  0.29374589    2020-01-03         1 days
# 5: 2020-01-05    B  0.59508578    2020-01-04         1 days
# 6: 2020-01-06    B -0.16237362    2020-01-05         1 days
# 7: 2020-01-07    B -0.34246076    2020-01-06         1 days
# 8: 2020-01-07    B -0.04670312    2020-01-07         0 days
# 9: 2020-01-08    B  1.92830277    2020-01-07         1 days
# 10: 2020-01-08    B  2.04701468    2020-01-08         0 days

plot(density(as.numeric(sam$diff_prev_date), na.rm = TRUE), main = "")
mean(sam$diff_prev_date, na.rm = TRUE)       # Time difference of 0.7777778 days
sd(sam$diff_prev_date, na.rm = TRUE)         # 0.6666667
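
For comparison, here is a minimal sketch of how the same gap idea could be computed per user over a fixed window, together with two simple counts. The table `visits`, the window bounds, and the column names `n_visits`, `active_days`, `active_share` and `mean_gap_days` are all made up for illustration; this is one possible set of frequency indicators, not an established definition of a heavy user.

```r
library(data.table)

# Hypothetical toy log: one row per visit (not the data from the question)
visits <- data.table(
  user = c("A", "A", "A", "B", "B", "B", "B"),
  timestamp = as.Date(c("2020-01-01", "2020-01-05", "2020-01-08",
                        "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"))
)

# Assumed observation window (inclusive on both ends)
window_start <- as.Date("2020-01-01")
window_end   <- as.Date("2020-01-08")
n_days <- as.integer(window_end - window_start) + 1L

freq <- visits[
  timestamp >= window_start & timestamp <= window_end,
  .(
    n_visits      = .N,                            # total visits in the window
    active_days   = uniqueN(timestamp),            # distinct days with a visit
    active_share  = uniqueN(timestamp) / n_days,   # share of window days with a visit
    # mean gap between consecutive distinct visit days;
    # NaN when a user has only one active day
    mean_gap_days = mean(as.numeric(diff(sort(unique(timestamp)))))
  ),
  by = user
]
print(freq)
```

Gaps are taken between distinct visit days here, so several visits on the same day do not pull the mean gap toward zero. Whether that is desirable depends on whether repeat visits within a day should count toward "heavy" usage.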