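I would like to know how to measure frequency.
By frequency I mean, for example, how often an online user visits an online channel.
So I want some kind of index that tells me whether a user is a heavy user.
Here is a sample dataset.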
library(data.table)
library(stringr)  # for str_pad

d <- data.table(
  timestamp = paste0('202001', str_pad(rep(1:30, each = 3), width = 2, side = 'left', pad = '0')),
  user = sample(x = LETTERS[1:5], size = 90, replace = TRUE),
  value = rnorm(90)
)
head(d[user == 'B'], 10)
# timestamp user value
# 1: 2020-01-01 B -0.05698572
# 2: 2020-01-01 B -0.16677841
# 3: 2020-01-03 B 0.06953150
# 4: 2020-01-04 B 0.29374589
# 5: 2020-01-05 B 0.59508578
# 6: 2020-01-06 B -0.16237362
# 7: 2020-01-07 B -0.34246076
# 8: 2020-01-07 B -0.04670312
# 9: 2020-01-08 B 1.92830277
# 10: 2020-01-08 B 2.04701468
Q1. How can I show that user B is a heavy user between 20200101 and 20200108 (ignoring the value column)?
Q2. Is there any metric that describes frequency?
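Q3. So far I have been computing the distribution of the differences between consecutive dates (mean, standard deviation). Is that a good approach? For example: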
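sam <- head(d[user == 'B'], 10)
sam[, timestamp := as.Date(timestamp, format = "%Y%m%d")]  # parse yyyymmdd strings into Date
sam[, lag_timestamp := dplyr::lag(timestamp)]              # date of the previous row (previous visit)
sam[, diff_prev_date := timestamp - lag_timestamp]         # gap in days since the previous visit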
sam
# timestamp user value lag_timestamp diff_prev_date
# 1: 2020-01-01 B -0.05698572 <NA> NA days
# 2: 2020-01-01 B -0.16677841 2020-01-01 0 days
# 3: 2020-01-03 B 0.06953150 2020-01-01 2 days
# 4: 2020-01-04 B 0.29374589 2020-01-03 1 days
# 5: 2020-01-05 B 0.59508578 2020-01-04 1 days
# 6: 2020-01-06 B -0.16237362 2020-01-05 1 days
# 7: 2020-01-07 B -0.34246076 2020-01-06 1 days
# 8: 2020-01-07 B -0.04670312 2020-01-07 0 days
# 9: 2020-01-08 B 1.92830277 2020-01-07 1 days
# 10: 2020-01-08 B 2.04701468 2020-01-08 0 days
gaps <- as.numeric(sam$diff_prev_date)  # gaps in days, NA for the first row
plot(density(gaps, na.rm = TRUE), main = "")
mean(gaps, na.rm = TRUE) # 0.7777778 days
sd(gaps, na.rm = TRUE) # 0.6666667
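To make the kind of index I have in mind more concrete, here is a rough sketch of a per-user summary. It is only one possible set of metrics, not an established definition of a heavy user. The table `visits`, the variable `window_days`, and the column names `n_visits`, `active_days`, `active_ratio` and `mean_gap_days` are hypothetical names made up for this illustration, and the toy data is hand-written rather than taken from the random sample above.

```r
library(data.table)

# Hypothetical toy data: one row per visit (hand-written, not the sample above)
visits <- data.table(
  user = c("A", "B", "B", "B", "B", "C", "C"),
  date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-03",
                   "2020-01-04", "2020-01-02", "2020-01-08"))
)

# Length of the observation window in days, shared by all users (assumption)
window_days <- as.integer(max(visits$date) - min(visits$date)) + 1L

# One row per user with a few candidate frequency metrics
freq <- visits[, .(
  n_visits      = .N,                           # total number of visits
  active_days   = uniqueN(date),                # number of distinct days with a visit
  active_ratio  = uniqueN(date) / window_days,  # share of days in the window with a visit
  mean_gap_days = if (uniqueN(date) > 1) {
    mean(as.numeric(diff(sort(unique(date)))))  # mean gap between distinct active days
  } else {
    NA_real_                                    # undefined with a single active day
  }
), by = user]

print(freq)
```

Unlike the mean of raw date differences, `active_ratio` does not drop to zero just because a user visits several times on the same day, which is part of why I am unsure whether the gap distribution alone is a good index.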