Given the following query in Redshift:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC(0.02) WITHIN GROUP (ORDER BY join_time) over(partition by 
trunc(joinstart_ev_timestamp))/1000 as mini,
median(join_time) over(partition by trunc(joinstart_ev_timestamp))/1000 as jt,
product_name as product,
endpoint as endpoint
from qe_datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

I need to convert the query above to the equivalent Presto syntax. The Presto query I wrote is:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,
approx_percentile(cast(join_time as double),0.50) over (partition by 
cast(joinstart_ev_timestamp as date)) /1000 as jt,
product_name as product,
endpoint as endpoint
from datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

Everything else works, but this line raises an error:

PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
    over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,

What is the equivalent Presto syntax for it?

2 Answers

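I was doing some research on medians in Presto and found a solution that worked for me.

For example, I have a join table A_join_B with the columns A_id and B_id.

I want to find the median of the number of A's related to a single B: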

SELECT APPROX_PERCENTILE(count, 0.5) FROM (SELECT COUNT(*) AS count, B_id FROM A_join_B GROUP BY B_id) AS counts;
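
Applied to the table from the question, the same function could look roughly like the sketch below. This is an assumption-laden simplification rather than a drop-in replacement: it groups by day instead of using a window, leaves out the product, endpoint, and remaining filter columns, and swaps the exact PERCENTILE_DISC for Presto's approximate aggregate, so the 2% value may differ slightly from Redshift's result.

```sql
-- Sketch only: approx_percentile is approximate, not an exact discrete percentile.
-- Table and column names are taken from the question; the filters are shortened.
SELECT
  cast(joinstart_ev_timestamp AS date) AS session_date,
  approx_percentile(cast(join_time AS double), 0.02) / 1000 AS mini,
  approx_percentile(cast(join_time AS double), 0.50) / 1000 AS jt
FROM datawarehouse.join_session_fact
WHERE cast(joinstart_ev_timestamp AS date)
      BETWEEN date '2018-01-18' AND date '2018-01-30'
  AND join_time > 0 AND join_time <= 600000
GROUP BY cast(joinstart_ev_timestamp AS date)
```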

answered 2020-04-23T19:08:12.937

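If Presto supported nested window functions, you could use NTH_VALUE together with p*COUNT(*) OVER (PARTITION BY ...) to find the offset that corresponds to the p-th percentile within the window. Since Presto does not support this, you need to join against a subquery that counts the records in each window: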

SELECT
  my_table.window_column,
  /* Replace :p with the desired percentile (in your case, 0.02) */
  /* NTH_VALUE takes the value first and the offset second */
  NTH_VALUE(my_table.ordered_column, :p*subquery.records_in_window)
    OVER (PARTITION BY my_table.window_column ORDER BY my_table.ordered_column ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM my_table
JOIN (
  SELECT
    window_column,
    COUNT(*) AS records_in_window
  FROM my_table
  GROUP BY window_column
) subquery ON subquery.window_column = my_table.window_column

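The above is conceptually close but fails, because :p*subquery.records_in_window is a floating-point number and the offset must be an integer. You have a few options for handling this. For example, if you are finding the median, simply rounding to the nearest integer works. If you are finding the 2nd percentile, rounding will not work, because it will often give you 0 and offsets start at 1. In that case, rounding up to the next integer (taking the ceiling) is probably the better choice.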
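
A minimal sketch of the rounding-up variant, using the same placeholder names (my_table, window_column, ordered_column) as the query above and 0.02 standing in for :p. The output alias p02 is made up for illustration. It assumes every window has at least one row, so the ceiling is never below 1.

```sql
-- Sketch only: placeholder table and column names, 0.02 in place of :p.
SELECT
  my_table.window_column,
  NTH_VALUE(
    my_table.ordered_column,  -- value first, offset second: nth_value(x, offset)
    -- CEIL keeps the offset at 1 or above; the CAST turns it into an integer
    CAST(CEIL(0.02 * subquery.records_in_window) AS BIGINT)
  ) OVER (
    PARTITION BY my_table.window_column
    ORDER BY my_table.ordered_column
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS p02
FROM my_table
JOIN (
  SELECT
    window_column,
    COUNT(*) AS records_in_window
  FROM my_table
  GROUP BY window_column
) subquery ON subquery.window_column = my_table.window_column
```

Like the original query, this returns one row per input row, so a SELECT DISTINCT (as in the question) would be needed to collapse the result to one row per window.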

answered 2018-02-14T21:34:28.373