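I want to group my ranks into chunks. I thought about using a CASE statement, but that not only looks silly, it is also slow.

Any tips on how this could be improved?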

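Note that the chunks have different sizes: the top 100 ranks are listed individually, followed by chunks of 100, then chunks of 500, then one chunk of 5,000, and finally three chunks of 15K.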

select   
  transaction_code
  ,row_number() over (order by SALES_AMOUNT desc) as rank
  ,SALES_AMOUNT
  ,CASE 
    WHEN rank <=100 THEN to_varchar(rank)
    WHEN rank <=200 then '101-200'
    WHEN rank <=300 then '201-300'
    WHEN rank <=400 then '301-400'
    WHEN rank <=500 then '401-500'
    WHEN rank <=1000 then '501-1000'
    WHEN rank <=1500 then '1001-1500'
    WHEN rank <=2000 then '1501-2000'
    WHEN rank <=2500 then '2001-2500'
    WHEN rank <=3000 then '2501-3000'
    WHEN rank <=3500 then '3001-3500'
    WHEN rank <=4000 then '3501-4000'
    WHEN rank <=4500 then '4001-4500'
    WHEN rank <=5000 then '4501-5000'
    WHEN rank <=5500 then '5001-5500'
    WHEN rank <=6000 then '5501-6000'
    WHEN rank <=6500 then '6001-6500'
    WHEN rank <=7000 then '6501-7000'
    WHEN rank <=7500 then '7001-7500'
    WHEN rank <=8000 then '7501-8000'
    WHEN rank <=8500 then '8001-8500'
    WHEN rank <=9000 then '8501-9000'
    WHEN rank <=9500 then '9001-9500'
    WHEN rank <=10000 then '9501-10000'
    WHEN rank <=15000 then '10001-15000'
    WHEN rank <=30000 then '15001-30000'
    WHEN rank <=45000 then '30001-45000'
    WHEN rank <=60000 then '45001-60000'
    ELSE 'Bottom'
   END AS "TRANSACTION GROUPS"

1 Answer

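The fastest approach is to create a lookup table that maps each rank to its group name. You could do this with a stateful JavaScript UDF (one that initializes the map only once).

But you can also do it in plain SQL.

Table definition

A simple mapping from a number to a string: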

create or replace table rank2group(rank integer, grp string);

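UDF to generate group names

Your code is indeed quite long.

Instead, we can create a function that, for a given rank, group_size and group_base (the number after which groups of size group_size start), generates the group name as a string.

Note that this function will be slower than your code, because it builds a string from its inputs. We only use it to populate the lookup table, though, so that is fine.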

create or replace function group_name(rank integer, group_base integer, group_size integer)
returns varchar
as $$
  (group_base + 1 + group_size * floor((rank - 1 - group_base) / group_size))
  || '-' || 
  (group_base + group_size + group_size * floor((rank - 1 - group_base) / group_size))
$$;

Example output:

select group_name(101, 100, 100), group_name(1678, 500, 500), group_name(15000, 10000, 5000);
---------------------------+----------------------------+--------------------------------+
 GROUP_NAME(101, 100, 100) | GROUP_NAME(1678, 500, 500) | GROUP_NAME(15000, 10000, 5000) |
---------------------------+----------------------------+--------------------------------+
 101-200                   | 1501-2000                  | 10001-15000                    |
---------------------------+----------------------------+--------------------------------+
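
As a quick sanity check on the arithmetic, you can probe the boundary between two adjacent groups. This is only a sketch that reuses the group_name function defined above; the expected values in the comments are derived from the formula, not copied from a recorded run:

```sql
-- Last rank of one group and first rank of the next (groups of 100 starting after 100).
select group_name(200, 100, 100) as last_of_group,   -- formula gives '101-200'
       group_name(201, 100, 100) as first_of_next;   -- formula gives '201-300'
```

If the two values come back with the same label, the "- 1" offset inside floor() is probably missing.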

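Generating the table data

We will generate the values 1 .. 60000 with a Snowflake generator, and use group_name together with a simplified version of your CASE statement to map them to group names:

create or replace table rank2group(rank integer, grp string);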

insert into rank2group
select rank,
CASE 
    WHEN rank <=100 THEN to_varchar(rank)
    -- groups of size 100, starting at 100
    WHEN rank <=500 then group_name(rank, 100, 100)                                                    
    WHEN rank <=10000 then group_name(rank, 500, 500)
    -- groups of size 5000, starting at 10000
    WHEN rank <=15000 then group_name(rank, 10000, 5000) 
    WHEN rank <=60000 then group_name(rank, 15000, 15000)
    ELSE 'Bottom'
END AS "TRANSACTION GROUPS"
from (
    select row_number() over (order by 1) as rank
    from table(generator(rowCount=>60000))
);
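
Before relying on the lookup table, it may be worth checking that every range label matches the ranks stored under it. The query below is a minimal sketch against the rank2group table created above; the column aliases are my own choice:

```sql
-- For each range group, the first and last rank should equal the two numbers in the label.
-- Ranks 1..100 are skipped because each of them is its own group.
select grp,
       min(rank) as first_rank,
       max(rank) as last_rank,
       count(*)  as ranks_in_group
from rank2group
where rank > 100
group by grp
order by first_rank;
```

No 'Bottom' row should appear here, since the table only covers ranks up to 60000.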

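Usage

To use it, we simply join on rank. Note that you need an outer join followed by ifnull to produce the Bottom values. For example, using a generated input of rapidly growing numbers (1 + n^3):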

with input as (
  select 1 + (seq8() * seq8() * seq8()) AS rank
  from table(generator(rowCount=>50))
)
select input.rank, ifnull(grp, 'Bottom') grp
from input left outer join rank2group on input.rank = rank2group.rank
order by input.rank;
--------+-------------+
  RANK  |     GRP     |
--------+-------------+
 1      | 1           |
 2      | 2           |
 9      | 9           |
 28     | 28          |
 65     | 65          |
 126    | 101-200     |
 217    | 201-300     |
 344    | 301-400     |
 513    | 501-1000    |
 730    | 501-1000    |
 1001   | 1001-1500   |
 1332   | 1001-1500   |
 1729   | 1501-2000   |
 2198   | 2001-2500   |
 2745   | 2501-3000   |
 3376   | 3001-3500   |
 4097   | 4001-4500   |
 4914   | 4501-5000   |
 5833   | 5501-6000   |
 6860   | 6501-7000   |
 8001   | 8001-8500   |
 9262   | 9001-9500   |
 10649  | 10001-15000 |
 12168  | 10001-15000 |
 13825  | 10001-15000 |
 15626  | 15001-30000 |
 17577  | 15001-30000 |
 19684  | 15001-30000 |
 21953  | 15001-30000 |
 24390  | 15001-30000 |
 27001  | 15001-30000 |
 29792  | 15001-30000 |
 32769  | 30001-45000 |
 35938  | 30001-45000 |
 39305  | 30001-45000 |
 42876  | 30001-45000 |
 46657  | 45001-60000 |
 50654  | 45001-60000 |
 54873  | 45001-60000 |
 59320  | 45001-60000 |
 64001  | Bottom      |
 68922  | Bottom      |
 74089  | Bottom      |
 79508  | Bottom      |
 85185  | Bottom      |
 91126  | Bottom      |
 97337  | Bottom      |
 103824 | Bottom      |
 110593 | Bottom      |
 117650 | Bottom      |
--------+-------------+
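
Applied to the query from the question, the same join might look like the sketch below. The question does not show its FROM clause, so the table name transactions is purely a placeholder; everything else comes from the question and the lookup table above:

```sql
with ranked as (
  select transaction_code,
         SALES_AMOUNT,
         row_number() over (order by SALES_AMOUNT desc) as rank
  from transactions  -- hypothetical table name, not given in the question
)
select ranked.transaction_code,
       ranked.rank,
       ranked.SALES_AMOUNT,
       -- ranks above 60000 have no row in rank2group, so they fall back to 'Bottom'
       ifnull(rank2group.grp, 'Bottom') as "TRANSACTION GROUPS"
from ranked
left outer join rank2group on ranked.rank = rank2group.rank
order by ranked.rank;
```

Computing the rank in a CTE first keeps the join condition simple and avoids referencing the window function alias inside the same SELECT.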

Possible optimization

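If your range boundaries are always multiples of 100, you can make the table 100x smaller by storing only the values ending in 00, and then joining on the rank rounded up to the next multiple of 100, e.g. CEIL(rank / 100) * 100.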

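You would still need to handle the values 1..100 separately after the join, e.g. IFNULL(grp, IFF(rank <= 100, rank::varchar, 'Bottom')). This works only if the reduced table has no row for the key 100; otherwise test rank <= 100 before looking at grp.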
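
A rough sketch of that reduced table, under a few assumptions of mine: the table name rank2group_100 and the column rank_key are made up, the row for 100 is deliberately left out so that ranks 1..100 fall through to the IFF branch, and the join key is the rank rounded up to the next multiple of 100:

```sql
-- Hypothetical compact lookup: one row per multiple of 100 above 100.
create or replace table rank2group_100 as
select rank as rank_key, grp
from rank2group
where rank > 100 and rank % 100 = 0;

-- A few hand-picked ranks to exercise each branch.
with input as (
  select column1 as rank
  from (values (1), (100), (101), (200), (513), (59320), (64001))
)
select input.rank,
       ifnull(grp, iff(input.rank <= 100, input.rank::varchar, 'Bottom')) as grp
from input
left outer join rank2group_100
  on ceil(input.rank / 100) * 100 = rank2group_100.rank_key
order by input.rank;
```

Whether the smaller table is actually faster than the full 60000-row one depends on your data volume, so it is worth measuring both.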

Answered 2018-06-06T00:14:55.990