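The basic premise of this question is that I have some huge tables in Hadoop and I need to take some samples from each month. I've mocked up the example below to show the kind of thing I'm after, but obviously it isn't the real data...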

--Create the table
CREATE TABLE exp_dqss_team.testranking (
  Name STRING,
  Age INT,
  Favourite_Cheese STRING
  ) STORED AS PARQUET;

--Put some data in
INSERT INTO TABLE exp_dqss_team.testranking
VALUES (
  ('Tim', 33, 'Cheddar'),
  ('Martin', 49, 'Gorgonzola'),
  ('Will', 39, 'Brie'),
  ('Bob', 63, 'Cheddar'),
  ('Bill', 35, 'Brie'),
  ('Ben', 42, 'Gorgonzola'),
  ('Duncan', 55, 'Brie'),
  ('Dudley', 28, 'Cheddar'),
  ('Edmund', 27, 'Brie'),
  ('Baldrick', 29, 'Gorgonzola'));

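What I want is the two youngest people in each cheese category. The query below gives me the age rank within each cheese category, but does not restrict the result to the top two: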

SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age
FROM exp_dqss_team.testranking;

If I add a WHERE clause, it gives me the following error:

WHERE clause must not contain analytic expressions

SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age
FROM exp_dqss_team.testranking
WHERE RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) <3;

Is there a better way to do this than creating a table that holds all the rankings and then selecting from it with a WHERE clause on the rank?

1 Answer

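Can you try this? It computes the rank in a subquery and filters on the alias in the outer query, so the analytic expression never appears in the WHERE clause: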

select * from (
SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age
FROM exp_dqss_team.testranking
) as temp
where rank_my_cheese <= 2;
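
If you want to sanity-check the subquery approach without a cluster, here is a minimal sketch that runs the same query shape against the sample rows from the question. It uses Python's built-in sqlite3 module purely as a local stand-in for Impala/Hive (my assumption, not part of the original setup; it needs SQLite 3.25 or later for window functions), and the table is created without the exp_dqss_team schema prefix or PARQUET storage.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for the Impala table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE testranking (name TEXT, age INTEGER, favourite_cheese TEXT)"
)

# Same sample rows as in the question.
rows = [
    ("Tim", 33, "Cheddar"),
    ("Martin", 49, "Gorgonzola"),
    ("Will", 39, "Brie"),
    ("Bob", 63, "Cheddar"),
    ("Bill", 35, "Brie"),
    ("Ben", 42, "Gorgonzola"),
    ("Duncan", 55, "Brie"),
    ("Dudley", 28, "Cheddar"),
    ("Edmund", 27, "Brie"),
    ("Baldrick", 29, "Gorgonzola"),
]
conn.executemany("INSERT INTO testranking VALUES (?, ?, ?)", rows)

# Rank inside a subquery, then filter on the alias in the outer query.
QUERY = """
SELECT favourite_cheese, name, age, rank_my_cheese
FROM (
    SELECT RANK() OVER (PARTITION BY favourite_cheese ORDER BY age ASC)
               AS rank_my_cheese,
           favourite_cheese, name, age
    FROM testranking
) AS temp
WHERE rank_my_cheese <= 2
ORDER BY favourite_cheese, rank_my_cheese
"""

youngest_two = conn.execute(QUERY).fetchall()
for row in youngest_two:
    print(row)
```

One thing to keep in mind: RANK() assigns the same rank to tied ages, so a partition can return more than two rows when ages tie. If you need at most two rows per cheese, ROW_NUMBER() in place of RANK() would do that, at the cost of breaking ties arbitrarily.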
answered 2017-03-06T08:55:35.040