I am learning Hive and wanted to write an optimized HiveQL/SQL query
My table looks like this:
CREATE TABLE sales (dealer VARCHAR(25), make VARCHAR(25), type VARCHAR(25), day INT);
INSERT INTO sales (dealer, make, type, day) VALUES
("Xyz", "Highlander", "SUV", "0"),
("Xyz", "Prius", "HATCH", "1"),
("Xyz", "Prius", "HATCH", "2"),
("Xyz", "Prius", "HATCH", "3"),
("Xyz", "Versa", "HATCH", "1"),
("Xyz", "Versa", "HATCH", "2"),
("Xyz", "Versa", "HATCH", "3"),
("Xyz", "S3", "SEDAN", "1"),
("Xyz", "S3", "SEDAN", "2"),
("Abc", "Forrester", "SUV", "1");
Given a "dealer" D, I want to compute the top N "make" for each "type" in the past X days, in a single query.
SELECT dealer, make, type, COUNT(*) AS frequency FROM sales
WHERE day > 0 AND dealer LIKE 'Xyz' GROUP BY make, type
ORDER BY frequency DESC LIMIT 5
The problem is when using GROUP BY on "make" and "type" for top 1, I will only get:
DEALER, MAKE, TYPE, COUNT
Xyz, Prius, Hatch, 3
Xyz, Versa, Hatch, 3
Xyz, S3, Sedan, 2
...
But I want
Xyz, Prius, Hatch, 3
Xyz, S3, Sedan, 2
...
for EACH "type" the top N.
Could someone help me understand how to write such a query?
SQL Fiddle http://sqlfiddle.com/#!2/df9304/5
****Update****
Seems like rank() would be useful:
Hive getting top n records in group by query