I'm trying to run a Hive query that produces a table of domain, key, value, and count, grouped by each unique domain/key/value combination.

Sample data:

http://www.aaa.com/path?key_a=5&key_b=hello&key_c=today&key_d=blue
http://www.aaa.com/path?key_a=5&key_b=goodb&key_c=yestr&key_d=blue
http://www.bbb.com/path?key_a=5&key_b=hello&key_c=today&key_d=blue
http://www.bbb.com/path?key_a=5&key_b=goodb&key_c=ystrd

Desired output:

aaa.com | key_a | 5 | 2
aaa.com | key_b | hello | 1
aaa.com | key_b | goodb | 1
aaa.com | key_c | today | 1
aaa.com | key_c | yestr | 1
aaa.com | key_d | blue | 2
bbb.com | key_a | 5 | 2
bbb.com | key_b | hello | 1
bbb.com | key_b | goodb | 1
bbb.com | key_c | today | 1
bbb.com | key_c | ystrd | 1
bbb.com | key_d | blue | 1

Here is what I have been using:

"select parse_url(url,'HOST'), str_to_map(parse_url(url,'QUERY'),'&','='), count(1) from url_table group by select parse_url(url,'HOST'), str_to_map(parse_url(url,'QUERY'),'&','=') limit 10;"

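Where am I going wrong? Specifically, I think the part I messed up is str_to_map(parse_url(url,'QUERY'),'&','='), because I don't know how to split the query string into multiple key-value pairs and then group on them correctly.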

3 Answers

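You can do this with LATERAL VIEW explode, which turns each entry of the map returned by str_to_map into its own row.

This should work: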

hive> select parse_url(url,'HOST') as host, v.key as key, v.val,
count(*) as count from url u LATERAL VIEW
explode(str_to_map(parse_url(url,'QUERY'),'&','=')) v as key, val
group by parse_url(url, 'HOST'), v.key, v.val;
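
To see what the explode-then-group step computes, here is a minimal plain-Python sketch of the same logic run on the sample URLs from the question. It is not Hive: the `explode_query` helper and the `urls` and `counts` names are made up for illustration. The manual split on '&' and '=' only approximates str_to_map, and a parameter with no '=' gets an empty string here, which may differ from Hive.

```python
from collections import Counter
from urllib.parse import urlsplit

# Sample rows from the question.
urls = [
    "http://www.aaa.com/path?key_a=5&key_b=hello&key_c=today&key_d=blue",
    "http://www.aaa.com/path?key_a=5&key_b=goodb&key_c=yestr&key_d=blue",
    "http://www.bbb.com/path?key_a=5&key_b=hello&key_c=today&key_d=blue",
    "http://www.bbb.com/path?key_a=5&key_b=goodb&key_c=ystrd",
]


def explode_query(url):
    """Yield one (host, key, value) row per query parameter (hypothetical helper)."""
    parts = urlsplit(url)
    if not parts.query:
        return  # no query string: contributes no rows
    # Approximates str_to_map(query, '&', '='): pairs split on '&', key/value on '='.
    pairs = {}
    for pair in parts.query.split("&"):
        key, _, value = pair.partition("=")
        pairs[key] = value
    # Approximates LATERAL VIEW explode(...): one output row per map entry.
    for key, value in pairs.items():
        yield parts.hostname, key, value


# Approximates GROUP BY host, key, val with COUNT(*).
counts = Counter(row for url in urls for row in explode_query(url))

for (host, key, value), n in sorted(counts.items()):
    print(f"{host} | {key} | {value} | {n}")
```

The sketch yields 12 distinct host/key/value groups with the same counts as the desired output. The host comes out as www.aaa.com rather than aaa.com, and parse_url(url,'HOST') likewise returns the full host name, so getting the bare domain would need an extra step.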
answered 2013-10-02T20:08:13.800

I have verified that the query below works:

SELECT
  parse_url(url, 'HOST') AS host,
  q.key AS key,
  q.val AS val,
  COUNT(*)
FROM <your_table_with_url_as_a_field>
LATERAL VIEW explode(str_to_map(parse_url(url,'QUERY'),'&','=')) q AS key, val
WHERE parse_url(url,'QUERY') IS NOT NULL
GROUP BY parse_url(url, 'HOST'), q.key, q.val
ORDER BY host, key, val;
answered 2014-12-18T17:34:08.180

Using parse_url_tuple

  1. The src table contains the complete URL
  2. Then apply GROUP BY on host, path, query

Something like this will solve your query:

SELECT
  count(*), host, path, query
FROM (
  SELECT b.*
  FROM src
  LATERAL VIEW parse_url_tuple(completeurl, 'HOST',
           'PATH', 'QUERY', 'QUERY:id') b AS host, path, query, query_id
     ) t  -- alias added: Hive expects a name for a subquery in FROM
GROUP BY host, path, query;

For more details on parse_url_tuple, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-parse_url_tuple

answered 2019-11-14T06:18:30.583