
I am very new to HiveQL and I am a bit stuck :S

I have a table with the following schema: one column named res, and three partitions under a partition column named field.

create table results( res string) PARTITIONED BY (field STRING); 

Then I load data into this table:

insert overwrite table results PARTITION (field= 'title') SELECT  explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT  explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT  explode(line) AS myNewCol FROM albums;

I am trying to count the unique tuples across the three partitions.

For example, this command counts how many times each title occurs in the dataset.

 SELECT res, count(1) AS counttotal   FROM results where field='title' GROUP BY res ORDER BY counttotal;

It outputs something like

 title                                count        
 Hit me Baby More time                   9

How can I extend this to tuples (title, album, artist)? That is, what if I want output like this:

title                            album                 artist       count

Baby one more time    hit me baby one more time    britney spears    9

My whole code:

CREATE EXTERNAL TABLE IF NOT EXISTS hivetesttable  (
xmldata STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/user/sdasd/hivetestdata/';

create view xmlout(line) as  select * from hivetesttable;  

CREATE VIEW TITLES(line) as select xpath(line,'/MC/SC/*/@ttl')  from xmlout;
CREATE VIEW ARTIST(line) as select  xpath(line,'/MC/SC/*/@art')  from xmlout;
CREATE VIEW ALBUMS( line) as select   xpath(line,'/MC/SC/*/@alb') from xmlout;



create table results( res string) PARTITIONED BY (field STRING); 
insert overwrite table results PARTITION (field= 'title') SELECT  explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT  explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT  explode(line) AS myNewCol FROM albums;

SELECT res, count(1) AS counttotal   FROM results where field='title' GROUP BY res ORDER BY counttotal;

One line of the XML data looks like

<?xml version="1.0" encoding="UTF-8"?><MC><SC><S uid="2" gen="" yr="2011" art="Samsung" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Over the horizon"/><S uid="37" gen="" yr="2010" art="Jason Derulo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="Whatcha Say"/><S uid="38" gen="" yr="2010" art="Jason Derulo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="In My Head"/><S uid="39" gen="" yr="2011" art="Alexandra Stan" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Alexandra Stan/Mr_ Saxobeat - Single" alb="Mr. Saxobeat - Single" ttl="Mr. Saxobeat (Extended Version)"/><S uid="40" gen="" yr="2011" art="Bushido" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Wie ein Löwe"/><S uid="41" gen="" yr="2011" art="Bushido" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Verreckt"/><S uid="42" gen="" yr="2011" art="Lucenzo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Lucenzo/Danza Kuduro (feat_ Don Omar) [From _Fast &amp; Furious 5_] - Single" alb="Danza Kuduro (feat. Don Omar) [From &quot;Fast &amp; Furious 5&quot;] - Single" ttl="Danza Kuduro (feat. Don Omar) [From &quot;Fast &amp; Furious 5&quot;]"/><S uid="121" gen="" yr="701" art="Michael Jackson" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/external_sd/Music/Michael Jackson/Bad [Bonus Tracks]" alb="Bad [Bonus Tracks]" ttl="Voice-Over Intro/Quincy Jones Interview #1 [*]"/></SC><PC/></MC>

1 Answer

From the information you've provided, the output you want isn't possible. Right now you have a table that looks like this:

res                           field
---                           -----
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
the distance                  title
the distance                  title
open book                     title
daria                         title
fashion nugget                album
fashion nugget                album
fashion nugget                album
fashion nugget                album
cake                          artist
cake                          artist
cake                          artist
cake                          artist

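Because you partitioned it, Hive happens to store it in three different folders, but that doesn't affect the results of a query. I added some extra tracks, and I think the output you want with those extra tracks is (correct me if I'm wrong):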

title                  album                       artist              count
baby one more time     hit me baby one more time   britney spears      9
the distance           fashion nugget              cake                2
open book              fashion nugget              cake                1
daria                  fashion nugget              cake                1

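But there is no way to tell that "open book" has anything to do with "fashion nugget" or "cake", just as there is no way to tell that "baby one more time" is related to "britney spears". You could try matching on the counts, but you would end up with something like this: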

title                  album                       artist              count
baby one more time     hit me baby one more time   britney spears      9
null                   fashion nugget              cake                4
the distance           null                        null                2
open book,daria        null                        null                1

I think you want a table like this

title                  album                         artist
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
the distance           fashion nugget                cake
the distance           fashion nugget                cake
open book              fashion nugget                cake
daria                  fashion nugget                cake

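but possibly still partitioned by artist and/or album. With or without partitions, you write queries as if the table were not partitioned (as long as the data isn't corrupted, partitioning won't affect the results, only the performance). It does, however, affect how you create and populate the table. Let me know if this is what you want and I'll edit this answer to address that question.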


Promised edit:

Okay, creating the table without any partitions is simple:

CREATE TABLE results (title string, album string, artist string);

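Creating the table with partitions is almost as easy; you just need to decide first what to partition on. If you partition on artist, you can run queries specific to a single artist or a set of artists without processing any other artist's data. If you partition on artist and album, you can do the same for albums as well. This does come at the cost of breaking large files into smaller ones, and in general MapReduce (and therefore Hive) works better with large files. I wouldn't worry about partitions at all unless you are dealing with at least tens of GB and feel you have a handle on how partitioning works and on HiveQL in general. But for completeness, partitioned by artist: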

CREATE TABLE results (title string, album string) PARTITIONED BY (artist string);

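And partitioned by artist, then by album. Partitioning by (artist string, album string) versus (album string, artist string) won't change your results, but you should put the logical top of the hierarchy first.

CREATE TABLE results (title string) PARTITIONED BY (artist string, album string);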
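As a rough sketch of what partitioning buys you, assuming results is partitioned by artist as in either statement above: a query that filters on the partition column only has to read the matching partition directory. The alias track_count is just an illustrative name.

```sql
-- Assumes results is partitioned by artist (either partitioned layout above).
-- Filtering on the partition column lets Hive skip every other artist's directory.
SELECT title, album, COUNT(*) AS track_count
FROM results
WHERE artist = 'cake'
GROUP BY title, album;
```

The same query also runs against the unpartitioned table; it just has to scan all the data to do so.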

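Populating this table would not be easy if the only information we had access to came from the tables titles, artists, and albums, because we would have a giant list of titles, artists and albums but no way to tell which title goes with which album. I hope you have some data where these relationships are still intact, or that your original dataset is still intact. Without knowing the form of this hypothetical data, I can't say how to populate your table. However, if you have a partitioned table and don't want to specify every artist and album by hand (since each artist gets its own partition, and within it each album gets its own partition), this answer may be useful to you.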
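One common way to avoid naming every partition by hand is a dynamic partition insert. A minimal sketch, assuming a hypothetical staging table tracks_staging(title, album, artist) in which the relationships are intact, and a results table partitioned by (artist, album):

```sql
-- Allow Hive to create partitions from the values found in the data.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- tracks_staging is a hypothetical table, not something from the question.
-- The partition columns must come last in the SELECT list, in the same
-- order as in the PARTITION clause.
INSERT OVERWRITE TABLE results PARTITION (artist, album)
SELECT title, artist, album
FROM tracks_staging;
```

With many distinct artists and albums you may run into Hive's limits on the number of dynamic partitions (for example hive.exec.max.dynamic.partitions), which is one more reason not to partition a small dataset this finely.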

Edit: the asker has XML files that contain the full title, album, artist relationships. More on this in the comments.
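Since each S element in the XML carries its own ttl, alb and art attributes, one possible way to keep the tuples together is to explode the three attribute arrays with their positions and keep only the rows where the positions agree. This is a sketch rather than a tested solution: it assumes the unpartitioned results (title, album, artist) table, the xmlout view from the question, a Hive version that provides posexplode() (0.13 or later), and that every S element has all three attributes so the arrays line up.

```sql
-- Sketch only: assumes every <S> element carries ttl, alb and art, so the
-- three xpath() arrays have equal length and line up by position.
-- posexplode() requires Hive 0.13 or later.
INSERT OVERWRITE TABLE results
SELECT t.title, al.album, ar.artist
FROM xmlout
LATERAL VIEW posexplode(xpath(line, '/MC/SC/S/@ttl')) t  AS tpos, title
LATERAL VIEW posexplode(xpath(line, '/MC/SC/S/@alb')) al AS apos, album
LATERAL VIEW posexplode(xpath(line, '/MC/SC/S/@art')) ar AS rpos, artist
WHERE t.tpos = al.apos AND t.tpos = ar.rpos;
```

If some S elements are missing an attribute, the arrays would no longer align and this approach would pair the wrong values, so it is worth checking the data first.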

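Now to the heart of the question: counting the unique tuples. This is the same regardless of how the data is partitioned (if at all). We use the GROUP BY clause to do it. When you specify one column (or partition, which you can think of as a column with special properties), you break the data into groups, one for each distinct value of that column. If you specify several columns, you break the data into groups, one for each distinct combination of values of those columns. That is what we take advantage of to count the distinct tuples: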

SELECT title, album, artist, COUNT(*)
FROM results
GROUP BY title, album, artist;

And here we are:

title                  album                       artist              count
baby one more time     hit me baby one more time   britney spears      9
the distance           fashion nugget              cake                2
open book              fashion nugget              cake                1
daria                  fashion nugget              cake                1
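
If you also want the result sorted by the count, as in the query from the question, a small variation should do it. The alias counttotal is the one the question used; sorting in descending order is my assumption about what is wanted.

```sql
-- Same grouping as above, with the count aliased so it can be sorted on.
SELECT title, album, artist, COUNT(*) AS counttotal
FROM results
GROUP BY title, album, artist
ORDER BY counttotal DESC;
```

Keep in mind that ORDER BY in Hive produces a total ordering through a single reducer, which can be slow on large results.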
answered 2013-03-21T20:27:39.620