xml - Can Impala query XML files stored in Hadoop/HDFS?

Question

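I am investigating whether a Hadoop/Impala combination can meet my requirements for archiving, batch processing, and real-time ad hoc queries.

We will persist XML files (well-formed and conforming to our own XSD schemas) in Hadoop and use MapReduce for end-of-day batch queries and the like. For ad hoc user queries and application queries that need low latency and relatively high performance, we are considering Impala.

What I can't figure out is how Impala would understand the structure of the XML files so that it can query them efficiently. Can Impala be used to query across XML documents in a meaningful way?

Thanks in advance.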

score 3 · Accepted Answer

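Hive and Impala don't really have a mechanism for working with XML files (which is odd, considering the XML support in most databases).

That said, if I were faced with this problem, I would use Pig to import the data into HCatalog. At that point, it is fully usable by Hive and Impala.

Here is a quick and dirty example of using Pig to get some XML data into HCatalog: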

--rss.pig

REGISTER piggybank.jar

items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS  (item:chararray);

data = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS  link:chararray, 
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS  title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>',  1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS  pubdate:chararray;

STORE data into 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();


validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
dump validate;

--Results

(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)

--Impala query

select * from rss_items

--Impala results

    link    title   description pubdate
0   http://www.hannonhill.com/news/item1.html   News Item 1 Description of news item 1 here.    03 Jun 2003 09:39:21
1   http://www.hannonhill.com/news/item2.html   News Item 2 Description of news item 2 here.    30 May 2003 11:06:42
2   http://www.hannonhill.com/news/item3.html   News Item 3 Description of news item 3 here.    20 May 2003 08:56:02

--rss.txt data file

<rss version="2.0">
   <channel>
      <title>News</title>
      <link>http://www.hannonhill.com</link>
      <description>Hannon Hill News</description>
      <language>en-us</language>
      <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
      <generator>Cascade Server</generator>
      <webMaster>webmaster@hannonhill.com</webMaster>
      <item>
         <title>News Item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>Description of news item 1 here.</description>
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item1.html</guid>
      </item>
      <item>
         <title>News Item 2</title>
         <link>http://www.hannonhill.com/news/item2.html</link>
         <description>Description of news item 2 here.</description>
         <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item2.html</guid>
      </item>
      <item>
         <title>News Item 3</title>
         <link>http://www.hannonhill.com/news/item3.html</link>
         <description>Description of news item 3 here.</description>
         <pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item3.html</guid>
      </item>
   </channel>
</rss>
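
The regular expressions in rss.pig can be checked without a cluster. The following is a minimal Python sketch, not part of the original answer, that applies the same four patterns to one `<item>` element copied from rss.txt. It assumes that Pig's REGEX_EXTRACT behaves like an unanchored search that returns the requested capture group, which Python's `re.search` mirrors here; the names `ITEM`, `PATTERNS`, and `extract_fields` are made up for illustration.

```python
import re

# One <item> element, as XMLLoader('item') would pass it on (copied from rss.txt).
ITEM = """<item>
         <title>News Item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>Description of news item 1 here.</description>
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item1.html</guid>
      </item>"""

# Same patterns as in rss.pig; the doubled backslashes of the Pig string
# literals become single backslashes in Python raw strings.
PATTERNS = {
    "link": r"<link>(.*)</link>",
    "title": r"<title>(.*)</title>",
    "description": r"<description>(.*)</description>",
    "pubdate": r"<pubDate>.*(\d{2}\s[a-zA-Z]{3}\s\d{4}\s\d{2}:\d{2}:\d{2}).*</pubDate>",
}


def extract_fields(item_xml):
    """Return a dict of column name -> first capture group (None if no match)."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, item_xml)
        row[name] = match.group(1) if match else None
    return row


if __name__ == "__main__":
    print(extract_fields(ITEM))
```

Note that regex extraction of this kind does not decode XML entities, and because `.` does not match newlines it only works when each element's text sits on a single line, as it does in rss.txt.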

score 1 · Accepted Answer

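It looks like you are out of luck with Impala and XML for now. Impala uses the Hive metastore but does not support custom InputFormats and SerDes. You can see the formats it natively supports here.

You could use Hive instead, and newer versions (0.12+) should be faster.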

score 1 · Accepted Answer

Another approach is to quickly convert a bunch of XML to Avro and use the Avro files to back a table defined in Hive or Impala.

XMLSlurper can be used to parse records out of the XML files.
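
As a rough illustration of the record-extraction step, here is a sketch in Python using the standard library's `xml.etree.ElementTree` in place of the XMLSlurper parser named above. The sample XML is a shortened version of the RSS data shown earlier in this thread, and the function names and the `FIELDS` tuple are assumptions made for the example. It stops at flat records serialized as JSON lines; writing actual Avro files would need an Avro library and a schema, which are not shown.

```python
import json
import xml.etree.ElementTree as ET

# Shortened sample in the shape of the rss.txt data from this thread.
SAMPLE = """<rss version="2.0">
  <channel>
    <title>News</title>
    <item>
      <title>News Item 1</title>
      <link>http://www.hannonhill.com/news/item1.html</link>
      <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
    </item>
    <item>
      <title>News Item 2</title>
      <link>http://www.hannonhill.com/news/item2.html</link>
      <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
    </item>
  </channel>
</rss>"""

# Hypothetical choice of child elements to turn into columns.
FIELDS = ("title", "link", "pubDate")


def items_to_records(xml_text, record_tag="item", fields=FIELDS):
    """Parse the XML and return one flat dict per <record_tag> element."""
    root = ET.fromstring(xml_text)
    records = []
    for element in root.iter(record_tag):
        # findtext returns None when a child element is missing.
        records.append({name: element.findtext(name) for name in fields})
    return records


def to_json_lines(records):
    """Serialize the records one JSON object per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


if __name__ == "__main__":
    print(to_json_lines(items_to_records(SAMPLE)))
```

Once the XML is flattened into records like these, the remaining work is mapping each field to a column type in the table definition.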

score 0 · Accepted Answer

You can try the XML SerDe for Hive here.

Answered 2014-10-30T19:48:08.980
