I have Hadoop running on HDFS with Hive installed. I can import the Wikipedia dump http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 into HDFS with the following command:
$ hadoop jar out.jar edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText -input /home/wikimedia/input/enwiki-latest-pages-articles.xml -output /home/wikimedia/output/3
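I created a sample Hive table from a small portion of the converted data.
I can define the Hive table over the Wikipedia dump with the following statement: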
CREATE EXTERNAL TABLE wiki_page(page_title string, page_body string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/home/wikimedia/output/3';
It gives me records like the following:
Davy Jones (musician) Davy Jones (musician) David Thomas "Davy" Jones (30 December 1945 – 29 February 2012) was an English recording artist and actor, best known as a member of The Monkees. Early lifeDavy Jones was born at 20 Leamington Street, Openshaw, Manchester, England, on 30 December 1945. At age 11, he began his acting career…
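For what it's worth, here is a minimal Python sketch of how I understand a row like that to map onto the two columns. It assumes the converter writes one page per line as title, a tab, then the body, and that Hive drops extra delimited fields and fills missing ones with NULL. The function name `split_like_hive` and the abbreviated sample line are only for illustration; they are not part of Hive or Cloud9.

```python
def split_like_hive(line, num_columns=2, delimiter="\t"):
    """Approximate how a delimited text row maps onto table columns."""
    fields = line.rstrip("\n").split(delimiter)
    # Assumption: fields beyond the declared columns are ignored.
    fields = fields[:num_columns]
    # Assumption: missing trailing columns become NULL (None here).
    fields += [None] * (num_columns - len(fields))
    return fields


# Abbreviated version of the record shown above, with an explicit tab.
sample = (
    "Davy Jones (musician)\t"
    'Davy Jones (musician) David Thomas "Davy" Jones was an English '
    "recording artist and actor."
)

page_title, page_body = split_like_hive(sample)
print(page_title)
print(page_body[:40])
```

If that is right, each row only has the page title and the plain-text body, with no information about who edited the page.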
My overall goal is to find out how many contributors are from India and China. Any suggestions on how to achieve this?