
I am running Hadoop on HDFS with Hive installed. I was able to import the Wikipedia dump http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 into HDFS with the following command:

$ hadoop jar out.jar edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText -input /home/wikimedia/input/enwiki-latest-pages-articles.xml -output /home/wikimedia/output/3

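I created a sample Hive table from a small part of the converted data.

I can query the Wikipedia dump from Hive after creating the table with the following statement: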

CREATE EXTERNAL TABLE wiki_page(page_title string, page_body string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/home/wikimedia/output/3';

It gives me records like the following:

Davy Jones (musician) Davy Jones (musician)           David Thomas "Davy" Jones (30 December 1945 – 29 February 2012) was an English recording artist and actor, best known as a member of The Monkees. Early lifeDavy Jones was born at 20 Leamington Street, Openshaw, Manchester, England, on 30 December 1945. At age 11, he began his acting career…
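
To make the layout concrete, here is a minimal Python sketch of how I understand one output line maps onto the two columns. It assumes the title and the body are separated by a single tab, which is what the table definition expects. The sample line is shortened, and `split_record` is a hypothetical helper for illustration, not part of Cloud9 or Hive. It splits at the first tab only; how Hive itself treats any further tabs in the body may differ.

```python
# Shortened, hypothetical sample line: page_title <TAB> page_body.
# The tab position is an assumption based on the table definition.
sample_line = (
    "Davy Jones (musician)\t"
    'Davy Jones (musician) David Thomas "Davy" Jones was an English '
    "recording artist and actor."
)


def split_record(line):
    """Split one line into (page_title, page_body) at the first tab.

    If the line contains no tab, the whole line is returned as the title
    and the body is an empty string.
    """
    title, _, body = line.rstrip("\n").partition("\t")
    return title, body


page_title, page_body = split_record(sample_line)
print("title:", page_title)
print("body starts with:", page_body[:40])
```

Note that each record holds only the page title and the article text, so this layout carries no information about who edited the page.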

My overall goal is to find out how many contributors are from India and China. Any suggestions on how to achieve this?
