mysql - Importing from MySQL into Hive using Sqoop

Question

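I have to import more than 400 million rows from a MySQL table (with a composite primary key) into a PARTITIONED Hive table via Sqoop. The table holds two years of data, with a departure date column ranging from 20120605 to 20140605 and thousands of records per day. I need to partition the data by departure date.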

Versions:

Apache Hadoop - 1.0.4

Apache Hive - 0.9.0

Apache Sqoop - sqoop-1.4.2.bin__hadoop-1.0.0

As far as I know, there are 3 approaches:

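1. MySQL -> Non-partitioned Hive table -> INSERT from Non-partitioned Hive table into Partitioned Hive table
2. MySQL -> Partitioned Hive table
3. MySQL -> Non-partitioned Hive table -> ALTER Non-partitioned Hive table to add PARTITION

1. is the current, painful one that I am following.
2. I read that support for this was added in later (?) versions of Hive and Sqoop, but I could not find an example.
3. The syntax requires partitions to be specified as key-value pairs, which is not feasible with millions of records where one cannot enumerate every partition key-value pair.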

Can anyone provide input on approaches 2 and 3?

score 0 · Accepted Answer

I guess you can create a partitioned Hive table.

Then write the sqoop import command for it.

For example:

sqoop import --hive-overwrite --hive-drop-import-delims --warehouse-dir "/warehouse" --hive-table <hive table> \
  --connect jdbc<mysql path>/DATABASE=xxxx \
  --table <mysql table> --username xxxx --password xxxx --num-mappers 1 --hive-partition-key <partition key> --hive-partition-value <partition value> --hive-import \
  --fields-terminated-by ',' --lines-terminated-by '\n'

score 0 · Accepted Answer

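Before moving data into a partitioned table, you must first create the partitioned table structure. With Sqoop, you do not need to specify --hive-partition-key and --hive-partition-value; use --hcatalog-table instead of --hive-table.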
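A minimal sketch of what such an HCatalog import might look like, written as a dry run that only prints the command. The JDBC URL, user, and table names in the usage comment are hypothetical. As far as I know, the --hcatalog-* options only arrived in Sqoop 1.4.4, so the 1.4.2 release in the question would need upgrading first.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints a Sqoop import that writes through HCatalog.
# The partitioned Hive table is assumed to exist already; every value
# passed in below is a placeholder, not something from the original posts.
build_hcat_import() {
  # $1 JDBC URL, $2 DB user, $3 MySQL table, $4 Hive database, $5 Hive table
  cat <<EOF
sqoop import \\
  --connect '$1' \\
  --username '$2' -P \\
  --table '$3' \\
  --hcatalog-database '$4' \\
  --hcatalog-table '$5' \\
  --num-mappers 1
EOF
}

# Example with hypothetical names:
# build_hcat_import jdbc:mysql://dbhost/flightsdb etl_user flights default flights_part
```

No partition key or value appears on the command line; the partition column is expected to be filled from the matching column in the source rows.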

Manu

score 0 · Accepted Answer

If this is still something people want to understand, they can use

sqoop import --driver <driver name> --connect <connection url> --username <user name> -P --table employee  --num-mappers <numeral> --warehouse-dir <hdfs dir> --hive-import --hive-table table_name --hive-partition-key departure_date --hive-partition-value $departure_date
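Because --hive-partition-value takes a single value, one way to cover a date range is to issue one import per day. The sketch below only prints the command for a given day. The connection string, user, table names, and column list are hypothetical. The --columns list deliberately leaves out departure_date, since I believe Hive rejects a partition column that is also a regular column.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints one Sqoop import for a single departure date.
# Connection details, table names and the column list are assumptions.
build_day_import() {
  # $1 is the departure date, e.g. 20120605
  cat <<EOF
sqoop import \\
  --connect 'jdbc:mysql://dbhost/flightsdb' \\
  --username 'etl_user' -P \\
  --table flights \\
  --columns 'id,origin,destination' \\
  --where 'departure_date = $1' \\
  --num-mappers 1 \\
  --hive-import \\
  --hive-table flights_part \\
  --hive-partition-key departure_date \\
  --hive-partition-value '$1'
EOF
}

# Usage, collecting the commands into a script (GNU date assumed):
# day=20120605
# while [ "$day" -le 20140605 ]; do
#   build_day_import "$day" >> imports.sh
#   day=$(date -u -d "$day +1 day" +%Y%m%d)
# done
```

With two years of dates this launches one MapReduce job per day, so it is likely to be slow for 400 million rows.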

Notes from the patch:

sqoop import [all other normal command line options] --hive-partition-key ds --hive-partition-value "value"

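Some limitations:

It only allows one partition key/value
The partition key type is hardcoded to string
With auto partitioning in Hive 0.7, we may want to adjust this to take just one command line option for the key name and use that column from the db table to partition.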
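Until Sqoop does that by itself, Hive's dynamic partitioning can do the split after a plain import into a non-partitioned staging table. The following is a rough sketch that only prints the HiveQL. The table and column names in the usage comment are hypothetical, and the limit of 2000 is an assumed value chosen to exceed the roughly 730 daily partitions in the question.

```shell
#!/usr/bin/env bash
# Sketch: prints HiveQL that copies a staging table into a partitioned
# table using dynamic partitions. All names are supplied by the caller.
build_repartition_hql() {
  # $1 staging table, $2 partitioned table, $3 partition column,
  # $4 comma-separated list of the remaining columns
  cat <<EOF
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=2000;
INSERT OVERWRITE TABLE $2 PARTITION ($3)
SELECT $4, $3 FROM $1
DISTRIBUTE BY $3;
EOF
}

# Usage with hypothetical names:
# hive -e "$(build_repartition_hql flights_staging flights_part departure_date 'id, origin, destination')"
```

The partition column goes last in the SELECT because Hive maps dynamic partition values by position, and DISTRIBUTE BY is there so that rows for the same date end up on the same reducer.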
