Hadoop version: Hadoop 2.6.0-cdh5.12.2
Hive version: Hive 1.1.0-cdh5.12.2
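Consider two tables:
products - stores the product ID and other details about the product
activity - stores user_id and product_id, recording which user bought which product, along with other transaction details.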
Before creating these tables, I added the SerDe JAR with the following command:
add jar /home/ManojKumarM_R/json-serde-1.3-jar-with-dependencies.jar;
CREATE EXTERNAL TABLE IF NOT EXISTS products (
  id string, name string, reseller string, category string,
  price Double, discount Double, profit_percent Double)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION "/user/ManojKumarM_R/ProductsMergeEnrichOut";
Sample data in /user/ManojKumarM_R/ProductsMergeEnrichOut:
{"Id":"P101", "Name":"Round Tee", "Reseller":"Nike", "Category":"Top Wear", "Price":2195.03, "Discount":21.09, "Profit_percent" :23.47}
{"Id":"P102", "Name":"Half Shift", "Reseller":"Nike", "Category":"Top Wear", "Price":1563.84, "Discount":23.83, "Profit_percent" :17.12}
CREATE EXTERNAL TABLE IF NOT EXISTS activity (
  product_id string, user_id string,
  cancellation boolean, return boolean,
  cancellation_reason string, return_reason string,
  order_date timestamp, shipment_date timestamp, delivery_date timestamp,
  cancellation_date timestamp, return_date timestamp)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION "/user/ManojKumarM_R/ActivityMergeEnrichOut/";
Sample data in /user/ManojKumarM_R/ActivityMergeEnrichOut/:
{"Product_id":"P117", "User_id":"U148", "Cancellation":"TRUE", "Return":"NA", "Cancellation_reason":"Duplicate Product", "Return_reason":"NA", "Order_date":"2016-02-12", "Shipment_date":"NA", "Delivery_date":"NA", "Cancellation_date":"2018-05-20", "Return_date":"NA"}
{"Product_id":null, "User_id":"U189", "Cancellation":"FALSE", "Return":"FALSE", "Cancellation_reason":"NA", "Return_reason":"NA", "Order_date" :"2017-04-22", "Shipment_date":"2017-05-05", "Delivery_date":"2017-09-09", "Cancellation_date":"NA", "Return_date":"NA"}
The tables are created successfully, and the queries
select * from products;
and
select * from activity;
work fine, which shows that the SerDe JAR is picked up for plain select queries.
However, I want to join the two tables on their common column, the product ID. When I run the following join query:
SELECT a.user_id, p.category FROM activity a JOIN products p
ON(a.product_id = p.Id);
it fails with the following message:
Execution log at: /tmp/ManojKumarM_R/ManojKumarM_R_20181010124747_690490ae-e59f-4e9d-9159-5c6a6e28b951.log
2018-10-10 12:47:43 Starting to launch local task to process map join; maximum memory = 2058354688
Execution failed with exit status: 2
Obtaining error information
Task failed!
Task ID: Stage-5
Logs:
/tmp/ManojKumarM_R/ManojKumarM_R_20181010124747_690490ae-e59f-4e9d-9159-5c6a6e28b951.log
2018-10-10 12:47:43,984 ERROR [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInProcess(398)) - Hive Runtime Error: Map local work failed org.apache.hadoop.hive.ql.metadata.HiveException: Failed with exception java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDejava.lang.RuntimeException: java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDe at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:73)
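This suggests that Hive cannot find the JsonSerDe JAR, even though I added the JAR in the same Hive session and select queries work fine. If anyone has solved a similar problem, please let me know. I am not sure whether Hive looks for JARs in a different location during a JOIN operation.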
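
Since the log shows the failure inside the local map-join task (MapredLocalTask), one check that might narrow this down is to confirm the JAR is registered in the session and then rerun the join with automatic map-join conversion turned off. This is only a diagnostic sketch, and it assumes the ClassNotFoundException is specific to that local task; it is not a confirmed fix.

```sql
-- Confirm the SerDe JAR is registered in the current session.
LIST JARS;

-- Assumption: the ClassNotFoundException only occurs in the local
-- map-join task. Turning off automatic map-join conversion makes Hive
-- plan a regular shuffle join, which does not launch that local task.
SET hive.auto.convert.join=false;

-- Same join as above, rerun under the changed setting.
SELECT a.user_id, p.category
FROM activity a
JOIN products p ON (a.product_id = p.id);
```

If the join succeeds with this setting, that would point to the local task's classpath rather than the table definitions. In that case, making the JAR available through hive.aux.jars.path instead of a per-session add jar may be another option to look into.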