
I am trying out the JSON SerDe from the following link: http://code.google.com/p/hive-json-serde/wiki/GettingStarted

         CREATE TABLE my_table (field1 string, field2 int, 
                                     field3 string, field4 double)
         ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' ;

I added the JSON SerDe jar with

          ADD JAR /path-to/hive-json-serde.jar;

and loaded the data with

LOAD DATA LOCAL INPATH  '/home/hduser/pradi/Test.json' INTO TABLE my_table;

The data loaded successfully.

But when I query the data with

select * from my_table;

I get only one row from the table:

data1   100     more data1      123.001

Test.json contains

{"field1":"data1","field2":100,"field3":"more data1","field4":123.001} 

{"field1":"data2","field2":200,"field3":"more data2","field4":123.002} 

{"field1":"data3","field2":300,"field3":"more data3","field4":123.003} 

{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}

Where is the problem? Why do I get only one row instead of 4 when I query the table? The file in /user/hive/warehouse/my_table contains all 4 rows!


hive> add jar /home/hduser/pradeep/hive-json-serde-0.2.jar;
Added /home/hduser/pradeep/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradeep/hive-json-serde-0.2.jar

hive> CREATE EXTERNAL TABLE my_table (field1 string, field2 int,
>                                 field3 string, field4 double)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
> WITH SERDEPROPERTIES (
>   "field1"="$.field1",
>   "field2"="$.field2",
>   "field3"="$.field3",
>   "field4"="$.field4"
> );
OK
Time taken: 0.088 seconds

hive> LOAD DATA LOCAL INPATH  '/home/hduser/pradi/test.json' INTO TABLE my_table;
Copying data from file:/home/hduser/pradi/test.json
Copying file: file:/home/hduser/pradi/test.json
Loading data to table default.my_table
OK
Time taken: 0.426 seconds

hive> select * from my_table;
OK
data1   100     more data1      123.001
Time taken: 0.17 seconds

I have posted the contents of the test.json file above, so you can see that the query produces only one row:

data1   100     more data1      123.001

I then changed the JSON file to employee.json, which contains

{ "firstName":"Mike", "lastName":"Chepesky", "employeeNumber":1840192 }

and changed the table accordingly, but when I query it, it shows only NULL values:

hive> add jar /home/hduser/pradi/hive-json-serde-0.2.jar;
Added /home/hduser/pradi/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradi/hive-json-serde-0.2.jar

hive> create EXTERNAL table employees_json (firstName string, lastName string, employeeNumber int )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
OK
Time taken: 0.297 seconds


hive> load data local inpath '/home/hduser/pradi/employees.json' into table employees_json;
Copying data from file:/home/hduser/pradi/employees.json
Copying file: file:/home/hduser/pradi/employees.json
Loading data to table default.employees_json
OK
Time taken: 0.293 seconds


hive> select * from employees_json;
OK
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
Time taken: 0.194 seconds

4 Answers


It's kind of hard to tell what's going on without the logs (see the GettingStarted page). Just a quick thought - could you try whether it works like this, with WITH SERDEPROPERTIES:

CREATE EXTERNAL TABLE my_table (field1 string, field2 int, 
                                field3 string, field4 double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES (
  "field1"="$.field1",
  "field2"="$.field2",
  "field3"="$.field3",
  "field4"="$.field4" 
);
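
If the table is recreated this way, reloading the same file and querying again should show whether the explicit mapping helps. A minimal sketch, reusing the path from the question:

-- Recreate the table as above, then point it at the same data and query it again.
-- (Path reused from the question; adjust as needed.)
LOAD DATA LOCAL INPATH '/home/hduser/pradi/Test.json' INTO TABLE my_table;
SELECT * FROM my_table;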

You might also want to try the fork from ThinkBigAnalytics.

Update: it turned out the input in Test.json was not valid JSON, so the records got collapsed.

See the answer https://stackoverflow.com/a/11707993/396567 for details.
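
Since this SerDe expects one complete, valid JSON object per line, a quick sanity check (not from the original answer) is to compare the table's row count with the number of records in the file:

-- If the file really contains 4 records, this should return 4;
-- a smaller number suggests the records are not one-per-line valid JSON.
SELECT COUNT(*) FROM my_table;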

Answered 2013-02-05T12:43:41.813
  1. First, validate your JSON file at http://jsonlint.com/, then put one record per line and remove the [ ]. The trailing comma at the end of each line must also be removed.

    [{"field1":"data1","field2":100,"field3":"更多数据1","field4":123.001}, {"field1":"data2","field2":200,"field3" :"more data2","field4":123.002}, {"field1":"data3","field2":300,"field3":"more data3","field4":123.003}, {"field1":" data4","field2":400,"field3":"更多数据4","field4":123.004}]

  2. In my test I added hive-json-serde-0.2.jar from the Hadoop cluster; I think hive-json-serde-0.1.jar should also work.

    ADD JAR hive-json-serde-0.2.jar;

  3. Create your table

    CREATE TABLE my_table (field1 string, field2 int, field3 string, field4 double) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' ;

  4. Load your JSON data file; here I load it from the Hadoop cluster rather than from the local filesystem (the full sequence is sketched after these steps)

    LOAD DATA INPATH 'Test2.json' INTO TABLE my_table;

My test
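
Putting these steps together, a minimal end-to-end session might look like the following (the jar path and the HDFS location of Test2.json are placeholders, not taken from the answer):

    ADD JAR hive-json-serde-0.2.jar;

    CREATE TABLE my_table (field1 string, field2 int, field3 string, field4 double)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';

    -- Test2.json must contain one valid JSON object per line (no enclosing [ ])
    LOAD DATA INPATH 'Test2.json' INTO TABLE my_table;

    SELECT * FROM my_table;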

Answered 2017-03-28T22:34:37.017

For JSON parsing based on the cwiki/confluence docs, we need to follow a few steps:

  1. Download hive-hcatalog-core.jar

  2. hive> add jar /path/hive-hcatalog-core.jar

  3. create table tablename (colname1 datatype, .....) row format serde 'org.apache.hive.hcatalog.data.JsonSerDe' stored as ORCFILE;

  4. The column names in the table you create must match the key names in test.json, otherwise the values will show as NULL. A combined sketch of these steps follows below. Hope it helps.
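
A rough sketch of these steps, with a hypothetical table name and column names (they must match the keys in your JSON file). STORED AS TEXTFILE is used here instead of the answer's ORCFILE, since the JsonSerDe parses plain text lines of JSON:

    add jar /path/hive-hcatalog-core.jar;

    -- Column names are hypothetical; they must match the key names in test.json.
    CREATE TABLE json_example (
        field1 string,
        field2 int
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH '/path/test.json' INTO TABLE json_example;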

Answered 2018-02-21T09:26:35.783

I solved a similar problem -

  1. Downloaded the jar from - http://www.congiu.net/hive-json-serde/1.3.8/hdp23/json-serde-1.3.8-jar-with-dependencies.jar

  2. Ran the command in the Hive CLI - add jar /path/to/jar

  3. Created the table using -
create table messages (
    id int,
    creation_date string,
    text string,
    loggedInUser STRUCT<id:INT, name: STRING>
)
row format serde "org.openx.data.jsonserde.JsonSerDe";
  4. This is my JSON data -
{"id": 1,"creation_date": "2020-03-01","text": "I am on cotroller","loggedInUser":{"id":1,"name":"API"}}
{"id": 2,"creation_date": "2020-04-01","text": "I am on service","loggedInUser":{"id":1,"name":"API"}}
  5. Loaded data into the table using -
LOAD DATA LOCAL INPATH '${env:HOME}/path-to-json'
OVERWRITE INTO TABLE messages;
  6. select * from messages;
1   2020-03-01    I am on cotroller   {"id":1,"name":"API"}
2   2020-04-01    I am on service     {"id":1,"name":"API"}
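
As a follow-up, the nested STRUCT column defined above can be queried field by field with dot notation; a small example against the messages table:

-- Access the nested fields of the loggedInUser STRUCT with dot notation.
SELECT id, creation_date, loggedInUser.name AS user_name
FROM messages
WHERE loggedInUser.id = 1;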
Answered 2020-05-07T13:07:46.690