2

我正在尝试将深度嵌套的 JSON 数据加载到配置单元表中。让我告诉你们到目前为止我尝试了什么。

1- 我有 JSON 文件,它们像结构数组一样深度嵌套,又具有结构字段。

2- 我成功地将这个 json 数据加载到 Spark Data 框架中并且能够看到架构。此外,我使用来自 spark shell 的以下命令成功将此数据帧存储为配置单元表。

new org.apache.spark.sql.hive.HiveContext(sc).read.json(
"/user/alpha/test.json").saveAsTable("mywarehouse.patent_data_2001");

但是当我尝试任何查询时select * from patent_data_2001 limit 1

它给了我以下错误

FAILED: IllegalArgumentException Error: type expected at the position 4339 of 'array<struct<id:string>>:struct<claim:array<struc
.
.
.
.
.
,wi:string>,nb_file:string>>:struct<id:string,sequence_list:struct<carriers:string,file:string,seq_file_type:string>>' but 'stru
c' is found.

3-我尝试使用来自 spark shell 的以下命令使用 Hive serde 而不是 Spark SQL spllied jar

 hc.sql("SET spark.sql.hive.convertMetastoreParquet=false")

还是同样的错误。

它在 hive 仓库中创建表并加载数据,但是当我尝试查询表甚至描述表时它给了我错误。

4-假设此问题可能与 Spark SQL 和 Hive 兼容性有关。自从创建表以来,我就想到了修复模式,并且只有表模式有问题。我尝试手动修复架构,是的,这是一个漫长而真正耗时的过程,最终修复了架构。Create table statement我使用如下所示再次手动创建了配置单元表。

CREATE TABLE `patent_data_2001`(
  `abstract` array<struct<id:string>> COMMENT '', 
  `claims` struct<claim:array<struct<id:string,num:string>>,id:string> COMMENT '', 
  `country` string COMMENT '', 
  `date_produced` string COMMENT '', 
  `date_publ` string COMMENT '', 
  `description` string COMMENT '', 
  `drawings`      struct<figure:array<struct<id:string,img:struct<alt:string,file:string,he:string,id:string,img_content:string,img_format:string,orientation:string,wi:string>,num:string>>,id:string> COMMENT '', 
  `dtd_version` string COMMENT '', 
  `file` string COMMENT '', 
  `id` string COMMENT '', 
  `lang` string COMMENT '', 
  `status` string COMMENT '', 
  `table_external_doc` array<string> COMMENT '', 
  `us_bibliographic_data_grant` struct<application_reference:struct<appl_type:string,document_id:struct<country:string,`date`:string,doc_number:string>>,assignees:struct<assignee:array<struct<addressbook:struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string,orgname:string,role:string>,first_name:string,last_name:string,orgname:string,role:string>>>,classification_locarno:struct<edition:string,main_classification:string>,classification_national:array<struct<country:string,main_classification:string>>,classifications_cpc:struct<further_cpc:struct<classification_cpc:array<struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>>,combination_set:array<struct<combination_rank:array<struct<classification_cpc:struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>,rank_number:string>>,group_number:string>>>,main_cpc:struct<classification_cpc:struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>>>,classifications_ipcr:struct<classification_ipcr:array<struct<action_date:struct<`date`:string>,classification_data_source:string,classification_level:string,classification_status:string,classification_value:string,generating_office:struct<country:string>,ipc_version_indicator:struct<`date`:string>,main_group:string,section:string,subclass:string,subgroup:string,symbol_position:string>>>,examiners:struct<assistant_examiner:struct<first_name:string,last_name:string>,primary_examiner:struct<department:string,first_name:string,last_name:string>>,invention_title:string,number_of_claims:string,pct_or_regional_filing_data:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>,us_371c124_date:struct<`date`:string>,us_371c12_date:struct<`date`:string>>,pct_or_regional_publishing_data:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>>,priority_claims:struct<priority_claim:array<struct<country:string,`date`:string,doc_number:string,kind:string,sequence:string>>>,publication_reference:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>>,rule_47_flag:string,us_application_series_code:string,us_botanic:struct<latin_name:string,variety:string>,us_field_of_classification_search:struct<classification_national:array<struct<country:string,main_classification:string>>>,us_parties:struct<agents:struct<agent:array<struct<addressbook:array<struct<address:struct<country:string>,first_name:string,last_name:string,orgname:string>>,rep_type:string,sequence:string>>>,inventors:struct<inventor:array<struct<addressbook:array<struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string>>,designation:string,sequence:string>>>,us_applicants:struct<us_applicant:array<struct<addressbook:array<struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string,orgname:string>>,app_type:string,applicant_authority_category:string,designation:string,sequence:string>>>>,us_references_cited:struct<us_citation:array<struct<classification_national:array<struct<country:string,main_classification:string>>>>>,us_related_documents: struct < continuation: array< struct< relation: struct< child_doc: struct< document_id: struct < country: string,   `date`: string, doc_number: string > >, parent_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string>, parent_grant_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_pct_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_status: string>>>>, continuation_in_part: array< struct < relation: struct< child_doc: struct< document_id: struct< country: string,`date`:string,doc_number: string>>, parent_doc: struct< document_id: struct< country: string,`date`:string,doc_number: string>, parent_grant_document: struct< document_id: struct< country: string,`date`:string,doc_number: string >>, parent_pct_document: struct< document_id: struct < country: string,`date`:string, doc_number: string>>, parent_status: string >>>>, division: array< struct< relation: struct< child_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string >, parent_grant_document: struct< document_id: struct< country: string,`date`:string, doc_number: string >>, parent_pct_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_status: string >>>>, reissue: array< struct< relation: struct < child_doc: struct < document_id: struct < country: string,`date`:string, doc_number: string>>, parent_doc: struct< document_id: struct < country: string,`date`:string,doc_number: string > ,parent_grant_document: struct < document_id: struct < country: string,`date`:string, doc_number: string >>, parent_pct_document: struct< document_id: struct < country: string,`date`:string, doc_number: string >>, parent_status: string >>>>, related_publication: array< struct < document_id: array < struct< country: string,`date`:string,doc_number: string, kind: string>>>>, substitution: array< struct< relation: struct < child_doc: struct < document_id: struct < country: string,`date`:string, doc_number: string >>, parent_doc: struct< document_id: struct < country: string,`date`:string, doc_number: string >, parent_status: string >>>>, us_provisional_application: array< struct < document_id: struct < country: string,`date`:string, doc_number: string >>>, us_term_of_grant: struct < disclaimer: array < struct< text: string>>>>>, 
  `us_chemistry` array<struct<cdx_file:string,idref:string,mol_file:string>> COMMENT '', 
  `us_claim_statement` string COMMENT '', 
  `us_math` array<struct<idrefs:string,img:struct<alt:string,file:string,he:string,id:string,img_content:string,img_format:string,wi:string>,nb_file:string>> COMMENT '', 
  `us_sequence_list_doc` struct<id:string,sequence_list:struct<carriers:string,file:string,seq_file_type:string>> COMMENT '')
  ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'path'='hdfs://cluster-A-XYZ:8020/user/hive/warehouse/mywarehouse/patent_data_2001') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://cluster-A-XYZ:8020/user/hive/warehouse/mywarehouse/patent_data_2001';

但是我又遇到了同样的错误。

5-我尝试使用下面链接中列出的 Serde

https://github.com/rcongiu/Hive-JSON-Serde
https://github.com/proofpoint/hive-serde
http://www.congiu.net/hive-json-serde/1.3.6/

但一点运气都没有。

6-在研究时我发现这可能是这些 JIRA 中提到的问题

**ArrayIndexOutOfBounds exception for deeply nested structs**
https://issues.apache.org/jira/browse/HIVE-3253

**Support nested structs over 24 levels.**
https://issues.apache.org/jira/browse/HIVE-9500

我用下面的 serde 运行表创建

ROW FORMAT SERDE   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'hive.serialization.extend.nesting.levels'='true' )

但我得到了同样的错误,没有其他细节。

我无法弄清楚为什么会发生这些错误以及为什么 hive 在让我成功创建表时不让我查询表。

任何帮助或建议都会非常有帮助。请帮忙。

非常感谢。

4

0 回答 0