
I am new to processing JSON data in Hive. I am working on a Spark application that takes JSON data and stores it into a Hive table. I have JSON like this:

JSON nested within JSON

Expanded, it looks like this:

Hierarchy

I am able to read the JSON into a DataFrame and save it to a location on HDFS. The hard part is getting Hive to read that data.

For example, after searching online, I tried this:

Use STRUCT for all the JSON fields, and then access nested fields with column.element.

For example:

web_app_security would be the name of a column (of type STRUCT) in the table, and the JSON objects nested inside it, such as config_web_cms_authentication and web_threat_intel_alert_external, would also be structs (with rating and rating_numeric as fields).
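With a schema like that, nested struct fields can be reached with dotted paths in HiveQL. A hypothetical query against the table defined below (the column and field names come from the question's own DDL):

```sql
-- Read the top-level rating and a nested rating_numeric
-- via column.element dot notation.
SELECT web_app_security.rating,
       web_app_security.config_web_cms_authentication.rating_numeric
FROM jsons;
```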

I tried creating the table with a JSON SerDe. Here is my table definition:

CREATE EXTERNAL TABLE jsons (
web_app_security struct<config_web_cms_authentication: struct<rating: string, rating_numeric: float>, web_threat_intel_alert_external: struct<rating: string, rating_numeric: float>, web_http_security_headers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
dns_security struct<domain_hijacking_protection: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, dns_hosting_providers: struct<rating:string, rating_numeric: float>>,
email_security struct<rating: string, email_encryption_enabled: struct<rating: string, rating_numeric: float>, rating_numeric: float, email_hosting_providers: struct<rating: string, rating_numeric: float>, email_authentication: struct<rating: string, rating_numeric: float>>,
threat_intell struct<rating: string, threat_intel_alert_internal_3: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_1: struct<rating: string, rating_numeric: float>, rating_numeric: float,  threat_intel_alert_internal_12: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_6: struct<rating: string, rating_numeric: float>>,
data_loss struct<data_loss_6: struct<rating: string, rating_numeric: float>, rating: string, data_loss_36plus: struct<rating: string, rating_numeric: float>, rating_numeric: float,  data_loss_36: struct<rating: string, rating_numeric: float>, data_loss_12: struct<rating: string, rating_numeric: float>, data_loss_24: struct<rating: string, rating_numeric: float>>,
system_hosting struct<host_hosting_providers: struct<rating: string, rating_numeric: float>,  hosting_countries: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
defensibility struct<attack_surface_web_ip: struct<rating: string, rating_numeric: float>, shared_hosting: struct<rating: string, rating_numeric: float>, defensibility_hosting_providers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, attack_surface_web_hostname: struct<rating: string, rating_numeric: float>>,
software_patching struct<patching_web_cms: struct<rating: string, rating_numeric: float>, rating: string, patching_web_server: struct<rating: string, rating_numeric: float>, patching_vuln_open_ssl: struct<rating: string, rating_numeric: float>, patching_app_server: struct<rating: string, rating_numeric: float>, rating_numeric: float>,
governance struct<governance_customer_base: struct<rating: string, rating_numeric: float>, governance_security_certifications: struct<rating: string, rating_numeric: float>, governance_regulatory_requirements: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS orc
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';

I am trying to use the JSON SerDe to parse the rows. After I saved some data into the table, I got the following error when querying it:

Error: java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text (state=,code=0)

I am not sure whether I am going about this the right way.

I am also open to any other way of storing the data into a table. Any help would be appreciated. Thank you.


1 Answer


This happens because you are mixing ORC as the storage (STORED AS orc) with JSON as the SerDe (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). That clause overrides ORC's default SerDe (OrcSerde), but not its input format (OrcInputFormat) and output format (OrcOutputFormat).

You either need to use ORC storage without overriding its default SerDe. In that case, make sure your Spark application writes the table data as ORC, not JSON.
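A minimal sketch of that first option, with the column list shortened to a single column for brevity (the full struct definitions from the question would go in its place):

```sql
-- Sketch: omit ROW FORMAT SERDE so ORC keeps its default OrcSerde.
CREATE EXTERNAL TABLE jsons (
  web_app_security struct<rating: string, rating_numeric: float>
)
STORED AS ORC
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';
```

The Spark side must then produce ORC files at that location, e.g. with df.write.format("orc"), instead of writing JSON text.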

Or, if you want the data stored as JSON, keep the JsonSerDe but use plain text files as the storage (STORED AS TEXTFILE).
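A sketch of that second option, again with the column list shortened for brevity; each line of the text files is expected to hold one JSON document:

```sql
-- Sketch: JSON storage requires plain text files, not ORC.
CREATE EXTERNAL TABLE jsons (
  web_app_security struct<rating: string, rating_numeric: float>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis';
```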


The Hive Developer Guide explains how SerDes and storage formats work together: https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe

Answered 2017-07-16T23:34:56.320