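I am trying to use an EMR job to import data from JSON files in S3 that contain sparse fields, for example an ios_os field and an android_os field of which only one contains data. Sometimes the value is null and sometimes it is an empty string, and when I try to insert into DynamoDB I get an error (although I am able to insert some sparsely populated records):
"AttributeValue may not contain an empty string" {"created_at_timestamp":1358122714,...,"data":null,"type":"e","android_network_carrier":""}
I filtered out the columns that hold the empty string "", but I still get the error. I am assuming it is the "property": null values that cause this (or both). I assume that for this to work, those values should not be present at all when the record goes to DynamoDB?
Is there any way, through the JSONSerde or through Hive's interaction with the DynamoDB table, to tell Hive to ignore empty string attribute values?
Here is a sample of the Hive SQL schema and the insert command: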
CREATE EXTERNAL TABLE IF NOT EXISTS json_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
PARTITIONED BY (created_at BIGINT, type STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
WITH SERDEPROPERTIES (
-- Common
"created_at"="$.created_at",
"data"="$.data",
"android_network_carrier"="$.anw",
"type"="$.dt"
)
LOCATION 's3://test.data/json_events';
CREATE EXTERNAL TABLE IF NOT EXISTS dynamo_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test-events",
"dynamodb.column.mapping" = "created_at:created_at,data:data,type:type,android_network_carrier:android_network_carrier");
ALTER TABLE json_events RECOVER PARTITIONS;
INSERT OVERWRITE TABLE dynamo_events
-- Columns listed in the order dynamo_events declares them (INSERT ... SELECT maps by position)
SELECT created_at,
data,
type,
android_network_carrier
FROM json_events
WHERE created_at = 20130114 AND type = 'e';
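One workaround I am considering, shown here only as an untested sketch: turn empty strings into NULL in the SELECT itself, so that no column reaches DynamoDB as "". This assumes the DynamoDB storage handler leaves NULL columns out of the item instead of writing them, which I have not confirmed, and it assumes none of these columns is part of the DynamoDB table's key. The table and column names are the ones from the statements above.

```sql
-- Untested sketch: map '' to NULL before the row reaches the storage handler.
-- Assumption (not confirmed): NULL columns are omitted from the DynamoDB item.
-- Columns are listed in the order dynamo_events declares them,
-- because INSERT ... SELECT matches columns by position.
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
       CASE WHEN data = '' THEN NULL ELSE data END,
       type,
       CASE WHEN android_network_carrier = '' THEN NULL ELSE android_network_carrier END
FROM json_events
WHERE created_at = 20130114 AND type = 'e';
```

If the handler does write NULL columns as attributes, this would not help, which is part of what I am asking.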