I have set up an interactive Hive session and loaded Apache weblog data directly from an S3 bucket into a table:
DROP TABLE apachelog;
CREATE EXTERNAL TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE
LOCATION 's3n://OperationOverkill/';
I can then select from it successfully like this:
SELECT * FROM apachelog LIMIT 5;
But a count (or anything else that requires an actual map-reduce job) fails:
SELECT COUNT(host) FROM apachelog;
The error message:
Job Submission failed with exception 'java.io.IOException(cannot find dir = s3n://OperationOverkill/access_clickkiller_12-08-08.log in pathToPartitionInfo: s3n://OperationOverkill/)'
I googled this and found a similar issue on the AWS Support forums, but I was hoping to get faster pointers or help from SO.