I am following the Athena getting started guide and trying to parse my own CloudFront logs. However, the fields are not being parsed.

I used a small test file, shown below:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type
2016-02-02  07:57:45    LHR5    5001    86.177.253.38   GET d3g47gpj5mj0b.cloudfront.net    /foo    404 -   Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36   -   -   Error   -tHYQ3YpojqpR8yFHCUg5YW4OC_yw7X0VWvqwsegPwDqDFkIqhZ_gA==    d3g47gpj5mj0b.cloudfront.net    https   421 0.076   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error
2016-02-02  07:57:45    LHR5    1158241 86.177.253.38   GET d3g47gpj5mj0b.cloudfront.net    /images/posts/cover/404.jpg 200 https://d3g47gpj5mj0b.cloudfront.net/foo    Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36   -   -   Miss    oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA==    d3g47gpj5mj0b.cloudfront.net    https   419 0.440   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss

and created the table with this SQL:

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `Date` DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  os STRING,
  Browser STRING,
  BrowserVersion STRING
  ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
  ) LOCATION 's3://test/athena-csv/'

But no data comes back:

[Screenshot: Athena query results with no data]

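I can see that it returns 4 rows, but the first 2 should have been excluded because they start with #, so it looks as if the regular expression is not being applied correctly.

Am I doing something wrong? Or is the regex wrong (which seems unlikely, since it comes from the docs and looks fine to me)?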
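
One thing worth checking: the tail of the tutorial regex expects a literal `%20` in the user-agent field, while the sample rows above contain the double-encoded form `%2520`. The Python sketch below applies only that tail to the user-agent value from the test file. It uses Python's `re` module as a stand-in for the Java regex engine behind RegexSerDe, which is an assumption, and `UA_TAIL` and `raw_ua` are names made up for this illustration.

```python
import re
from urllib.parse import unquote

# Tail of the tutorial regex: the part meant to capture
# os, Browser and BrowserVersion from the user-agent field.
UA_TAIL = re.compile(r"[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$")

# User-agent value copied from the sample log lines above.
raw_ua = (
    "Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)"
    "%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)"
    "%2520Chrome/47.0.2526.111%2520Safari/537.36"
)

# The double-encoded value never contains the literal "%20", so the tail fails.
no_match = UA_TAIL.match(raw_ua)

# Decoding one level turns "%2520" into "%20", and the same tail then matches.
once_decoded = unquote(raw_ua)
match = UA_TAIL.match(once_decoded)

print(no_match)
print(match.groups() if match else None)
```

If the tail cannot match, the whole expression cannot match either, which would be consistent with every column coming back empty. This does not rule out other causes, such as the delimiter handling.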

6 Answers

This is what I ended up with:

CREATE EXTERNAL TABLE logs (
  `date` date,
  `time` string,
  `location` string,
  `bytes` int,
  `request_ip` string,
  `method` string,
  `host` string,
  `uri` string,
  `status` int,
  `referer` string,
  `useragent` string,
  `uri_query` string,
  `cookie` string,
  `edge_type` string,
  `edget_requiest_id` string,
  `host_header` string,
  `cs_protocol` string,
  `cs_bytes` int,
  `time_taken` string,
  `x_forwarded_for` string,
  `ssl_protocol` string,
  `ssl_cipher` string,
  `result_type` string,
  `protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '^(?!#.*)(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s*(\\S*)'
) LOCATION 's3://logs'

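Note that the double backslashes are intentional.

The CloudFront log format changed at some point, adding the `protocol` field. This definition handles both the older and the newer files.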
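
A rough way to see why the optional trailing group copes with both layouts is to rebuild the same pattern shape in Python and feed it a 23-field and a 24-field row. This is only a sketch: it assumes the SerDe requires the whole line to match (hence `fullmatch`), uses Python's `re` in place of Java's regex engine, and the rows with placeholder values `f1` to `f23` are hypothetical.

```python
import re

# 23 mandatory whitespace-separated fields plus one optional trailing field,
# mirroring the structure of the input.regex above.
PATTERN = re.compile(
    r"^(?!#.*)" + r"\s+".join([r"(\S+)"] * 23) + r"\s*(\S*)"
)

def parse(line):
    """Return the 24 captured fields, or None when the line does not match."""
    m = PATTERN.fullmatch(line)
    return list(m.groups()) if m else None

# Hypothetical rows: f1..f23 stand in for real field values.
old_row = "\t".join(f"f{i}" for i in range(1, 24))  # older layout, 23 fields
new_row = old_row + "\tHTTP/1.1"                    # newer layout, 24 fields

old_fields = parse(old_row)
new_fields = parse(new_row)
header = parse("#Version: 1.0")

print(repr(old_fields[-1]))  # last column is empty for the older layout
print(repr(new_fields[-1]))  # last column holds the extra field
print(header)                # header lines do not match at all
```

The final `\s*(\S*)` matches an empty string when the extra field is absent, so older rows still produce 24 columns with an empty last value.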

Answered 2017-03-23T19:12:56.573

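Actually, all the answers here have a small error: the 4th field must be BIGINT, not INT. Otherwise, requests for files larger than 2 GB will not be parsed correctly. After a long discussion with AWS Business Support, the correct format appears to be: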

CREATE EXTERNAL TABLE your_table_name (
  `Date` DATE,
  Time STRING,
  Location STRING,
  SCBytes BIGINT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  UserAgent STRING,
  UriQS STRING,
  Cookie STRING,
  ResultType STRING,
  RequestId STRING,
  HostHeader STRING,
  Protocol STRING,
  CSBytes BIGINT,
  TimeTaken FLOAT,
  XForwardFor STRING,
  SSLProtocol STRING,
  SSLCipher STRING,
  ResponseResultType STRING,
  CSProtocolVersion STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://path_to_your_data_directory'
TBLPROPERTIES ('skip.header.line.count' = '2')
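
The 2 GB limit comes from the range of a signed 32-bit integer. The short Python check below illustrates the arithmetic; the 3 GiB byte count is a hypothetical value, and `fits` is a helper invented for this example.

```python
# Upper bounds of signed 32-bit and 64-bit integers,
# which correspond to the INT and BIGINT column types.
INT_MAX = 2**31 - 1      # 2,147,483,647
BIGINT_MAX = 2**63 - 1

def fits(value, max_value):
    """True when a non-negative byte count does not exceed the maximum."""
    return 0 <= value <= max_value

# Hypothetical sc-bytes value for a 3 GiB response.
three_gib = 3 * 1024**3  # 3,221,225,472 bytes

print(fits(three_gib, INT_MAX))     # False
print(fits(three_gib, BIGINT_MAX))  # True
```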
Answered 2017-05-16T14:12:29.860

After pulling my hair out over this, and improving on @CoderDan's answer:

The secret is to separate values on \t in the regex instead of \s.

CREATE EXTERNAL TABLE IF NOT EXISTS mytablename (
  `date` date,
  `time` string,
  `location` string,
  `bytes` int,
  `request_ip` string,
  `method` string,
  `host` string,
  `uri` string,
  `status` int,
  `referer` string,
  `useragent` string,
  `uri_query` string,
  `cookie` string,
  `edge_type` string,
  `edget_request_id` string,
  `host_header` string,
  `cs_protocol` string,
  `cs_bytes` int,
  `time_taken` int,
  `x_forwarded_for` string,
  `ssl_protocol` string,
  `ssl_cipher` string,
  `result_type` string,
  `protocol_version` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '^(?!#.*)(?!#.*)([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)$'
) LOCATION 's3://mybucket/myprefix/';
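
To illustrate the difference between the two separators, the Python sketch below splits one row both ways. The row is hypothetical and deliberately contains a value with a literal space; whether real CloudFront logs ever contain such a value is not something this example claims.

```python
import re

# Hypothetical tab-separated row whose third value contains a space.
row = "2016-02-02\t07:57:45\tsome value\t5001"

by_tab = row.split("\t")               # split on tab characters only
by_whitespace = re.split(r"\s+", row)  # split on any run of whitespace

print(len(by_tab), by_tab)
print(len(by_whitespace), by_whitespace)
```

Splitting on tabs keeps the four values intact, while splitting on generic whitespace produces five pieces and shifts every later column by one.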
Answered 2017-03-18T06:57:27.987

The demo didn't work for me either. After playing with it for a while, I got the following to work:

CREATE EXTERNAL TABLE IF NOT EXISTS DBNAME.TABLENAME (
  `date` date,
  `time` string,
  `location` string,
  `bytes` int,
  `request_ip` string,
  `method` string,
  `host` string,
  `uri` string,
  `status` int,
  `referer` string,
  `useragent` string,
  `uri_query` string,
  `cookie` string,
  `edge_type` string,
  `edget_requiest_id` string,
  `host_header` string,
  `cs_protocol` string,
  `cs_bytes` int,
  `time_taken` string,
  `x_forwarded_for` string,
  `ssl_protocol` string,
  `ssl_cipher` string,
  `result_type` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '^(?!#.*)(?!#.*)([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)$'
) LOCATION 's3://bucket/logs/';

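Replace bucket/logs and DBNAME.TABLENAME with your own values. For some reason it still inserts empty rows for the lines starting with #, but I got the rest of the data.

I think the next step is to try writing one for the user agent or the cookie.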
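
The empty rows for the # lines are consistent with how Hive's RegexSerDe is documented to behave: a line that does not match the regex is returned as a row with every column NULL rather than being dropped. The Python sketch below mimics that behaviour with a shortened three-column pattern. `deserialize` and `NUM_COLUMNS` are invented for this illustration, and Python's `re` stands in for the Java regex engine.

```python
import re

NUM_COLUMNS = 3  # shortened for the example; the table above has many more
PATTERN = re.compile(
    r"^(?!#.*)" + r"\s+".join([r"(\S+)"] * NUM_COLUMNS) + "$"
)

def deserialize(line):
    """Mimic the assumed SerDe behaviour: an all-None row when the regex fails."""
    m = PATTERN.fullmatch(line)
    return m.groups() if m else (None,) * NUM_COLUMNS

lines = [
    "#Version: 1.0",
    "#Fields: date time x-edge-location",
    "2016-02-02\t07:57:45\tLHR5",
]
rows = [deserialize(line) for line in lines]
for row in rows:
    print(row)

# Dropping the all-None rows afterwards, comparable to filtering on
# a column being NOT NULL in a query.
data_rows = [row for row in rows if row[0] is not None]
print(data_rows)
```

Under that assumption the negative lookahead only makes the header lines fail to match; it does not remove them from the result, so filtering out the NULL rows at query time is one way to hide them.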

Answered 2017-03-06T04:51:50.240

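Athena is case-insensitive and treats every column name as lowercase. Try defining your Athena table with lowercase field names and using those names in your queries.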

Answered 2017-03-05T11:23:08.057

This one worked for me. I started from here, but I had to add the "protocol" column.

CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
  `date` DATE,
  time STRING,
  location STRING,
  bytes BIGINT,
  request_ip STRING,
  method STRING,
  host STRING,
  uri STRING,
  status INT,
  referrer STRING,
  user_agent STRING,
  query_string STRING,
  cookie STRING,
  result_type STRING,
  request_id STRING,
  host_header STRING,
  request_protocol STRING,
  request_bytes BIGINT,
  time_taken FLOAT,
  xforwarded_for STRING,
  ssl_protocol STRING,
  ssl_cipher STRING,
  response_result_type STRING,
  http_version STRING,
  fle_status STRING,
  fle_encrypted_fields INT,
  protocol string
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
LOCATION 's3://bucketname/prefix/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )
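
As a local sanity check of the same idea (tab-delimited fields, first two lines skipped), a log file can be read with Python's `csv` module. This is a sketch only: the sample text is a shortened, hypothetical log, and it assumes the table property behaves as its name suggests.

```python
import csv
import io

# Two header lines followed by one shortened, hypothetical data row.
log_text = (
    "#Version: 1.0\n"
    "#Fields: date time x-edge-location sc-bytes\n"
    "2016-02-02\t07:57:45\tLHR5\t5001\n"
)

SKIP_HEADER_LINES = 2  # same idea as 'skip.header.line.count'='2'

# QUOTE_NONE keeps any quote characters in the data as plain text.
reader = csv.reader(io.StringIO(log_text), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)[SKIP_HEADER_LINES:]

print(rows)
```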
Answered 2019-09-27T17:56:44.257