I used the regex101 website to validate my regular expression:

([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)" "(.*?)" "(.*?)"

It matches the log line below:

66.240.70.141 - - [01/Mar/2018:06:16:46 +0000] "GET /example.download.handler.com/products/01/00/item/116314/8/002394857_2BB.jpg HTTP/1.1" 200 41710 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB30P) AppleWebKit/536.37 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-" "C0T1_19610|3881001|"
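
The regex101 check can also be reproduced offline. The following is a minimal sketch, assuming Python's `re` module as a stand-in for the Java regex engine that Hive uses (the two should agree for the constructs in this pattern). It applies the pattern to the sample line and prints each capture group. Note that the pattern defines nine groups, so it is worth comparing that number with the number of columns in the table.

```python
import re

# Pattern copied from the question, written as raw strings so that the
# backslashes reach the regex engine unchanged.
PATTERN = re.compile(
    r'([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) '
    r'"(.*?)" "(.*?)" "(.*?)" "(.*?)"'
)

# Sample log line copied from the question.
LINE = (
    '66.240.70.141 - - [01/Mar/2018:06:16:46 +0000] '
    '"GET /example.download.handler.com/products/01/00/item/116314/8/002394857_2BB.jpg HTTP/1.1" '
    '200 41710 "-" '
    '"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB30P) AppleWebKit/536.37 '
    '(KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 '
    '(compatible; Googlebot/2.1; +http://www.google.com/bot.html)" '
    '"-" "C0T1_19610|3881001|"'
)

# fullmatch() requires the whole line to match, not just a prefix.
match = PATTERN.fullmatch(LINE)
group_count = PATTERN.groups

print("matched:", match is not None)
print("capture groups:", group_count)
if match is not None:
    for index, value in enumerate(match.groups(), start=1):
        print(index, value)
```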

However, the same expression does not work in Hive:

CREATE EXTERNAL TABLE `web_logs_test`(   
`ip_address`  string COMMENT '',   
`date_string` string COMMENT '',   
`request`     string COMMENT '', 
`status`      string COMMENT '',   
`bytes`       string COMMENT '',   
`referer`     string COMMENT '',   
`user_agent`  string COMMENT '',   
`cookie`      string COMMENT ''
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES (  
'input.regex'='([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) "(.*?)" "(.*?)" "(.*?)" "(.*?)"'
)
STORED AS 
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/weblogs/data'

If anyone knows what is wrong, please help.

Thanks in advance.

1 Answer

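-- Backslashes are doubled because the pattern is written inside a Hive string literal.
-- The last capture group expects a cookie field of the form "SESSIONID=<digits>".
CREATE EXTERNAL TABLE web_logs (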
  ip_address STRING,
  date_string STRING,
  request STRING,
  status STRING,
  bytes STRING,
  referer STRING,
  user_agent STRING,
  cookie STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
   "input.regex" = "^([\\d.]+) \\S+ \\S+ \\[(.+?)\\] \\\"(.+?)\\\" (\\d{3}) (\\d+) \\\"(.+?)\\\" \\\"(.+?)\\\" \\\"SESSIONID=(\\d+)\\\"\\s*"
)
LOCATION '/file_location/web_logs';
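
As a rough illustration of the escaping involved, here is a sketch in Python, used as a stand-in for Java's regex engine. The helper `to_hive_literal` is hypothetical (it is not part of Hive). It doubles every backslash, which is the form a pattern generally needs inside a Hive string literal so that `\d` reaches the regex engine as `\d`. The sketch also checks the pattern from the DDL above, with its doubled backslashes reduced, against the sample line from the question. Because that pattern expects a `SESSIONID=<digits>` cookie, it does not match the sample as written and would need adapting to the actual log format.

```python
import re


def to_hive_literal(regex: str) -> str:
    """Hypothetical helper: double each backslash for use in a Hive string literal."""
    return regex.replace("\\", "\\\\")


# Pattern from the question, exactly as tested on regex101.
QUESTION_REGEX = (
    r'([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) '
    r'"(.*?)" "(.*?)" "(.*?)" "(.*?)"'
)

# Pattern from the DDL above, after doubled backslashes are reduced to single ones.
ANSWER_REGEX = (
    r'^([\d.]+) \S+ \S+ \[(.+?)\] \"(.+?)\" (\d{3}) (\d+) '
    r'\"(.+?)\" \"(.+?)\" \"SESSIONID=(\d+)\"\s*'
)

# Sample log line copied from the question.
LINE = (
    '66.240.70.141 - - [01/Mar/2018:06:16:46 +0000] '
    '"GET /example.download.handler.com/products/01/00/item/116314/8/002394857_2BB.jpg HTTP/1.1" '
    '200 41710 "-" '
    '"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB30P) AppleWebKit/536.37 '
    '(KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 '
    '(compatible; Googlebot/2.1; +http://www.google.com/bot.html)" '
    '"-" "C0T1_19610|3881001|"'
)

# Text to paste between single quotes as the 'input.regex' value.
hive_literal = to_hive_literal(QUESTION_REGEX)
print(hive_literal)

# The question's pattern matches the sample; the SESSIONID-based one does not.
question_matches = re.fullmatch(QUESTION_REGEX, LINE) is not None
answer_matches = re.fullmatch(ANSWER_REGEX, LINE) is not None
print("question pattern matches:", question_matches)
print("answer pattern matches:", answer_matches)
```

The question's pattern has nine capture groups while the table declares eight columns, so either a ninth column or a non-capturing group for the unused field is probably needed as well.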
answered 2019-11-05T12:18:06.257