The nginx log used here is a website access log, for example:
180.173.250.74 - - [08/Jan/2015:12:38:08 +0800] "GET /avatar/xxx.png HTTP/1.1" 200 968 "http://www.iteblog.com/archives/994" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
This log line has nine columns, separated by spaces. They are, in order: client IP, user identity, user name, access time, request, status code, size of the returned file, referer, and browser user agent. Parsing this log with ordinary string handling is awkward, but matching the nine columns with a regular expression is straightforward:
([^ ]*) ([^ ]*) ([^ ]*) (\[.*\]) (".*?") (-|[0-9]*) (-|[0-9]*) (".*?") (".*?")
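Before wiring this regex into Hive, it is worth checking that it really captures all nine columns. A minimal sketch in Python (the sample log line is the one above; note that the timestamp field must be matched with escaped brackets, `(\[.*\])`, since `([.*])` would be a character class):

```python
import re

# The nine-column access-log regex; (\[.*\]) matches the bracketed timestamp.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (\[.*\]) (".*?") '
    r'(-|[0-9]*) (-|[0-9]*) (".*?") (".*?")'
)

line = ('180.173.250.74 - - [08/Jan/2015:12:38:08 +0800] '
        '"GET /avatar/xxx.png HTTP/1.1" 200 968 '
        '"http://www.iteblog.com/archives/994" '
        '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"')

m = LOG_PATTERN.match(line)
host, identity, user, time, request, status, size, referer, agent = m.groups()
print(host)    # 180.173.250.74
print(status)  # 200
print(size)    # 968
```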
This matches out the value of each column. In Hive we can specify a parser (SerDe) for input files, and Hive ships with a built-in regex parser, org.apache.hadoop.hive.contrib.serde2.RegexSerDe, which we can use directly. The full CREATE TABLE statement can be written as follows:
CREATE TABLE logs (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (\\[.*\\]) (\".*?\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\")",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
Place the log files under this table's directory; both gzipped and uncompressed files can be read by Hive directly. We can then use the following query to find, for each hour, the IPs with more than 20 requests:
hive> select substring(time, 2, 14) date, host, count(*) as count from logs group by substring(time, 2, 14), host having count > 20 sort by date, count;
29/Dec/2014:00	47.18.236.106	24
29/Dec/2014:02	81.215.34.45	70
29/Dec/2014:04	66.249.64.18	23
29/Dec/2014:04	66.249.64.22	24
29/Dec/2014:09	119.145.14.213	44
29/Dec/2014:09	113.90.78.63	52
29/Dec/2014:10	106.39.255.133	26
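The same per-hour aggregation can be sketched in plain Python, which also shows what substring(time, 2, 14) extracts: Hive's substring is 1-based, so starting at position 2 for 14 characters skips the leading "[" and keeps "dd/Mon/yyyy:HH". The `parsed_logs` list below is hypothetical toy data, and the threshold is lowered from the article's 20 so the toy data produces a result:

```python
from collections import Counter

# Hypothetical (time, host) pairs as produced by the nine-column regex.
parsed_logs = [
    ('[29/Dec/2014:02:01:10 +0800]', '81.215.34.45'),
    ('[29/Dec/2014:02:05:33 +0800]', '81.215.34.45'),
    ('[29/Dec/2014:03:17:42 +0800]', '10.0.0.1'),
]

# time[1:15] is the Python equivalent of Hive's substring(time, 2, 14).
counts = Counter((time[1:15], host) for time, host in parsed_logs)

threshold = 1  # the article uses 20; lowered here for the toy data
result = sorted(
    ((hour, host, n) for (hour, host), n in counts.items() if n > threshold),
    key=lambda row: (row[0], row[2]),
)
print(result)  # [('29/Dec/2014:02', '81.215.34.45', 2)]
```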