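I want to parse log files generated by the FidoNet mailer binkd. They are multi-line and, worse, interleaved: several instances can write to the same log file, for example: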
27 Dec 16:52:40 [2484] BEGIN, binkd/1.0a-545/Linux -iq /tmp/binkd.conf
+ 27 Dec 16:52:40 [2484] session with 123.45.78.9 (123.45.78.9)
- 27 Dec 16:52:41 [2484] SYS BBSName
- 27 Dec 16:52:41 [2484] ZYZ First LastName
- 27 Dec 16:52:41 [2484] LOC City, Country
- 27 Dec 16:52:41 [2484] NDL 115200,TCP,BINKP
- 27 Dec 16:52:41 [2484] TIME Thu, 27 Dec 2012 21:53:22 +0600
- 27 Dec 16:52:41 [2484] VER binkd/0.9.6a-173/Win32 binkp/1.1
+ 27 Dec 16:52:43 [2484] addr: 2:1234/56.78@fidonet
- 27 Dec 16:52:43 [2484] OPT NDA CRYPT
+ 27 Dec 16:52:43 [2484] Remote supports asymmetric ND mode
+ 27 Dec 16:52:43 [2484] Remote requests CRYPT mode
- 27 Dec 16:52:43 [2484] TRF 0 0
*+ 27 Dec 16:52:43 [1520] done (from 2:456/78@fidonet, OK, S/R: 0/0 (0/0 bytes))*
+ 27 Dec 16:52:43 [2484] Remote has 0b of mail and 0b of files for us
+ 27 Dec 16:52:43 [2484] pwd protected session (MD5)
- 27 Dec 16:52:43 [2484] session in CRYPT mode
+ 27 Dec 16:52:43 [2484] done (from 2:1234/56.78@fidonet, OK, S/R: 0/0 (0/0 bytes))
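So the log file is not only multi-line, with an unpredictable number of lines per session, but records from different sessions can also be interleaved; here, session 1520 finishes in the middle of session 2484. What is the right direction for parsing a file like this in Hadoop? Or should I parse it line by line, then somehow merge the lines into one record per session, and later write those records to an SQL database with a separate set of jobs?
Thanks.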
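To make the line-by-line idea concrete, here is a minimal sketch in plain Python (not Hadoop-specific) of what merging by session could look like. It assumes that the number in square brackets identifies the session, that a session ends with a line whose message starts with `done (`, and that the leading `+`/`-` marker is optional, all inferred only from the excerpt above. The names `LINE_RE` and `group_sessions` are made up for illustration.

```python
import re

# Assumed line layout, inferred from the excerpt:
#   [marker ]DD Mon HH:MM:SS [pid] message
LINE_RE = re.compile(
    r"^(?:(?P<level>\S) )?"                       # optional one-character marker
    r"(?P<ts>\d{1,2} \w{3} \d{2}:\d{2}:\d{2}) "   # timestamp (the log has no year)
    r"\[(?P<pid>\d+)\] "                          # process id, used as the session key
    r"(?P<msg>.*)$"
)


def group_sessions(lines):
    """Yield (pid, [(timestamp, message), ...]) once per finished session."""
    open_sessions = {}
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        pid = m.group("pid")
        open_sessions.setdefault(pid, []).append((m.group("ts"), m.group("msg")))
        # Assumption: a "done (" message closes the session for this pid.
        if m.group("msg").startswith("done ("):
            yield pid, open_sessions.pop(pid)
    # Sessions that never logged "done" are emitted at the end as-is.
    for pid, entries in open_sessions.items():
        yield pid, entries
```

In a Hadoop Streaming job the same split would presumably become a mapper that emits the bracketed number as the key and a reducer that assembles the lines for each key. Since process ids can be reused over time, the key would likely need something extra (for example the date) to keep unrelated sessions apart.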