
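I'm looking for some help writing Perl code to sort a log file.

I'm a relative newbie to coding in general, and to Perl in particular!

I need to write my code using only core Perl modules if at all possible, but I'm open to CPAN modules if that proves impossible. The log file contains a list of logged messages that need to be rearranged into order. It should be simple, but there are a number of gotchas, and they are giving me trouble with how to design the data structure. The input file is CSV. The output needs to be in the same format, with the messages in timestamp order and the parts of a concatenated message grouped together at the position of the first message part.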

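Gotchas

  1. Messages need to be sorted by timestamp.
  2. If a message has been split across several lines, the final field will contain something like "(part 1 of 3 of message reference 1)". For a given message reference, all the parts need to be kept in order: part 1, then part 2, then part 3, and so on.
  3. The hex number at the start of that field tells me whether it is an 8-bit or a 16-bit reference. An 8-bit reference does not match (as a duplicate) a 16-bit reference with the same reference number, so I need to take that into account.
  4. Message parts may be missing, so we might only get parts 1 and 2 of 3.
  5. Duplicate message reference numbers are possible, so each message reference needs to be tied to the From field to give it a unique identity.
  6. Even with the unique identity from (5), a reference can still repeat over time (there are only so many reference numbers before they reset), so I need to check the time of the last part received as well as the duplicate message reference. If the gap between message parts is more than 3 days, I can treat it as a new message.
  7. Finally, the log file may contain hundreds of thousands of lines to reorder, so loading them all into memory may not be an option.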
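As a rough illustration of gotchas 2, 3 and 5, here is a minimal sketch that turns the final CSV field into a part number, a total, and an identity key. It rests on assumptions that the question does not confirm: that the leading token (for example `BQADAQMB`) is a Base64-encoded header whose second byte is 0x00 for an 8-bit reference and 0x08 for a 16-bit one, and that the text in parentheses is the authoritative source for the part and reference numbers. The function name `parse_concat_info` and the `from|width|reference` key layout are made up for this example. Only the core module `MIME::Base64` is used.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Assumed mapping from the second header byte to the reference width.
my %WIDTH = ( 0x00 => 8, 0x08 => 16 );

# Returns undef for an unsplit message, otherwise a hashref with the part
# number, the total number of parts, and an identity key for grouping.
sub parse_concat_info {
    my ( $from, $info ) = @_;
    return unless defined $info && length $info;

    my ( $header, $part, $total, $ref ) = $info =~
        /^(\S+)\s+\(part (\d+) of (\d+) of message reference (\d+)\)\s*$/
        or return;

    # Decode the leading token and look at its second byte (assumption).
    my @bytes = unpack 'C*', decode_base64($header);
    my $bits  = @bytes > 1 ? $WIDTH{ $bytes[1] } : undef;
    return unless $bits;

    return {
        part  => $part,
        total => $total,
        # Sender, width and number together, so that an 8-bit reference 2
        # and a 16-bit reference 2 from the same sender stay separate.
        key   => join( '|', $from, $bits, $ref ),
    };
}
```

Called in scalar context, the function yields `undef` for lines such as message 1 in the sample data, so a caller can tell split and unsplit messages apart with a single `defined` check.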

It is probably best if I just post some sample input data, followed by how it needs to come out.

Input data

#message uniqueID,From,To,Time,flag,content,IP,concatenation info   
1,"+1231231234","+15125562100","7 Sep 2012 22:08:33","","abcdefghijklmnopqrstuvwxyz",,
2,"+1231231234","+15125562100","7 Sep 2012 22:08:37","","abcdefghijklmnopqrstuvwxyz",,
3,"+1231231234","+15125562100","7 Sep 2012 22:08:41","","abcdefghijklmnopqrstuvwxyz",,
4,"+8888888888","+15125562100","7 Sep 2012 22:09:01","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
5,"+8888888888","+15125562100","7 Sep 2012 22:09:04","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
6,"+8888888888","+15125562100","7 Sep 2012 22:09:05","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"
7,"+8888888888","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
8,"+8888888888","+15125562100","7 Sep 2012 22:09:07",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
10,"+1231231234","+15125562100","7 Sep 2012 22:09:46","","abcdefghijklmnopqrstuvwxyz",,
11,"+1231231234","+15125562100","7 Sep 2012 22:09:50","","abcdefghijklmnopqrstuvwxyz",,
12,"+1231231234","+15125562100","7 Sep 2012 22:09:55","","abcdefghijklmnopqrstuvwxyz",,
13,"+8888888888","+15125562100","13 Sep 2012 22:10:36","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
14,"+8888888888","+15125562100","13 Sep 2012 22:10:38","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
15,"+8888888888","+15125562100","13 Sep 2012 22:10:39","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"
16,"+8888888889","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
17,"+8888888889","+15125562100","7 Sep 2012 22:10:42",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
18,"+8888888889","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIDAw==  (part 3 of 3 of message reference 2)"
19,"+1231231234","+15125562100","13 Sep 2012 20:12:52","","Deposit SMS with readreceiptrequest = false #0",,
20,"+1231231234","+15125562100","13 Sep 2012 20:12:53","","Deposit SMS with readreceiptrequest = false #1",,
21,"+1231231234","+15125562100","13 Sep 2012 20:12:54","","Deposit SMS with readreceiptrequest = false #2",,
22,"+8888888888","+15125562100","13 Sep 2012 20:12:55","","Deposit SMS with readreceiptrequest = false #0: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAAMB  (part 1 of 3 of message reference 0)"
23,"+8888888888","+15125562100","13 Sep 2012 20:12:57","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAAMC  (part 2 of 3 of message reference 0)"
24,"+8888888888","+15125562100","13 Sep 2012 20:12:58","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAAMD  (part 3 of 3 of message reference 0)"
25,"+8888888888","+15125562100","7 Sep 2012 22:10:40","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
26,"+8888888888","+15125562100","7 Sep 2012 22:10:42","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
27,"+8888888888","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIEAw==  (part 2 of 2 of message reference 3)"
28,"+8888888888","+15125562100","13 Sep 2012 20:13:02","","Deposit SMS with readreceiptrequest = false #2: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAgMB  (part 1 of 3 of message reference 2)"
29,"+8888888888","+15125562100","13 Sep 2012 20:13:03","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAgMC  (part 2 of 3 of message reference 2)"
30,"+8888888888","+15125562100","13 Sep 2012 20:13:04","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAgMD  (part 3 of 3 of message reference 2)"
31,"+1231231234","+15125562100","13 Sep 2012 20:13:08","","Deposit SMS with readreceiptrequest = true #0",  

Output data

#message uniqueID,From,To,Time,flag,content,IP,concatenation info   
1,"+1231231234","+15125562100","7 Sep 2012 22:08:33","","abcdefghijklmnopqrstuvwxyz",,
2,"+1231231234","+15125562100","7 Sep 2012 22:08:37","","abcdefghijklmnopqrstuvwxyz",,
3,"+1231231234","+15125562100","7 Sep 2012 22:08:41","","abcdefghijklmnopqrstuvwxyz",,
4,"+8888888888","+15125562100","7 Sep 2012 22:09:01","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
5,"+8888888888","+15125562100","7 Sep 2012 22:09:04","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
6,"+8888888888","+15125562100","7 Sep 2012 22:09:05","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"
16,"+8888888889","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
17,"+8888888889","+15125562100","7 Sep 2012 22:10:42",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
18,"+8888888889","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIDAw==  (part 3 of 3 of message reference 2)"
7,"+8888888888","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
8,"+8888888888","+15125562100","7 Sep 2012 22:09:07",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
10,"+1231231234","+15125562100","7 Sep 2012 22:09:46","","abcdefghijklmnopqrstuvwxyz",,
11,"+1231231234","+15125562100","7 Sep 2012 22:09:50","","abcdefghijklmnopqrstuvwxyz",,
12,"+1231231234","+15125562100","7 Sep 2012 22:09:55","","abcdefghijklmnopqrstuvwxyz",,
25,"+8888888888","+15125562100","7 Sep 2012 22:10:40","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
26,"+8888888888","+15125562100","7 Sep 2012 22:10:42","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
27,"+8888888888","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIEAw==  (part 2 of 2 of message reference 3)"
19,"+1231231234","+15125562100","13 Sep 2012 20:12:52","","Deposit SMS with readreceiptrequest = false #0",,
20,"+1231231234","+15125562100","13 Sep 2012 20:12:53","","Deposit SMS with readreceiptrequest = false #1",,
21,"+1231231234","+15125562100","13 Sep 2012 20:12:54","","Deposit SMS with readreceiptrequest = false #2",,
22,"+8888888888","+15125562100","13 Sep 2012 20:12:55","","Deposit SMS with readreceiptrequest = false #0: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAAMB  (part 1 of 3 of message reference 0)"
23,"+8888888888","+15125562100","13 Sep 2012 20:12:57","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAAMC  (part 2 of 3 of message reference 0)"
24,"+8888888888","+15125562100","13 Sep 2012 20:12:58","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAAMD  (part 3 of 3 of message reference 0)"
28,"+8888888888","+15125562100","13 Sep 2012 20:13:02","","Deposit SMS with readreceiptrequest = false #2: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAgMB  (part 1 of 3 of message reference 2)"
29,"+8888888888","+15125562100","13 Sep 2012 20:13:03","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAgMC  (part 2 of 3 of message reference 2)"
30,"+8888888888","+15125562100","13 Sep 2012 20:13:04","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAgMD  (part 3 of 3 of message reference 2)"
31,"+1231231234","+15125562100","13 Sep 2012 20:13:08","","Deposit SMS with readreceiptrequest = true #0",
13,"+8888888888","+15125562100","13 Sep 2012 22:10:36","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
14,"+8888888888","+15125562100","13 Sep 2012 22:10:38","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
15,"+8888888888","+15125562100","13 Sep 2012 22:10:39","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"

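What I have done so far:

  1. Converted the time field to epoch time to make any comparisons easier.
  2. I can read the file in (and write it out).
  3. I can parse all the CSV columns.
  4. I can split the concatenation info into its parts, i.e. whether it is an 8-bit or 16-bit reference, the part number, the total, and the reference ID.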

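Now I'm stuck on coming up with the best way to filter and sort the data efficiently. I have tried using a hash and loading the whole file into memory first so that I can sort on a particular message reference, but I'm not sure that will work for large files.

Then I thought about reading it line by line, but I could hit the problem that the second line contains the first part of a concatenated SMS and the subsequent parts do not turn up until the very end of the file, so maybe that isn't a good idea either.

I also thought of a database, but I think that would be too complicated to set up on the system this needs to run on. Another option might be to write a package and store the complex structure as an object? Maybe I'm overcomplicating things? My brain is definitely turning to mush!

Anyway, any ideas or guidance would be much appreciated.

I hope the above is clear, but if you have any questions, please ask.

Thanks, Will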


2 Answers


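I don't think this problem is too complicated if it is broken down properly.

As I see it, your sorting program will consist of the following stages:

  1. Extract the relevant information (timestamp and concatenation info) from each line.
  2. Group the lines by message reference, which can be done efficiently with a cache.
  3. Sort the groups by timestamp.
  4. Flatten the groups back into the original lines.

Schwartzian transform

The Schwartzian transform is a common pattern when sorting in Perl. It speeds up sorts in which the sort key has to be extracted from the data being sorted, by extracting that key once per item rather than once per comparison. It is also known as decorate-sort-undecorate.

Example: sorting strings by length. Note that in this case the naive implementation would actually perform better.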

my @words = qw( aaa b cccc );
my @sorted_words = 
    map  { $_->[1]             } # flatten
    sort { $a->[0] <=> $b->[0] } # sort by first field (length)
    map  { [ length $_, $_ ]   } # decorate: return arrayref with key and data
    @words;
print "[@sorted_words]\n"; # prints "[b aaa cccc]"

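It is worth keeping this pattern in mind for your task.

1. Extraction

You have already done this. For each line, we output an array reference, or something similar, with the following fields: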

0: timestamp (in epoch)
1: part no            \
2: total parts        | these are undef if no concat info is present
3: message reference  /
4: The unmodified line

For the CSV extraction you should use Text::CSV; to compute the epoch, take a look at DateTime.
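If CPAN modules are not available, the extraction stage can also be sketched with core modules only. The example below is a minimal sketch under several assumptions: the timestamps are treated as UTC because the log shows no time zone, the CSV columns have already been split into an array reference by whatever parser you use, and the names `to_epoch` and `decorate` are made up for this illustration. It uses `Time::Local` instead of DateTime and produces the five-field record listed above.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MONTH;
@MONTH{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 0 .. 11;

# Convert a stamp such as "7 Sep 2012 22:08:33" to epoch seconds.
# Assumption: the log times are UTC.
sub to_epoch {
    my ($stamp) = @_;
    my ( $day, $mon, $year, $h, $m, $s ) =
        $stamp =~ /^(\d{1,2}) ([A-Z][a-z]{2}) (\d{4}) (\d{2}):(\d{2}):(\d{2})$/
        or return;
    return unless exists $MONTH{$mon};
    return timegm( $s, $m, $h, $day, $MONTH{$mon}, $year );
}

# Build the record [ epoch, part no, total parts, message reference, line ]
# from the already-parsed CSV fields and the unmodified line.
sub decorate {
    my ( $fields, $line ) = @_;
    my ( $part, $total, $ref ) = ( $fields->[7] // '' ) =~
        /\(part (\d+) of (\d+) of message reference (\d+)\)/;
    return [ scalar to_epoch( $fields->[3] ), $part, $total, $ref, $line ];
}
```

For lines without concatenation info the three middle fields stay `undef`, which matches the layout above. A real grouping key would probably also include the sender and the reference width, as the question requires.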

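2. Grouping

We define the cache as a hash, with the message reference as the key and the group as the value. A group is an arrayref in the extraction format specified above, but it may contain further lines at position 5 and beyond (that is, each decorated line is itself a group).

For each decorated line received, we perform the following procedure: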

# pseudocode
# this is how I understood your requirements,
# but it may be wrong. The general principle still holds
# (you may need to choose a different key)
IF the line doesn't have part information, THEN
    pass it on immediately.
ELSE
    IF the hash has an entry for our message reference, THEN
        IF the timestamp of the present group is too old, THEN
            pass on the existing group.
            Add our line for this key.
        ELSE
            Update the group with our line,
            adding the original line (at position 3 + part no),
            but not the metadata to the group.
            IF the group is made complete, THEN
                pass it on immediately,
                delete this entry from the hash.
    ELSE
        Add the line as a group.
        Make sure the content is at position 3 + part no, to allow easy updating.

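Once no more lines arrive, we pass every value remaining in the hash on to the next stage.

The important thing to realize is that you do not have to keep all the lines in memory, only the incomplete groups.

The Perl functions of interest are exists $hash{element} and delete $hash{element}. The delete may be important for saving memory.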
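The caching procedure can be sketched in Perl roughly as follows. This is one possible reading of the pseudocode, not a definitive implementation. It assumes each record is `[ epoch, part, total, key, line ]`, where `key` is undefined for unsplit messages and otherwise already combines everything that identifies a message (sender, reference width, reference number). The names `make_grouper` and `MAX_GAP` are hypothetical. Unlike the pseudocode, it keeps the parts in a separate array per group instead of at position 3 + part number, so that duplicate parts (such as messages 25 and 26 in the sample) can both be retained.

```perl
use strict;
use warnings;

use constant MAX_GAP => 3 * 24 * 60 * 60;    # three days, in seconds

# Returns two closures: $add->($record) and $flush->().
# $emit is called as $emit->( $sort_epoch, \@lines ) for each finished group.
sub make_grouper {
    my ($emit) = @_;
    my %cache;    # identity key => incomplete group

    my $release = sub {
        my ($key) = @_;
        my $group = delete $cache{$key} or return;
        # Slot 0 is unused; missing parts are skipped, duplicates are kept.
        my @lines = map { @$_ } grep { defined } @{ $group->{parts} };
        $emit->( $group->{first}, \@lines );
    };

    my $add = sub {
        my ($record) = @_;
        my ( $epoch, $part, $total, $key, $line ) = @$record;

        # Unsplit messages never wait in the cache.
        return $emit->( $epoch, [$line] ) unless defined $key;

        # A gap of more than three days means the reference was reused.
        $release->($key)
            if $cache{$key} && abs( $epoch - $cache{$key}{last} ) > MAX_GAP;

        my $group = $cache{$key}
            ||= { first => $epoch, last => $epoch, parts => [] };
        push @{ $group->{parts}[$part] }, $line;
        $group->{first} = $epoch if $epoch < $group->{first};
        $group->{last}  = $epoch if $epoch > $group->{last};

        # Complete once every part number from 1 to $total has been seen.
        my $seen = grep { defined $group->{parts}[$_] } 1 .. $total;
        $release->($key) if $seen == $total;
        return;
    };

    # Call once at end of input to hand over the incomplete groups.
    my $flush = sub { $release->($_) for sort keys %cache };

    return ( $add, $flush );
}
```

Each group is emitted with the earliest timestamp seen among its parts, which is what the expected output in the question appears to sort on. A duplicate part that arrives after its group has already been completed would start a new group in this sketch.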

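3. Sorting

We simply sort the groups by timestamp. If the total amount of data is more than the system can handle, we can use a trick:

  1. Sort smaller chunks of the data and save each chunk to a file.
  2. Open every file.
  3. Load the first item from each file.
  4. While at least one file has items remaining:
    1. Sort all the loaded items.
    2. Pass on the first element of the result.
    3. Load the next item from the file that this first element came from.
  5. Pass on the other (already loaded) items in the correct order.

This is time-consuming, however.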
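The merge step of that trick could look something like the sketch below. It assumes a chunk format that is not part of the answer: every line in a chunk file starts with the epoch, followed by a tab and the payload, and each chunk is already sorted. The names `merge_sorted` and `key_of` are made up for this example.

```perl
use strict;
use warnings;

# Sort key of a chunk line: the text before the first tab (assumed epoch).
sub key_of { return ( split /\t/, $_[0], 2 )[0] }

# Merge already-sorted inputs. $handles is an arrayref of open filehandles,
# $emit is called once per line, in overall sorted order.
sub merge_sorted {
    my ( $handles, $emit ) = @_;

    # One pending [ key, line, handle ] entry per non-empty input.
    my @pending;
    for my $fh (@$handles) {
        my $line = readline($fh);
        push @pending, [ key_of($line), $line, $fh ] if defined $line;
    }

    while (@pending) {
        # Re-sort the few loaded items; the smallest one is next in line.
        @pending = sort { $a->[0] <=> $b->[0] } @pending;
        my $head = $pending[0];
        $emit->( $head->[1] );

        # Refill from the same input, or drop the input when it runs dry.
        my $next = readline( $head->[2] );
        if ( defined $next ) {
            ( $head->[0], $head->[1] ) = ( key_of($next), $next );
        }
        else {
            shift @pending;
        }
    }
    return;
}
```

Only one line per chunk file is held in memory at any time, so memory use depends on the number of chunks rather than on the size of the log.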

4. Flattening

At this point we receive only sorted, grouped items. All we have to do is output the lines they contain in the correct order.
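When everything does fit in memory, stages 3 and 4 collapse into a few lines. The sketch below assumes each group is `[ $sort_epoch, \@lines ]` with the lines already in part order; the function name `flatten_sorted` is made up for this example.

```perl
use strict;
use warnings;

# Sort groups by their timestamp, then flatten them back into plain lines.
sub flatten_sorted {
    my (@groups) = @_;    # each group: [ $sort_epoch, \@lines ]
    return map { @{ $_->[1] } }
           sort { $a->[0] <=> $b->[0] } @groups;
}
```

The lines inside a group are never reordered here, so the parts of one message stay together regardless of their individual timestamps.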

answered 2013-02-28T14:00:50.853

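I would do this in two stages: combining the message parts, and sorting. That should simplify the problem somewhat.

First, I would use an external sort utility (the GNU sort tool, for example) to sort by message number. That will at least bring all the parts with the same message number together. A simple sort <inputfile >outputfile will do what you need. All you are really interested in is getting all the parts that start with, for example, 371,"... next to each other.

Then you can write a Perl program that reads that output and accumulates the lines that have the same message number. When you see a different message number, filter the lines you have accumulated to combine the messages from the different parts, and write that record to a file. You will probably want to write the output in a form that is easier to sort, perhaps by putting the field you want to sort on at the front of the record, zero-padded where necessary to simplify sorting.

When that is done you will have a file with one record per line and, if you constructed the records properly, you can run sort <inputfile >outputfile once more to get the data into the order you want.

This also simplifies your programming considerably: you do not have to worry about writing a custom sort for the data. Instead, you write a relatively simple Perl program that transforms the data so that it is easier to sort with existing tools.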
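One way to make records "easier to sort" is to prefix each line with a fixed-width, zero-padded key, so that a plain text sort yields the desired order. The sketch below is only an illustration of that idea. The key layout (group timestamp, a group serial number to keep groups with equal timestamps apart, and the position within the group) and the names `keyed_lines` and `strip_key` are assumptions, not something the answer specifies.

```perl
use strict;
use warnings;

# Prefix every line of a group with a fixed-width key: the group's sort
# epoch, a serial number that keeps groups with equal epochs apart, and
# the line's position inside the group.
sub keyed_lines {
    my ( $epoch, $serial, $lines ) = @_;
    my $position = 0;
    return map {
        sprintf "%010d %08d %04d\t%s", $epoch, $serial, $position++, $_
    } @$lines;
}

# Remove the key again after the external sort has run.
sub strip_key {
    my ($line) = @_;
    $line =~ s/^\d{10} \d{8} \d{4}\t//;
    return $line;
}
```

Because every key component has a fixed width, comparing the lines as text gives the same order as comparing the numbers, which is what lets an external sort utility do the heavy lifting.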

answered 2013-02-28T03:41:46.553