regex - Why does my Tcl regular expression perform so badly compared to Perl?

Question

set fr [open "x.txt" r]
set fw [open "y.txt" w]
set myRegex {^([0-9]+) ([0-9:]+\.[0-9]+).* ABC\.([a-zA-Z]+)\[([0-9]+)\] DEF\(([a-zA-Z]+)\) HIJ\(([0-9]+)\) KLM\(([0-9\.]+)\) NOP\(([0-9]+)\) QRS\(([0-9]+)\)}
while { [gets $fr line] >= 0 } {
   if { [regexp $myRegex $line match x y w z]} {
       if { [expr $D >> 32] == [lindex $argv 0]} {
         puts $fw "$x"
       }
   }
}
close $fr $fw

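The Tcl code above takes forever (32 seconds or more) to execute. Doing basically the same thing in Perl takes 3 seconds or less. I know Perl performs better for some regular expressions, but is Tcl's performance really that bad in comparison? More than 10 times worse?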

By the way, I am using Tcl 8.4.

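Here are the timings for running the code above with the full regular expression and with progressively shortened versions of the same regular expression: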

32s is the time taken for the above code to execute
22s after removing: QRS\(([0-9]+)\) 
17s after removing: NOP\(([0-9]+)\) QRS\(([0-9]+)\)
13s after removing: KLM\(([0-9\.]+)\) NOP\(([0-9]+)\) QRS\(([0-9]+)\)
9s  after removing: HIJ\(([0-9]+)\) KLM\(([0-9\.]+)\) NOP\(([0-9]+)\) QRS\(([0-9]+)\)
6s  after removing: DEF\(([a-zA-Z]+)\) HIJ\(([0-9]+)\) KLM\(([0-9\.]+)\) NOP\(([0-9]+)\) QRS\(([0-9]+)\)

score 6 · Accepted Answer

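The problem is that you have a lot of capturing and backtracking in that RE; that particular combination works poorly with the Tcl RE engine. At one level, the cause is that Tcl uses a completely different type of RE engine from Perl (though it does better than Perl on other REs; this is an area where the details matter).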

If you can, get rid of the .* near the start of the RE:

^([0-9]+) ([0-9:]+\.[0-9]+).* ABC\.([a-zA-Z]+)\[([0-9]+)\] DEF\(([a-zA-Z]+)\) HIJ\(([0-9]+)\) KLM\(([0-9\.]+)\) NOP\(([0-9]+)\) QRS\(([0-9]+)\)
                           ^^

That is the real cause of the trouble. Replace it with something more exact, such as:

(?:[^A]|A[^B]|AB[^C])*
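A minimal sketch of how one might compare the two forms on a single line, using the core time command. The sample line is invented to fit the pattern in the question (the real x.txt is not shown), the RE is cut off after the ABC field to keep it short, and the names head, mid, tail, loose and tight are placeholders. Timings depend on the Tcl version and the data, so none are quoted here.

```tcl
# Hypothetical input line, made up to fit the pattern in the question.
set line {12345 10:20:30.123 some text ABC.foo[4294967296] DEF(bar) HIJ(1) KLM(1.5) NOP(2) QRS(3)}

# Shared pieces of the RE, shortened to stop after the ABC field.
set head {^([0-9]+) ([0-9:]+\.[0-9]+)}
set tail { ABC\.([a-zA-Z]+)\[([0-9]+)\]}

# The replacement sub-pattern suggested above, used in place of ".*".
set mid {(?:[^A]|A[^B]|AB[^C])*}

set loose "${head}.*${tail}"
set tight "${head}${mid}${tail}"

# Both forms should match and capture the same first and last fields.
set ok1 [regexp $loose $line -> a1 b1 c1 d1]
set ok2 [regexp $tight $line -> a2 b2 c2 d2]
puts "loose: matched=$ok1 first=$a1 last=$d1"
puts "tight: matched=$ok2 first=$a2 last=$d2"

# Time each form over many iterations; compare the two reported figures.
puts "loose: [time {regexp $loose $line} 10000]"
puts "tight: [time {regexp $tight $line} 10000]"
```

A single short line may not reproduce the slowdown seen on the real file, so it is worth running the same comparison against a sample of the actual data.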

Also, reduce the number of capturing groups in the RE to just the ones you need. You can probably convert the code overall to:

set fr [open "x.txt" r]
set fw [open "y.txt" w]
set myRegex {^([0-9]+) (?:[0-9:]+\.[0-9]+)(?:[^A]|A[^B]|AB[^C])* ABC\.(?:[a-zA-Z]+)\[([0-9]+)\] DEF\((?:[a-zA-Z]+)\) HIJ\((?:[0-9]+)\) KLM\((?:[0-9\.]+)\) NOP\((?:[0-9]+)\) QRS\((?:[0-9]+)\)}
while { [gets $fr line] >= 0 } {
    # I've combined the [if]s and the [expr]
    if { [regexp $myRegex $line -> A D] && $D >> 32 == [lindex $argv 0]} {
        puts $fw "$A"
    }
}
# [close] accepts a single channel, so close each file separately
close $fr
close $fw

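Also note that if { [expr ...] } is a suspicious code smell, as is any expression without braces. (It is sometimes necessary in very specific circumstances, but it almost always indicates that the code is overcomplicated.)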
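A small sketch of what bracing changes, assuming made-up values for D and for the script argument (called wanted here). Without braces the Tcl parser substitutes the variable first and expr then parses the result a second time; with braces expr does the substitution itself, once.

```tcl
# Hypothetical values standing in for a captured field and the script argument.
set D 4294967296
set wanted 1

# Unbraced: Tcl substitutes $D first, then expr parses the resulting string again.
set unbraced [expr $D >> 32]

# Braced: expr receives the text verbatim and substitutes $D itself.
set braced [expr {$D >> 32}]

# The condition of [if] is already evaluated as an expression,
# so wrapping it in [expr] adds nothing.
if {$D >> 32 == $wanted} {
    puts "match: unbraced=$unbraced braced=$braced"
}

# Where the double parse becomes visible: the variable holds text, not a number.
set s {1 + 1}
puts [expr $s]     ;# the contents of s are re-parsed as an expression
puts [expr {$s}]   ;# the contents of s are treated as a plain string value
```

For plain numeric values both forms give the same result; the braced form avoids the second parse, which is also what allows the expression to be compiled.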
