parsing - Determining what slows down the compilation of a lexer

Question

I have a lexer and a parser, built with sedlex and menhir in OCaml, that parse spreadsheet formulas.

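The following part of the lexer defines the regular expressions for the path + workbook + worksheet part that comes before a reference. For example, it should match 'C:\Users\Pictures\[Book1.xlsx]Sheet1'! in ='C:\Users\Pictures\[Book1.xlsx]Sheet1'!A1:B2.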

let first_Latin_identifier_character = [%sedlex.regexp? ('a'..'z') | ('A'..'Z') ]

let path_identifier_character = [%sedlex.regexp? first_Latin_identifier_character | decimal_digit | '_' | '-' | ':' | '\x5C' (* \ *) | ' ' | '&' | '@']
let file_identifier_character = [%sedlex.regexp? first_Latin_identifier_character | decimal_digit | '_' | '-' | ' ' | '.']
let file_suffix = [%sedlex.regexp? ".xls" | ".xlsm" | ".xlsx" | ".XLS" | ".XLSM" | ".XLSX" | ".xlsb" | ".XLSB"]

let sheet_identifier_character_in_quote = [%sedlex.regexp? Compl ('\x3A' | '\x5C' | '\x2F' | '\x3F' | '\x2A' | '\x5B' | '\x5D' | '\x27')]
let sheet_identifier_character_out_quote = [%sedlex.regexp? Compl ('\x3A' | '\x5C' | '\x2F' | '\x3F' | '\x2A' | '\x27' | '\x28' | '\x29' | '\x2B' | '\x2D' | '\x2F' | '\x2C' |'\x3D' | '\x3E' | '\x3C' | '\x3b')]

let lex_file = [%sedlex.regexp? (Star path_identifier_character), '[', (Plus file_identifier_character), file_suffix, ']']
let lex_file_wo_brackets = [%sedlex.regexp? (Star path_identifier_character), (Plus file_identifier_character), file_suffix]
let lex_sheet_in_quote = [%sedlex.regexp? Plus sheet_identifier_character_in_quote]
let lex_file_sheet_in_quote = [%sedlex.regexp? lex_file, lex_sheet_in_quote]

let lex_before = [%sedlex.regexp? 
                    ("'", lex_file_sheet_in_quote, "'!") |
                    ("'", lex_sheet_in_quote, "'!") |
                    (lex_sheet_out_quote, '!') |
                    (lex_file, "!") |
                    (lex_file_wo_brackets, "!") |
                    ("'", lex_file, "'!") |
                    ("'", lex_file_wo_brackets, "'!")]

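Without the last 4 cases of lex_before (i.e., (lex_file, "!") | (lex_file_wo_brackets, "!") | ("'", lex_file, "'!") | ("'", lex_file_wo_brackets, "'!")), the whole project takes 3 minutes 30 seconds to compile with ocamlc, and that time is spent compiling lexer.ml. With these 4 cases added, the total compilation time is 13 minutes 40 seconds. The time is always spent compiling lexer.ml.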

Does anyone know how we can determine what slows down the compilation?

Is there anything wrong with the way I write the named regular expressions that would slow down the compilation?

score 0 · Accepted Answer

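It is hard to say without more information, but the most likely case is that sedlex is having a hard time compiling the regular expression into code, which usually means that your expression contains some complex, hard-to-distinguish cases.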

For example, the lexer has to keep track at every step of whether it could be inside a lex_file or a lex_file_wo_brackets, and of where in it it could be.

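Take the word .x.x.xlx.xls.xls.xls! as an example. It is a valid file without brackets (with an empty path). Decoding it while knowing where you are requires building a graph called a deterministic finite automaton. Building that automaton can take exponential time, especially if you have tricky cases like this one.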
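
To make the exponential growth concrete, here is a toy sketch in plain OCaml. It is not sedlex's code and does not use the regular expressions from the question; it assumes the textbook pattern (a|b)* a (a|b)^n, where the automaton has to guess where the final block starts, much like guessing where the file suffix starts. The function names are made up for this illustration.

```ocaml
(* NFA for (a|b)* a (a|b)^n over the alphabet {a, b}.
   NFA states are 0 .. n+1; a set of NFA states is stored as a bitmask. *)

(* Successor set of a single NFA state [q] on character [c]. *)
let nfa_step n q c =
  if q = 0 then
    (* State 0 loops on every character; on 'a' it also guesses that
       the final block of n+1 characters starts here. *)
    if c = 'a' then 0b11 else 0b01
  else if q <= n then 1 lsl (q + 1) (* states 1..n advance on any character *)
  else 0 (* state n+1 is accepting and has no successors *)

(* Successor of a whole set of NFA states: union of the individual steps. *)
let dfa_step n set c =
  let rec go q acc =
    if q > n + 1 then acc
    else
      let acc =
        if set land (1 lsl q) <> 0 then acc lor nfa_step n q c else acc
      in
      go (q + 1) acc
  in
  go 0 0

(* Subset construction: count the DFA states reachable from the set {0}. *)
let count_dfa_states n =
  let seen = Hashtbl.create 64 in
  let rec explore = function
    | [] -> Hashtbl.length seen
    | set :: rest when Hashtbl.mem seen set -> explore rest
    | set :: rest ->
      Hashtbl.add seen set ();
      explore (dfa_step n set 'a' :: dfa_step n set 'b' :: rest)
  in
  explore [ 1 ]

let () =
  List.iter
    (fun n ->
      Printf.printf "n = %d: %d NFA states, %d DFA states\n" n (n + 2)
        (count_dfa_states n))
    [ 1; 2; 4; 8 ]
```

In this toy case the NFA grows by one state per extra character, while the number of reachable DFA states doubles. Whether the lexer in the question hits the same kind of growth is an assumption that would need to be checked on the real rules.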

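If you have a chance to simplify your language (for instance by adding a prefix to each part, so that the lexer at least knows which branch it is in), you could divide the compilation time by six. Writing cleaner expressions that compile faster may take some practice, but it is worth it.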
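
One possible simplification, different from the prefix idea and only a sketch under assumptions: let the lexer match a looser token (for example, everything up to the `!`) and check the workbook suffix afterwards in ordinary OCaml, so the suffix alternatives no longer have to be encoded in the automaton. `has_workbook_suffix` and `ends_with` are hypothetical helpers written for this example; the suffix list is copied from `file_suffix` in the question.

```ocaml
(* Suffixes copied from [file_suffix] in the question. *)
let workbook_suffixes =
  [ ".xls"; ".xlsm"; ".xlsx"; ".xlsb"; ".XLS"; ".XLSM"; ".XLSX"; ".XLSB" ]

(* [ends_with ~suffix s] is true when [s] ends with [suffix]. *)
let ends_with ~suffix s =
  let ls = String.length s and lx = String.length suffix in
  ls >= lx && String.sub s (ls - lx) lx = suffix

(* Hypothetical post-lexing check: the name must end with a known suffix
   and have at least one character before it, mirroring
   [Plus file_identifier_character, file_suffix]. *)
let has_workbook_suffix name =
  List.exists
    (fun suffix ->
      String.length name > String.length suffix && ends_with ~suffix name)
    workbook_suffixes
```

The trade-off is that the lexer accepts more strings than before, so the semantic action (or the parser) has to reject the ones that fail this check.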

Also, switching to ocamlc.opt (or ocamlopt.opt) will make things go faster.

score 0 · Accepted Answer

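When running ocamlopt on mechanically generated code, I have found that things go much faster if I specify the -linscan option, which requests the linear scan register allocator. I think it helps when there are a lot of small functions: more than a person would reasonably write, but not more than a program can generate mechanically.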

You did not say whether you are compiling to native code, but if you are, this might help. It definitely helped with my project.
