c - Big problem with a regular expression in Lex (lexical analyzer)

Question

I have some content like this:

    author = "Marjan Mernik  and Viljem Zumer",
    title = "Implementation of multiple attribute grammar inheritance in the tool LISA",
    year = 1999

    author = "Manfred Broy and Martin Wirsing",
    title = "Generalized
             Heterogeneous Algebras and
             Partial Interpretations",
    year = 1983

    author = "Ikuo Nakata and Masataka Sassa",
    title = "L-Attributed LL(1)-Grammars are
             LR-Attributed",
    journal = "Information Processing Letters"

I need to capture everything between the double quotes of title. My first attempt was this:

^(" "|\t)+"title"" "*=" "*"\"".+"\","

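It catches the first example, but not the other two. Those titles span multiple lines, and that is the problem. I thought of adding a \n somewhere to allow multiple lines, like this: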

^(" "|\t)+"title"" "*=" "*"\""(.|\n)+"\","

But that did not help; instead, it captured everything.

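Then I thought: what I want is between double quotes, so what if I grab everything until I find another " followed by a ,? That way I can tell that I am at the end of the title, no matter how many lines it has, like this: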

^(" "|\t)+"title"" "*=" "*"\""[^"\""]+","

But this has another problem... The examples above do not show it, but a double quote (") can appear inside the title. For example:

title = "aaaaaaa \"X bbbbbb",

Yes, it is always preceded by a backslash (\).

Any suggestions on how to fix this regex?

score 2 · Accepted Answer

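The classic regex for matching a string in double quotes is:

\"([^\"\\]|\\.)*\"

In your case, you would want something like this:

"title"\ *=\ *\"([^\"\\]|\\.)*\"

The backslash is excluded from the character class so that an escape such as \" is always consumed as a pair by \\. rather than one character at a time. In lex a negated character class also matches newlines, so titles that span several lines are covered.

PS: IMHO, you quote too much in your regex, which makes it hard to read.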
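To see what the quoted-string pattern accepts, here is a minimal C sketch that walks the input the same way the expression would. The function name match_quoted is hypothetical. The sketch assumes the form of the pattern that excludes the backslash from the character class, \"([^\"\\]|\\.)*\", and it assumes, as in lex, that '.' does not match a newline.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns the length of the double-quoted string
 * that starts at s[0], or 0 if there is no match.
 * Mirrors the pattern  \"([^\"\\]|\\.)*\"
 */
size_t match_quoted(const char *s)
{
    size_t i = 0;

    if (s[i] != '"')
        return 0;                 /* must start with an opening quote */
    i++;

    for (;;) {
        char c = s[i];

        if (c == '\0')
            return 0;             /* input ended before the closing quote */
        if (c == '"')
            return i + 1;         /* unescaped quote closes the string */
        if (c == '\\') {
            /* \\. : a backslash plus any character except newline */
            if (s[i + 1] == '\0' || s[i + 1] == '\n')
                return 0;
            i += 2;
        } else {
            i++;                  /* [^\"\\] : ordinary character, newline included */
        }
    }
}
```

For the example from the question, match_quoted applied to the text starting at "aaaaaaa \"X bbbbbb", returns 20: the match ends at the closing quote just before the comma, and the escaped quote in the middle does not terminate it.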

score 0 · Accepted Answer

You can use start conditions to simplify each individual pattern, for example:

%x title
%%
"title"\ *=\ *\"  { /* mark title start */
  BEGIN(title);
  fputs("found title = <|", yyout);
}

<title>[^"\\]+ { /* a run of ordinary title characters; ([^"\\]|\\.)* would grab the whole title at once */
  ECHO;
}

<title>\\. { /* process escapes inside title */
  char c = yytext[1]; /* the character after the backslash */
  fputc(c, yyout);    /* emit the escaped character twice */
  fputc(c, yyout);
}

<title>\" { /* mark end of title */
  fputs("|>", yyout);
  BEGIN(INITIAL); /* continue as usual */
}

Build the executable:

$ flex parse_ini.y
$ gcc -o parse_ini lex.yy.c -lfl

Run:

$ ./parse_ini < input.txt

where input.txt contains:

author = "Marjan\" Mernik  and Viljem Zumer",
title = "Imp\"lementation of multiple...",
year = 1999

Output:

author = "Marjan\" Mernik  and Viljem Zumer",
found title = <|Imp""lementation of multiple...|>,
year = 1999

It replaces the '"' around the title with '<|' and '|>'. Also, '\"' inside the title is replaced with '""'.
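For readers who want to see the start-condition logic without lex, the same idea can be sketched as an explicit two-state machine in plain C. This is only an illustration under stated assumptions, not the generated scanner: the names rewrite_titles, match_title_start, put and put_str are hypothetical, and the sketch writes into a caller-supplied buffer instead of yyout.

```c
#include <stddef.h>
#include <string.h>

enum state { ST_INITIAL, ST_TITLE };   /* mirrors INITIAL and <title> */

/* Append n bytes to out, always leaving room for the terminating NUL. */
static void put(char *out, size_t cap, size_t *len, const char *s, size_t n)
{
    for (size_t k = 0; k < n; k++)
        if (*len + 1 < cap)
            out[(*len)++] = s[k];
}

static void put_str(char *out, size_t cap, size_t *len, const char *s)
{
    put(out, cap, len, s, strlen(s));
}

/* Length of a match for  "title"\ *=\ *\"  at s, or 0 if none. */
static size_t match_title_start(const char *s)
{
    size_t i = 5;

    if (strncmp(s, "title", 5) != 0)
        return 0;
    while (s[i] == ' ') i++;
    if (s[i] != '=')
        return 0;
    i++;
    while (s[i] == ' ') i++;
    if (s[i] != '"')
        return 0;
    return i + 1;
}

/* Rewrites in into out (capacity cap); returns the number of bytes written. */
size_t rewrite_titles(const char *in, char *out, size_t cap)
{
    enum state st = ST_INITIAL;
    size_t len = 0;

    if (cap == 0)
        return 0;

    for (size_t i = 0; in[i] != '\0'; ) {
        if (st == ST_INITIAL) {
            size_t n = match_title_start(in + i);
            if (n > 0) {                       /* title start found */
                put_str(out, cap, &len, "found title = <|");
                st = ST_TITLE;
                i += n;
            } else {                           /* default: copy one character */
                put(out, cap, &len, in + i, 1);
                i++;
            }
        } else if (in[i] == '"') {             /* end of title */
            put_str(out, cap, &len, "|>");
            st = ST_INITIAL;
            i++;
        } else if (in[i] == '\\' && in[i + 1] != '\0' && in[i + 1] != '\n') {
            put(out, cap, &len, in + i + 1, 1); /* emit escaped character twice */
            put(out, cap, &len, in + i + 1, 1);
            i += 2;
        } else {                               /* ordinary title character */
            put(out, cap, &len, in + i, 1);
            i++;
        }
    }
    out[len] = '\0';
    return len;
}
```

Each branch corresponds to one rule of the scanner above, and the st variable plays the role of BEGIN. The escaped quote in the author line is left alone because the escape handling is only active while the machine is inside a title.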
