regex - Flex 中的非贪婪正则表达式匹配

Question

我刚开始使用 Flex，似乎无法弄清楚如何匹配以下 Expression ：

"Dog".*"Cat"
------------------
Input :
Dog Ca Cat Cc Cat
------------------
Output:
Dog Ca Cat Cc Cat

但我想要一个非贪婪匹配，输出如下：

Output:
Dog Ca Cat

如何在 Flex 上实现这一点？

编辑

尝试了以下方法：

%%
Dog.*Cat/.*Cat  printf("Matched : ||%s||", yytext);
dog.*cat        printf("Matched : ||%s||", yytext);
dOg[^c]*cAt     printf("Matched : ||%s||", yytext);
DOG.*?CAT       printf("Matched : ||%s||", yytext);
%%

输入：

Dog Ca Cat Cc Cat
dog Ca cat Cc cat
dOg Ca cAt Cc cAt
DOG CA CAT CC CAT

输出：

Matched : ||Dog Ca Cat Cc Cat||
Matched : ||dog Ca cat Cc cat||
Matched : ||dOg Ca cAt|| Cc cAt
Matched : ||DOG CA CAT CC CAT||

还收到警告：

lex4.l:2: warning, dangerous trailing context

弹性版本：

flex 2.5.35 Apple(flex-31)

score 8 · Accepted Answer

在使用 lex/flex 工具时，这是一个很常见的问题，会困扰初学者（有时甚至是非初学者）。该问题有两种解决方案，需要工具的两种不同高级功能。短语 likedog ... cat与在各种编程语言中匹配注释的问题非常相似，例如C注释形式/* ... */甚至'comment' ... 'tnemmoc'. 这些具有与您的示例完全相同的特征。考虑以下C代码：

/* This is a comment */ "This is a String */"

一个贪婪的词法匹配会匹配错误的注释终止符（顺便说一句，这是对学生词法分析器的一个很好的测试！）。

在几门大学编译器课程中有建议的解决方案。在这里（曼彻斯特）可以很好地解释它。其中引用了几本也涵盖了这些问题的好书：

J.Levine、T.Mason 和 D.Brown：Lex 和 Yacc（第 2 版）
MELesk & E.Schmidt：Lex - 词法分析器生成器

所描述的两种技术是使用开始条件来明确指定状态机，或手动输入以直接读取字符。

对于您的cat ... dog问题，可以通过以下方式对其进行编程：

开始条件

在这个解决方案中，我们需要几个状态。关键字dogCauses 使其进入DOG一直持续到c遇到字母的状态。然后进入LETTERC必须跟一个字母的状态a，如果不是，DOG状态继续；一个字母a会导致进入CAT状态，后面必须跟一个字母t，导致整个短语匹配并返回到INITIAL状态。这yymore会导致保留整个dog ... cat文本以供使用。

%x DOG LETTERC CAT
d [dD]
o [oO]
g [gG]
c [cC]
a [aA]
t [tT]
ws [ \t\r\n]+
%%

<INITIAL>{d}{o}{g} {
        BEGIN(DOG);
        printf("DOG\n");
        yymore();
        }
<DOG>[^cC]*{c} {
        printf("C: %s\n",yytext);
        yymore();
        BEGIN(LETTERC);
        }
<LETTERC>{a} {
       printf("A: %s\n",yytext);
       yymore();
       BEGIN(CAT);
      }
<LETTERC>[^aA] {
        BEGIN(DOG);
        yymore();
        }
<CAT>{t} {
        printf("CAT: %s\n",yytext);
        BEGIN(INITIAL);
        }
<CAT>[^tT] {
        BEGIN(DOG);
        yymore();
        }
<INITIAL>{ws}  /* skip */ ;

手动输入

手动输入法仅匹配起始短语dog和输入C代码，该代码会吞噬输入字符，直到cat遇到所需的序列。（我没有为大写和小写字母而烦恼）。这个解决方案的问题是很难将输入文本值保留在 yytext 中以供以后在解析器中使用。它会丢弃它，如果构造是注释，这将是可以的，但否则没有那么有用。

d [dD]
o [oO]
g [gG]
ws [ \t\r\n]+
%%
{d}{o}{g}   {
   register int c;

                     for ( ; ; )
                         {
                         /* Not dealt with upper case .. left as an exercise */
                         while ( (c = input()) != 'c' &&
                                 c != EOF )
                             ;    /* eat up text of dog */

                         if ( c == 'c' )
                             {
                              if ( ( c = input()) == 'a' )
                                     if ( (c = input()) == 't' )
                                 break;    /* found the end */
                             }
                        if ( c == EOF )
                             {
                             REJECT;
                             break;
                             }
                         }
            /* because we have used input() yytext always contains "dog" */
            printf("cat: %s\n", yytext);
       }
{ws}  /* skip */ ;

（这两种解决方案都已经过测试）

score 2 · Accepted Answer

一个好问题。这是一个纯正则表达式解决方案，不使用非贪婪.*?语法：

Dog([^C]|C+(aC+)*([^Ca]|a[^Ct]))*C+(aC+)*at

score 0 · Accepted Answer

这是针对此问题的最小 C++ 弹性词法分析器。非贪婪匹配的关键是 flex 手册和其他地方提到的开始条件。

开始条件只是词法分析器的另一种状态。当需要非贪婪匹配时，有一些模式需要在第一次出现时终止匹配

一般来说，如果您正在寻找目标字符串或模式，无论状态如何，您只需要确保没有其他更通用的模式可以匹配包含目标模式的更长输入

当目标模式是有条件的并且需要在一些较早的匹配后启用时，开始条件会有所帮助。您打开开始条件以启用匹配目标模式并通过将状态重置为 0 或INITIAL- 或切换到另一个状态以进行更多条件匹配来关闭它

状态切换BEGIN- 还有一个状态堆栈用于通过yy_push_state和yy_pop_state

flex手册中有很多启动条件的例子

以下是显示与 flex 开始条件的非贪婪匹配的 flex 规则 - 词法分析器匹配一行上第一次出现的 dog 直到第一次出现 cat - 匹配不区分大小写

完整的文件发布在最后——对于不熟悉 flex 的人，请注意许多行以空格开头——这不是偶然的，而且是 flex 所必需的

%%
 /* flex rules section */

 string match;

dog {
// found a dog, change state to HAVE_DOG to start looking for a cat
 BEGIN(HAVE_DOG);
// save the found dog
 match = yytext;
}

 /* save and keep going till cat is found */
<HAVE_DOG>. match += yytext;

<HAVE_DOG>cat {
// save the found cat
 match += yytext;
// output the matched dog and cat
 cout << match << "\n";
// ignore rest of line
 BEGIN(SKIP_LINE);
}

 /* no cat on this line, reset state */
<HAVE_DOG>\n BEGIN(0);

 /* rules to ignore rest of the line then reset state */
<SKIP_LINE>{
  .*
  \n BEGIN(0);
}

 /* nothing to do yet */
.|\n

这是一些测试输入

$ cat dogcat.in.txt

Dog Ca Cat Cc Cat
dog Ca cat Cc cat
dOg Ca cAt Cc cAt
DOG CA CAT CC CAT
cat dog dog cat cat
dog kitten cat dog cat
dig cat dog can dog cut
dig dug dog cow cat cat
doc dogcat catdog
dog dog dog
cat cat cat

构建

flex -o dogcat.flex.cpp dogcat.flex.l && g++ -o dogcat dogcat.flex.cpp

运行

$ ./dogcat < dogcat.in.txt

Dog Ca Cat
dog Ca cat
dOg Ca cAt
DOG CA CAT
dog dog cat
dog kitten cat
dog cow cat
dogcat

完整的 flex 文件

 /* dogcat.flex.l */

 /*
 Build with:
 flex -o dogcat.flex.cpp dogcat.flex.l && g++ -o dogcat dogcat.flex.cpp
 */

 /*
 A minimal C++ flex lexer that shows nongreedy matching with flex
 start conditions
 matches the first occurrence of dog on a line till the first
 occurrence of cat 
 matching is case insensitive
 */

 /* C++ lexer using yyFlexLexer in FlexLexer.h */
%option c++

 /* case-insensitive patterns */
%option case-insensitive

 /* generate main function for executable */
%option main

 /* all input must be matched, no echo by default */
%option nodefault

 /* debug output with lexer.set_debug(1) */
%option debug

 /* start condition means dog was matched */
%x HAVE_DOG
 /* start condition means to ignore remaining line */
%x SKIP_LINE

%{

#include <string>
#include <iostream>

// C++ flex lexer class
// needed because header itself has no guard
#ifndef yyFlexLexerOnce
#  include <FlexLexer.h>
#endif

using namespace std;

namespace {
// the C++ lexer class from flex
  yyFlexLexer lexer;

// main generated by flex still calls free yylex function even for C++ lexer
  int yylex() {
    return lexer.yylex();
  }
}

%}

%%
 /* flex rules section */

 string match;

dog {
// found a dog, change state to HAVE_DOG to start looking for a cat
 BEGIN(HAVE_DOG);
// save the found dog
 match = yytext;
}

 /* save and keep going till cat is found */
<HAVE_DOG>. match += yytext;

<HAVE_DOG>cat {
// save the found cat
 match += yytext;
// output the matched dog and cat
 cout << match << "\n";
// ignore rest of line
 BEGIN(SKIP_LINE);
}

 /* no cat on this line, reset state */
<HAVE_DOG>\n BEGIN(0);

 /* rules to ignore rest of the line then reset state */
<SKIP_LINE>{
  .*
  \n BEGIN(0);
}

 /* nothing to do yet */
.|\n

regex - Flex 中的非贪婪正则表达式匹配

3 回答 3

开始条件

手动输入

Related

Reference