parsing - Implementing word-boundary states in flex/lex (parser-generator)

Question

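I'd like to be able to tell whether a pattern match occurs after a word character or after a non-word character. In other words, I want to simulate the \b word-break regex character at the beginning of a pattern, which flex/lex doesn't support.

Here's my attempt below (it does not work as expected):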

%{
#include <stdio.h>
%}

%x inword
%x nonword

%%
[a-zA-Z]    { BEGIN inword; yymore(); }
[^a-zA-Z]   { BEGIN nonword; yymore(); }

<inword>a { printf("'a' in word\n"); }
<nonword>a { printf("'a' not in word\n"); }

%%

Input:

a
ba
a

Expected output:

'a' not in word
'a' in word
'a' not in word

Actual output:

a
'a' in word
'a' in word

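I'm doing this because I want to write something like a dialectizer, and I've always wanted to learn how to use a real lexer. Sometimes the patterns I want to replace need to be fragments of words, and sometimes they need to be whole words only.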

score 3 · Accepted Answer

Here's what accomplished what I wanted:

%{
#include <stdio.h>
%}

WC      [A-Za-z']
NW      [^A-Za-z']

%start      INW NIW

%%
{WC}  { BEGIN INW; REJECT; }
{NW}  { BEGIN NIW; REJECT; }

<INW>a { printf("'a' in word\n"); }
<NIW>a { printf("'a' not in word\n"); }

This way I can do the equivalent of \B or \b at the beginning or end of any pattern. You can match at the end by doing a/{WC} or a/{NW}.

I wanted to set up the states without consuming any characters. The trick is using REJECT rather than yymore(), which I guess I didn't fully understand.
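
For context on that last point: REJECT makes the scanner fall through to the next-best rule for the same input, whereas yymore() keeps the matched text and prepends it to the next token, so the character is still consumed. What the two start conditions track is a single bit: was the previous character a word character? Below is a minimal sketch of that idea in plain C (not flex), only to make the state logic explicit. The names is_word_char and count_a_by_context are made up for this illustration, and treating the start of the input as a non-word position is an assumption; the flex scanner starts in INITIAL, so its handling of the very first character may differ.

```c
#include <stddef.h>

/* Hypothetical helper: the same class as WC above, [A-Za-z']. */
int is_word_char(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'';
}

/*
 * Hypothetical illustration: count every 'a' by the class of the
 * character before it. prev_in_word plays the role of INW/NIW.
 * Assumption: the start of the input counts as a non-word position.
 */
void count_a_by_context(const char *s, size_t *in_word, size_t *not_in_word)
{
    int prev_in_word = 0;

    *in_word = 0;
    *not_in_word = 0;
    for (; *s != '\0'; s++) {
        if (*s == 'a') {
            if (prev_in_word)
                ++*in_word;
            else
                ++*not_in_word;
        }
        /* Update the state only after the current character is classified. */
        prev_in_word = is_word_char((unsigned char)*s);
    }
}
```

Called on the question's input ("a\nba\na\n"), this reports one 'a' in a word and two not in a word, in line with the expected output listed in the question.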

score 1 · Answer

%%
[a-zA-Z]+a[a-zA-Z]* {printf("a in word: %s\n", yytext);}
a[a-zA-Z]+ {printf("a in word: %s\n", yytext);}
a {printf("a not in word\n");}
. ;

Testing:

user@cody /tmp $ ./a.out <<EOF
> a
> ba
> ab
> a
> EOF
a not in word

a in word: ba

a in word: ab

a not in word
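
These rules decide per word rather than per character: a maximal run of letters is reported once if it contains an 'a', and a run that is exactly "a" counts as standalone. A rough plain-C sketch of that classification follows, under the assumption that "letter" means ASCII [a-zA-Z] as in the rules above. The names a_counts, is_letter and classify_letter_runs are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical result type: how many words fell into each bucket. */
struct a_counts {
    size_t in_word;     /* letter runs longer than one that contain an 'a' */
    size_t standalone;  /* letter runs that are exactly "a" */
};

/* Hypothetical helper: ASCII letters only, like [a-zA-Z] in the rules. */
int is_letter(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Walk the input one maximal letter run at a time and classify each run. */
struct a_counts classify_letter_runs(const char *s)
{
    struct a_counts n = {0, 0};

    while (*s != '\0') {
        if (!is_letter((unsigned char)*s)) {
            s++;                    /* skip separators, like the `.` rule */
            continue;
        }
        const char *start = s;
        int has_a = 0;
        while (is_letter((unsigned char)*s)) {
            if (*s == 'a')
                has_a = 1;
            s++;
        }
        if (has_a) {
            if (s - start == 1)
                n.standalone++;
            else
                n.in_word++;
        }
    }
    return n;
}
```

For the test input above ("a\nba\nab\na\n") this gives two standalone matches and two in-word matches, which agrees with the transcript.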
