python - How can I exclude certain possibilities from a regular expression?

Question

For a parser I'm creating, I use this regular expression as the definition of an ID:

ID: /[a-z_][a-z0-9]*/i

(For anyone unfamiliar with the particular parser syntax I'm using, the "i" flag just means case-insensitive.)

I also have some keywords, like this:

CALL_KW: "call"
PRINT_KW: "print"

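The problem is that, because of some ambiguity in the grammar, keywords are sometimes treated as IDs, and I really don't want them to be. So I'm wondering whether I can rewrite the regex for ID so that keywords don't match it at all. Is something like that possible?

To give some more context, I'm using the Lark parser library for Python. The Earley parser Lark provides (together with the dynamic lexer) is very flexible and powerful at handling ambiguous grammars, but it sometimes does odd things like this (and non-deterministically, at that!). So I'm trying to give the parser some help here by making keywords never match the ID rule.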
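
The overlap described in the question can be reproduced outside the parser. This is a minimal sketch using Python's `re` module rather than Lark itself; `ORIGINAL_ID` and `overlaps` are illustrative names, and `re.IGNORECASE` is assumed to correspond to the grammar's "i" flag.

```python
import re

# The ID definition from the question; re.IGNORECASE stands in for the "i" flag.
ORIGINAL_ID = re.compile(r"[a-z_][a-z0-9]*", re.IGNORECASE)

# Check whether each keyword is also a complete match for ID.
overlaps = {kw: ORIGINAL_ID.fullmatch(kw) is not None for kw in ["call", "print"]}
print(overlaps)
```

Both keywords are complete matches for ID, so a lexer that considers every possible terminal can legitimately pick either one.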

score 2 · Accepted Answer

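I believe Lark uses ordinary Python regular expressions, so you can use a negative lookahead assertion to exclude the keywords. But you have to take care not to reject names that merely start with a keyword, which is what the \b inside the lookahead is for: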

ID: /(?!(else|call)\b)[a-z_][a-z0-9]*/i

This regex certainly works in Python 3:

>>> import re
>>> # Test with just the word
>>> for test_string in ["x", "xelse", "elsex", "else"]:
...   m = re.match(r"(?!(else|call)\b)[a-z_][a-z0-9]*", test_string)
...   if m: print("%s: Matched %s" % (test_string, m.group(0)))
...   else: print("%s: No match" % test_string)
... 
x: Matched x
xelse: Matched xelse
elsex: Matched elsex
else: No match

>>> # Test with the word as the first word in a string
... for test_string in [word + " and more stuff" for word in ["x", "xelse", "elsex", "else"]]:
...   m = re.match(r"(?!(else|call)\b)[a-z_][a-z0-9]*", test_string)
...   if m: print("%s: Matched %s" % (test_string, m.group(0)))
...   else: print("%s: No match" % test_string)
... 
x and more stuff: Matched x
xelse and more stuff: Matched xelse
elsex and more stuff: Matched elsex
else and more stuff: No match
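
When the keyword list grows, the lookahead can be generated instead of written by hand. The following is a sketch under stated assumptions: `KEYWORDS`, `build_id_pattern`, and `is_id` are hypothetical helper names, the keyword list is the one from the question, and nothing here is part of Lark's API.

```python
import re

# Keywords named in the question; extend as needed.
KEYWORDS = ["call", "print"]

def build_id_pattern(keywords):
    """Return a compiled ID regex that refuses to match a whole keyword."""
    # re.escape guards against keywords that contain regex metacharacters.
    alternatives = "|".join(re.escape(kw) for kw in keywords)
    # The lookahead rejects only when a keyword is followed by a word boundary,
    # so names that merely start with a keyword are still accepted.
    return re.compile(r"(?!(?:%s)\b)[a-z_][a-z0-9]*" % alternatives, re.IGNORECASE)

ID_RE = build_id_pattern(KEYWORDS)

def is_id(text):
    """True if the entire string is an ID and not a keyword."""
    return ID_RE.fullmatch(text) is not None

for word in ["call", "CALL", "print", "caller", "printx", "x"]:
    print(word, is_id(word))
```

One caveat with this design: \b treats the underscore as a word character, while the ID character class above does not allow an underscore after the first character, so the two notions of "end of word" differ slightly for input such as "call_".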

score 0 · Answer

There are a few ways to keep similar values from being passed to your ID.

RegEx 1

For example, you could use a capturing group in your expression, perhaps something like:

    ([a-z]+_[a-z0-9]+)

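RegEx Circuit

This link can help you visualize your expressions:

RegEx 2

Another way is to bound your expression on the right with a colon (:), and then you could use an expression similar to: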

(\w+):

Or your original expression with the i flag:

([a-z0-9_]+):

You can add more boundaries to it if you wish.
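
As a small illustration of the colon-bounded expression, here is a sketch in Python's `re` module. The sample `line` is a hypothetical input modeled on the grammar snippets in the question, and `name` is an illustrative variable.

```python
import re

# Hypothetical input line, modeled on the grammar snippets in the question.
line = 'CALL_KW: "call"'

# (\w+): captures a run of word characters only when a colon follows it.
match = re.match(r"(\w+):", line)
name = match.group(1) if match else None
print(name)
```

Note that this captures whatever precedes the colon; it does not by itself distinguish keywords from IDs.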
