python - Python regular expression for Java annotations

Question

I am trying to detect valid Java annotations in a text. Here is my test program (for simplicity I currently ignore all whitespace; I will add that later):

import re

txts = ['@SomeName2',                   # match
        '@SomeName2(',                  # no match
        '@SomeName2)',                  # no match 
        '@SomeName2()',                 # match
        '@SomeName2()()',               # no match
        '@SomeName2(value)',            # no match
        '@SomeName2(=)',                # no match
        '@SomeName2("")',               # match
        '@SomeName2(".")',              # no match
        '@SomeName2(",")',              # match
        '@SomeName2(value=)',           # no match
        '@SomeName2(value=")',          # no match
        '@SomeName2(=3)',               # no match
        '@SomeName2(="")',              # no match
        '@SomeName2(value=3)',          # match
        '@SomeName2(value=3L)',         # match
        '@SomeName2(value="")',         # match
        '@SomeName2(value=true)',       # match
        '@SomeName2(value=false)',      # match
        '@SomeName2(value=".")',        # no match
        '@SomeName2(value=",")',        # match
        '@SomeName2(x="o_nbr ASC, a")', # match

        # multiple params:
        '@SomeName2(,value="ord_nbr ASC, name")',                            # no match
        '@SomeName2(value="ord_nbr ASC, name",)',                            # no match
        '@SomeName2(value="ord_nbr ASC, name"insertable=false)',             # no match
        '@SomeName2(value="ord_nbr ASC, name",insertable=false)',            # match
        '@SomeName2(value="ord_nbr ASC, name",insertable=false,length=10L)', # match

        '@SomeName2 ( "ord_nbr ASC, name", insertable = false, length = 10L )',       # match
       ]


#regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?\))?$'
#regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?(,((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))*\))?$'

regex = r"""
    (?:@[a-z]\w*)                               # @ + identifier (class name)
    (
      \(                                        # opening parenthesis
        (
          (?:[a-z]\w*)                          # identifier (var name)
          =                                     # assignment operator
          (\d+l?|"(?:[a-z0-9_, ]*)"|true|false) # either a numeric | a quoted string containing only alphanumeric chars, _, comma, space | true | false
        )?                                      # optional assignment group
      \)                                        # closing parenthesis
    )?$                                         # optional parentheses group (zero or one)
    """


rg = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for txt in txts:
    m = rg.search(txt)
    #m = rg.match(txt)
    if m:
        # Show the first two capture groups of a successful match
        output = ''
        for i in range(2):
            output = output + '[' + str(m.group(i+1)) + ']'
        print("MATCH:   ", output)
    else:
        print("NO MATCH: " + txt)

So basically what I have seems to work for zero or one parameter. Now I am trying to extend the syntax to zero or more parameters, as in the last example.

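I then copied the part of the regex that represents an assignment and prefixed it with a comma for the 2nd to nth group (that group now uses * instead of ?):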

regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?(,((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))*\))?$'

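However, that does not work. The problem seems to be how to handle the first element: because it has to be optional, a string like the first extension example '@SomeName2(,value="ord_nbr ASC, name")' is accepted, which is wrong. I don't know how to make the 2nd to nth assignments depend on the presence of the first (optional) element.

Can it be done? Is this the way to do it? How would you best solve this?

Thanks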

score 2 · Accepted Answer

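If you only want to detect valid syntax, I believe the regex below will give you the matches you want. But I'm not sure what you are doing with the groups. Do you also want each parameter value in its own group? That would be harder, and I'm not even sure it is possible with a regex.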

regex = r'((?:@[a-z][a-z0-9_]*))(?:\((?!,)(?:(([a-z][a-z0-9_]*(=)(?:("[a-z0-9_, ]*")|(true|false)|(\d+l?))))(?!,\)),?)*\)(?!\()|$)'

If you need the individual parameters/values, you will probably have to write a real parser for that.

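Edit: Here is a commented version. I also removed many of the capturing and non-capturing groups to make it easier to understand. If you use it with re.findall(), it returns two groups: the function name and all the parameters in parentheses: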

regex = r'''
(@[a-z][a-z0-9_]*) # function name, captured in group
(                  # open capture group for all parameters
\(                 # opening function parenthesis 
  (?!,)            # negative lookahead for unwanted comma
  (?:              # open non-capturing group for all params
  [a-z][a-z0-9_]*  # parameter name
  =                # parameter assignment operator
  (?:"[a-z0-9_, ]*"|true|false|(?:\d+l?)) # possible parameter values
  (?!,\))          # negative lookahead for unwanted comma and closing parenthesis
  ,?               # optional comma, separating params
  )*               # close param non-capturing group, make it optional
\)                 # closing function parenthesis 
(?!\(\))           # negative lookahead for empty parentheses
|$                 # OR end-of-line (in case there are no params)
)                  # close capture group for all parameters
'''

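After reading your comment about the parameters, the simplest thing is probably to extract all the parameters with the regex above, then write another regex that extracts the name/value pairs so you can process them as you wish. That is tricky too, though, because the parameter values can contain commas. I'll leave that as an exercise for the reader :)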
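
One possible take on that exercise, as a minimal sketch rather than code from this answer: once the parenthesised parameter text has been extracted and validated, re.finditer() can walk over the name=value pairs. A quoted string is matched as a single unit, so commas inside it are never treated as separators. The names PAIR and split_params are hypothetical, and the value syntax is assumed to be the same one the regex above accepts.

```python
import re

# Hypothetical helper pattern: one name=value pair, using the same value
# syntax as the validation regex (quoted string, true/false, or an integer
# with an optional L suffix).
PAIR = re.compile(r'''
    ([a-z][a-z0-9_]*)                   # parameter name
    =                                   # assignment operator
    ("[a-z0-9_, ]*"|true|false|\d+l?)   # parameter value
''', re.VERBOSE | re.IGNORECASE)


def split_params(params):
    """Return (name, value) tuples from an already-validated parameter list."""
    return [(m.group(1), m.group(2)) for m in PAIR.finditer(params)]


pairs = split_params('(value="ord_nbr ASC, name",insertable=false,length=10L)')
for name, value in pairs:
    print(name, value)
```

This only splits input that has already passed validation; it does not check the commas between pairs itself.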

score 1 · Accepted Answer

Use the re.VERBOSE flag

You are doing something interesting here. This is your original regex:

regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?\))?$'

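For a start, use the re.VERBOSE flag so that you can split the regex across multiple lines. With this flag, whitespace and comments inside the regex do not affect its meaning, so you can document what the regex is trying to do.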

regex = re.compile(r"""
((?:@[a-z][a-z0-9_]*))     # Match starting symbol, @-sign followed by a word
(\(
    (((?:[a-z][a-z0-9_]*))                     # Match arguments??
    (=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))? # ?????
\))?$
""", re.VERBOSE + re.IGNORECASE)

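Since you have not documented what this regex is trying to do, I cannot break it down any further. Use re.VERBOSE to document the intent of any non-trivial regex: split it across multiple lines and comment it.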

Break the problem into manageable parts

Your regex is hard to understand because it tries to do too much. As it stands, it is trying to do two things:

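Match a symbol name of the form @SomeSymbol2, optionally followed by a parenthesised argument list, (arg1="val1",arg2="val2"...)
Validate the contents of the parenthesised argument list, so that (arg1="val1",arg2="val2") passes but (232,211) does not.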

I suggest splitting this into two parts, like so:

import re
import pprint

txts = [
        '@SomeName2',              # match
        '@SomeName2(',             # no match
        '@SomeName2)',             # no match 
        '@SomeName2()',            # match
        '@SomeName2()()',          # no match
        '@SomeName2(value)',       # no match
        '@SomeName2(=)',           # no match
        '@SomeName2("")',          # no match
        '@SomeName2(value=)',      # no match
        '@SomeName2(value=")',     # no match
        '@SomeName2(=3)',          # no match
        '@SomeName2(="")',         # no match
        '@SomeName2(value=3)',     # match
        '@SomeName2(value=3L)',    # match
        '@SomeName2(value="")',    # match
        '@SomeName2(value=true)',  # match
        '@SomeName2(value=false)', # match
        '@SomeName2(value=".")',   # no match
        '@SomeName2(value=",")',   # match
        '@SomeName2(value="ord_nbr ASC, name")', # match

        # extension needed!:
        '@SomeName2(,value="ord_nbr ASC, name")', # no match
        '@SomeName2(value="ord_nbr ASC, name",)', # no match
        '@SomeName2(value="ord_nbr ASC, name",insertable=false)'
        ] # no match YET, but should

# Regular expression to match overall @symbolname(parenthesised stuff)
regex_1 = re.compile( r"""
^                   # Start of string
(@[a-zA-Z]\w*)      # Matches initial token. Token name must start with a letter.
                    # Subsequent characters can be any of those matched by \w, i.e. [a-zA-Z0-9_]
                    # Note: in Python 3, \w also matches Unicode word characters unless re.ASCII is set.
( \( [^)]* \) )?    # Optionally, match parenthesised part containing zero or more characters
$                   # End of string
""", re.VERBOSE)

#Regular expression to validate contents of parentheses
regex_2 = re.compile( r"""
^
(
    ([a-zA-Z]\w*)       # argument key name (i.e. 'value' in the examples above)
    =                   # literal equals symbol
    (                   # acceptable arguments are:
        true  |         # literal "true"
        false |         # literal "false"
        \d+L? |         # integer (optionally followed by an 'L')
        "[^"]*"         # string (may not contain quote marks!)
    )
    \s*,?\s*            # optional comma and whitespace
)*                      # Match this entire regex zero or more times
$
""", re.VERBOSE)

for line in txts:
    print("\n")
    print(line)
    m1 = regex_1.search(line)

    if m1:
        annotation_name, annotation_args = m1.groups()

        print("Symbol name   : ", annotation_name)
        print("Argument list : ", annotation_args)

        if annotation_args:
            # Drop the surrounding parentheses before validating the contents
            s2 = annotation_args.strip("()")
            m2 = regex_2.search(s2)
            if m2:
                pprint.pprint(m2.groups())
                print("MATCH")
            else:
                print("MATCH FAILED: regex_2 didn't match. Contents of parentheses were invalid.")
        else:
            print("MATCH")

    else:
        print("MATCH FAILED: regex_1 didn't match.")

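This gets you almost all the way to a final solution. The only corner case I can see is that it (incorrectly) accepts a trailing comma in the argument list. (You can check for that with a simple string operation, str.endswith().)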
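
The trailing-comma check can be sketched as below. As a side observation, because the comma in regex_2 is optional, two arguments with no comma between them would also pass. The alternative pattern here (an assumption, not part of the original answer) spells the list out as one argument followed by zero or more comma-prefixed arguments, which rules out leading, trailing, and missing commas at once. The names ARG, ARG_LIST, valid_args and has_trailing_comma are hypothetical.

```python
import re

# One key=value argument, using the value syntax from regex_2.
ARG = r'[a-zA-Z]\w*=(?:true|false|\d+L?|"[^"]*")'

# Either an empty list, or one argument followed by any number of
# comma-prefixed arguments. The comma is mandatory between arguments.
ARG_LIST = re.compile(r'^\s*(?:%s(?:\s*,\s*%s)*)?\s*$' % (ARG, ARG))


def valid_args(arg_string):
    """Validate the text between the parentheses (parentheses already stripped)."""
    return ARG_LIST.match(arg_string) is not None


def has_trailing_comma(arg_string):
    """The simpler post-check: run this after regex_2 has matched."""
    return arg_string.rstrip().endswith(",")


samples = ['', 'value=3', 'value="a, b",insertable=false',
           'value="a, b",', ',value=3', 'value="a"insertable=false']
for s in samples:
    print(repr(s), valid_args(s), has_trailing_comma(s))
```

The str.endswith() check keeps regex_2 untouched, while the explicit-separator pattern moves the whole rule into the regex. Which one reads better is a matter of taste.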

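Afterthought edit: the syntax of the argument list is actually very close to a real data format. You could probably feed argument_list to a JSON or YAML parser, and it would tell you whether it is valid. If you can, use an existing wheel (a JSON parser) rather than reinventing it.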

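Among other things, this would allow:

Recognising all the argument types supported by JavaScript, including floating-point numbers and so on.
Supporting escaped quotes in strings. Right now the regex breaks on "This is a quote mark: \"." because it thinks the second quote mark ends the string. (It doesn't.)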

This can be done in a regex, but it is both scary and complicated.
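
For the string part alone, a commonly used pattern is fairly compact. The sketch below is an assumption about what such a regex could look like, not code from this answer: it treats a backslash plus the character after it as one unit, so \" no longer ends the string. STRING and NAIVE are hypothetical names, with NAIVE standing in for the string rule used in regex_2.

```python
import re

# A double-quoted string whose body is any mix of
#   - characters that are neither a quote nor a backslash, and
#   - backslash escapes (a backslash followed by any one character).
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')

# The simpler rule from regex_2, for comparison.
NAIVE = re.compile(r'"[^"]*"')

text = r'value="This is a quote mark: \"."'

print(STRING.search(text).group(0))  # the whole quoted string
print(NAIVE.search(text).group(0))   # stops at the escaped quote
```

Full Java string literals have more rules than this (for example which escape sequences are legal), so this only addresses the escaped-quote case.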
