c# - What is a better way to match groups in a regular expression by group name?

Question

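I have read "How do I get the name of captured groups in a C# Regex?" and "How do I access named capturing groups in a .NET Regex?" to try to understand how to find the result of a matched group in a regular expression.

I have also read everything on MSDN at http://msdn.microsoft.com/en-us/library/30wbz966.aspx

What seems strange to me is that C# (or .NET) appears to be the only regular expression implementation that makes you iterate the groups to find which one matched (especially if you need the name), and that the name isn't stored with the group result. PHP and Python, for example, give you the matched group name as part of the regex match result.

I have to iterate the groups and check each one for a match, and I have to keep my own list of group names because the names are not in the result.

Here is my code to demonstrate: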

public class Tokenizer
{
    private Dictionary<string, string> tokens;

    private Regex re;

    public Tokenizer()
    {
        tokens = new Dictionary<string, string>();
        tokens["NUMBER"] = @"\d+(\.\d*)?";  // Integer or decimal number
        tokens["STRING"] = @""".*""";       // String
        tokens["COMMENT"] = @";.*";         // Comment
        tokens["COMMAND"] = @"[A-Za-z]+";   // Identifiers
        tokens["NEWLINE"] = @"\n";          // Line endings
        tokens["SKIP"] = @"[ \t]";          // Skip over spaces and tabs

        List<string> token_regex = new List<string>();
        foreach (KeyValuePair<string, string> pair in tokens)
        {
            token_regex.Add(String.Format("(?<{0}>{1})", pair.Key, pair.Value));
        }
        string tok_regex = String.Join("|", token_regex);

        re = new Regex(tok_regex);
    }

    public List<Token> parse(string pSource)
    {
        List<Token> tokens = new List<Token>();

        Match get_token = re.Match(pSource);
        while (get_token.Success)
        {
            foreach (string gname in this.tokens.Keys)
            {
                Group group = get_token.Groups[gname];
                if (group.Success)
                {
                    tokens.Add(new Token(gname, get_token.Groups[gname].Value));
                    break;
                }
            }

            get_token = get_token.NextMatch();
        }
        return tokens;
    }
}

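The loop on the line

foreach (string gname in this.tokens.Keys)

shouldn't be necessary, but it is.

Is there any way to find the matched group and its name without iterating over all the groups?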

Edit: To compare implementations, here is the same code as I wrote it for a Python implementation.

import re

class xTokenizer(object):
    """
    xTokenizer converts a text source code file into a collection of xToken objects.
    """

    TOKENS = [
        ('NUMBER',  r'\d+(\.\d*)?'),    # Integer or decimal number
        ('STRING',  r'".*"'),           # String
        ('COMMENT', r';.*'),            # Comment
        ('VAR',     r':[A-Za-z]+'),     # Variables
        ('COMMAND', r'[A-Za-z]+'),      # Identifiers
        ('OP',      r'[+*\/\-]'),       # Arithmetic operators
        ('NEWLINE', r'\n'),             # Line endings
        ('SKIP',    r'[ \t]'),          # Skip over spaces and tabs
        ('SLIST',   r'\['),             # Start a list of commands
        ('ELIST',   r'\]'),             # End a list of commands
        ('SARRAY',  r'\{'),             # Start an array
        ('EARRAY',  r'\}'),             # End an array
    ]

    def __init__(self, tokens=None):
        """
        Constructor
            Args:
                tokens - list of (name, pattern) pairs used to match tokens.
        """
        if tokens is None:
            tokens = self.TOKENS
        self.tokens = tokens
        self.tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in tokens)

    def parse(self,source):
        """
        Converts the source code into a list of xToken objects.
            Args:
                source - The source code as a string.
            Returns:
                list of xToken objects.
        """
        get_token = re.compile(self.tok_regex).match
        line = 1
        pos = line_start = 0
        mo = get_token(source)
        result = []
        while mo is not None:
            typ = mo.lastgroup
            if typ == 'NEWLINE':
                line_start = pos
                line += 1
            elif typ != 'SKIP':
                val = mo.group(typ)
                result.append(xToken(typ, val, line, mo.start()-line_start))
            pos = mo.end()
            mo = get_token(source, pos)
        if pos != len(source):
            raise xParserError('Unexpected character %r on line %d' %(source[pos], line))
        return result

As you can see, Python doesn't require you to iterate the groups, and a similar thing can be done in PHP and, I assume, Java.

score 1 · Accepted Answer

All your token types start with different characters. How about building a Dictionary<char, string> that maps every possible start character to the matching group name? That way you only have to examine the first character of the entire match to figure out which group was matched.
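A rough sketch of that idea, using the six token types from the question's C# code. The class name FirstCharLookup and its members are made up for illustration. I have also written the digits as [0-9] rather than \d, because \d in .NET matches non-ASCII digits too and the table below only covers ASCII, so treat that as an assumption about the input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: identifies the token type from the first character
// of each match instead of probing every named group.
public static class FirstCharLookup
{
    // Same alternation as in the question, but without named groups.
    // Digits are spelled [0-9] so that the lookup table stays ASCII-only.
    private static readonly Regex Pattern =
        new Regex(@"[0-9]+(?:\.[0-9]*)?|"".*""|;.*|[A-Za-z]+|\n|[ \t]");

    // Maps every character that can start a match to its token name.
    private static readonly Dictionary<char, string> StartChars = BuildTable();

    private static Dictionary<char, string> BuildTable()
    {
        var table = new Dictionary<char, string>();
        for (char c = '0'; c <= '9'; c++) table[c] = "NUMBER";
        for (char c = 'A'; c <= 'Z'; c++) table[c] = "COMMAND";
        for (char c = 'a'; c <= 'z'; c++) table[c] = "COMMAND";
        table['"'] = "STRING";
        table[';'] = "COMMENT";
        table['\n'] = "NEWLINE";
        table[' '] = "SKIP";
        table['\t'] = "SKIP";
        return table;
    }

    // Returns (token name, matched text) pairs in source order.
    public static List<KeyValuePair<string, string>> Tokenize(string source)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(source))
        {
            // Every alternative consumes at least one character,
            // so m.Value[0] is always valid here.
            string name = StartChars[m.Value[0]];
            result.Add(new KeyValuePair<string, string>(name, m.Value));
        }
        return result;
    }
}
```

This only works as long as no two token types can begin with the same character. If you later add a token that shares a start character with an existing one, the table can no longer tell them apart and you are back to checking groups.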

score 1 · Accepted Answer

There's no need to maintain a separate list of named groups. Use the Regex.GetGroupNames method instead.

Your code would then look similar to this:

foreach (string gname in re.GetGroupNames())
{
    Group group = get_token.Groups[gname];
    if (group.Success)
    {
        // your code
    }
}

That said, be aware of this note on the MSDN page:

Even if capturing groups are not explicitly named, they are automatically assigned numerical names (1, 2, 3, and so on).

With that in mind, you should either name all your groups or filter out the numeric group names. You could do so with some LINQ, or by adding a !Char.IsNumber(gname[0]) check on the first character of the group name, on the assumption that any name starting with a digit is an automatically assigned one. Alternatively, you could use the int.TryParse method.
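Here is a minimal sketch of the filtering, using LINQ together with int.TryParse. The class and method names (NamedGroupFinder, NamedGroups, MatchedGroupName) are invented for this example, and the `out _` discard assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper built on Regex.GetGroupNames.
public static class NamedGroupFinder
{
    // Returns only the explicitly named groups, dropping the automatic
    // numeric names such as "0" (the whole match) and "1".
    public static string[] NamedGroups(Regex re)
    {
        return re.GetGroupNames()
                 .Where(name => !int.TryParse(name, out _))
                 .ToArray();
    }

    // Returns the name of the first named group that took part in the
    // match, or null if none of them did.
    public static string MatchedGroupName(Match m, IEnumerable<string> names)
    {
        foreach (string name in names)
        {
            if (m.Groups[name].Success)
            {
                return name;
            }
        }
        return null;
    }
}
```

The filter matters for the regex in the question: GetGroupNames also returns "0", and group 0 succeeds for every successful match, so an unfiltered loop would report "0" before it ever reached a token name. The NUMBER pattern additionally contains an unnamed group, (\.\d*)?, which shows up as "1". You would call NamedGroups once, for example in the constructor, and reuse the array for every match.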
