c# - Matching source code keywords

Question

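I want to match C# source code keywords with a regular expression. Say I have the "new" keyword. I want to match every "new" keyword that is not inside "" (a string literal), // (a line comment), or /* */ (a block comment).

So far I have written: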

\b[^\w@]new\b

But it does not work for:

new[]
var a = new[] { "bla" };
var string = "new"
foo(); // new
/* new */

How can I improve that regex?

score 2 · Accepted Answer

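Description

It is easier to capture all the undesirable matches along with the good ones. Then, in your program logic, test which capture group was populated; if the keyword group is populated, the match is one you want.

This expression will:

avoid all single- and double-quoted blocks of text, such as "new" or 'new'
avoid all block comment sections, such as /* new */
avoid all single-line comments, such as // new
capture any keyword that is not quoted or commented, such as new, var, and foo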

(\/\*(?:(?!\*\/).)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|'[^']*')|(new|var|foo)|(\w+)


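Example

I don't know C#, so I'm offering a PowerShell example to show how I would implement this. I made the expression case-insensitive and turned on "dot matches new line" with (?is), and I had to escape every single quote in the expression as ''.

Code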

$String = 'NEW[]
var a = NEw[] { "bla" };
var string = "new"
foo(); // new
/*
new
*/
'
clear

[regex]$Regex = '(?is)(\/\*(?:(?!\*\/).)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|''[^'']*'')|(new|var|foo)|(\w+)'

# cycle through all matches
$Regex.matches($String) | foreach {

    # Capture group 1 collects the comments, if populated then this match is a comment
    if ($_.Groups[1].Value) {
        Write-Host "comment at " $_.Groups[1].index " with a value => " $_.Groups[1].Value
        } # end if

    # capture group 2 collects the quoted strings, if populated then this match is a quoted string
    if ($_.Groups[2].Value) {
        Write-Host "quoted string at " $_.Groups[2].index " with a value => " $_.Groups[2].Value
        } # end if

    # capture group 3 collects keywords like new, var, and foo, if populated then this match is a keyword
    if ($_.Groups[3].Value) {
        Write-Host "keyword at " $_.Groups[3].index " with a value => " $_.Groups[3].Value
        } # end if

    # capture group 4 collects all the other word character chunks, so these might be variable names
    if ($_.Groups[4].Value) {
        Write-Host "possible variable name at " $_.Groups[4].index " with a value => " $_.Groups[4].Value
        } # end if

    } # next match

Output

keyword at  0  with a value =>  NEW
keyword at  7  with a value =>  var
possible variable name at  11  with a value =>  a
keyword at  15  with a value =>  NEw
quoted string at  23  with a value =>  "bla"
keyword at  33  with a value =>  var
possible variable name at  37  with a value =>  string
quoted string at  46  with a value =>  "new"
keyword at  53  with a value =>  foo
comment at  60  with a value =>  // new

comment at  68  with a value =>  /*
new
*/
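
For readers who want only the keyword hits, here is a minimal Python sketch of the same capture-group idea. It is not the original code: the named groups, the find_keywords helper, the \b word boundaries around the keywords, and the line-comment branch that does not require a trailing newline are all my own assumptions.

```python
import re

# Alternation order matters: comments and strings are tried first, so a
# keyword inside them is consumed as part of the comment or string.
TOKEN_RE = re.compile(
    r"(?P<comment>/\*(?:(?!\*/).)*\*/|//[^\r\n]*)"  # block or line comment
    r"|(?P<string>\"[^\"]*\"|'[^']*')"               # quoted text
    r"|(?P<keyword>\b(?:new|var|foo)\b)"              # keywords of interest
    r"|(?P<word>\w+)",                                # any other word
    re.IGNORECASE | re.DOTALL,
)


def find_keywords(source):
    """Return (index, text) for keywords outside comments and strings."""
    return [
        (m.start(), m.group())
        for m in TOKEN_RE.finditer(source)
        if m.lastgroup == "keyword"  # name of the alternative that matched
    ]


sample = (
    'NEW[]\n'
    'var a = NEw[] { "bla" };\n'
    'var string = "new"\n'
    'foo(); // new\n'
    '/*\nnew\n*/\n'
)
print(find_keywords(sample))
# [(0, 'NEW'), (6, 'var'), (14, 'NEw'), (31, 'var'), (50, 'foo')]
```

The indices here assume \n line endings in the sample string. Like the expression above, this sketch does not handle escaped quotes inside string literals.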

score 1 · Accepted Answer

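Simple: use a lexer. A lexer finds groups of text in a string and produces tokens from those groups. Each token is then given a "type" (something that defines what it is).

A C# keyword is one of the defined C# keywords. A simple regex for this is a word boundary, followed by one of the possible C# keywords, followed by another word boundary. ( "\b(new|var|string|...)\b")

Your lexer will find all keyword matches in a given string, create a token for each match, and say that the token's "type" is "keyword".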
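
To see what the word boundaries buy you, here is a small Python sketch (the three-keyword list and the sample line are hypothetical; the real C# keyword set is much longer). Each match becomes an (index, value, type) token.

```python
import re

# Hypothetical short keyword list, for illustration only.
KEYWORD_RE = re.compile(r"\b(?:new|var|string)\b")

code = "var renewed = new string('x', 3); newton();"

# One token per match, each tagged with the type "keyword".
tokens = [(m.start(), m.group(), "keyword") for m in KEYWORD_RE.finditer(code)]
print(tokens)
# [(0, 'var', 'keyword'), (14, 'new', 'keyword'), (18, 'string', 'keyword')]
```

Note that "renewed" and "newton" are skipped, because \b only matches where a word starts or ends.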

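However, as you said, you don't want to find keywords inside quotes or comments. This is where a lexer really earns its keep.

To handle this, a (regex-based) lexer uses two rules:

Remove every match that is contained in another match.
Remove any match that occupies the same space as another match but has a lower priority.

The lexer works in these steps:

Find all matches for each regex.
Convert them to tokens.
Sort the tokens by index.
Loop through the tokens, comparing each one with the next; if the next token is partially contained in the current one (or both occupy the same space), remove it.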
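
The steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the Token type, the RULES table, and the explicit priority field (position in the rule list) are hypothetical names, not part of any library.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    index: int
    value: str
    type: str
    priority: int  # lower number wins when two tokens start at the same index


# Hypothetical rule table, listed from highest to lowest priority.
RULES = [
    ("string", r'"(?:[^"\\]|\\.)*"'),
    ("comment", r"//.*|/\*[\s\S]*?\*/"),
    ("keyword", r"\b(?:new|var|string)\b"),
]


def lex(source):
    # Steps 1 and 2: collect every match of every rule as a token.
    tokens = [
        Token(m.start(), m.group(), kind, priority)
        for priority, (kind, pattern) in enumerate(RULES)
        for m in re.finditer(pattern, source)
    ]
    # Step 3: sort by index, breaking ties by priority.
    tokens.sort(key=lambda t: (t.index, t.priority))
    # Step 4: drop any token that starts before the last kept token ends.
    kept = []
    for tok in tokens:
        if kept and tok.index < kept[-1].index + len(kept[-1].value):
            continue
        kept.append(tok)
    return kept


line = 'string myString = "Hi! this is my new string!"//Defines a new string.'
for tok in lex(line):
    print(tok.index, tok.type, tok.value)
```

Only the first "string" survives as a keyword; the "new" and "string" inside the literal and the comment are dropped because they start inside a token that was kept earlier. Because each rule is matched independently, a quote character inside a comment can still confuse the string rule, so treat this as a sketch rather than a complete C# lexer.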

Spoiler alert: below is a fully functional lexer. It demonstrates how a lexer works.

For example:

Given regexes for strings, comments, and keywords, it shows how the lexer resolves conflicts between them.

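using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

//Simple Regex for strings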
string StringRegex = "\"(?:[^\"\\\\]|\\\\.)*\"";

//Simple Regex for comments
string CommentRegex = @"//.*|/\*[\s\S]*?\*/";

//Simple Regex for keywords
string KeywordRegex = @"\b(?:new|var|string)\b";

//Create a dictionary relating token types to regexes
Dictionary<string, string> Regexes = new Dictionary<string, string>()
{
    {"String", StringRegex},
    {"Comment", CommentRegex},
    {"Keyword", KeywordRegex}
};

//Define a string to tokenize
string input = "string myString = \"Hi! this is my new string!\"//Defines a new string.";


//Lexer steps:
//1). Find all of the matches from the regexes
//2). Convert them to tokens
//3). Order the tokens by index then priority
//4). Loop through each of the tokens comparing
//    the current match with the next match,
//    if the next match is partially contained by this match
//    (or if they both occupy the same space) remove it.


//** Sorry for the complex LINQ expression (not really) **

//Match each regex to the input string(Step 1)
var matches = Regexes.SelectMany(a => Regex.Matches(input, a.Value)
//Cast each match because MatchCollection does not implement IEnumerable<T>
.Cast<Match>()
//Select a new token for each match(Step 2)
.Select(b => 
        new
        {
            Index = b.Index,
            Value = b.Value,
            Type = a.Key //Type is based on the current regex.
        }))
//Order each token by the index (Step 3)
.OrderBy(a => a.Index).ToList();

//Loop through the tokens(Step 4)
for (int i = 0; i < matches.Count; i++)
{
    //Compare the current token with the next token to see if it is contained
    if (i + 1 < matches.Count)
    {
        int firstEndPos = (matches[i].Index + matches[i].Value.Length);
        if (firstEndPos > matches[(i + 1)].Index)
        {
            //Remove the next token from the list and stay at
            //the current match
            matches.RemoveAt(i + 1);
            i--;
        }
    }
}

//Now matches contains all of the right matches
//Filter the matches by the Type to single out keywords from comments and
//string literals.
foreach(var match in matches)
{
    Console.WriteLine(match);
}
Console.ReadLine();

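This is a working (I tested it), nearly complete lexer. (Feel free to use it or write your own.) It will find all the keywords you define in the regex without confusing them with string literals or comments.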
