
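I want to match keywords in C# source code with a regular expression. Say I have the "new" keyword. I want to match every "new" keyword that is not inside " " (a string), // (a line comment), or /* */ (a block comment).

So far I have written: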

\b[^\w@]new\b

But it does not work for:

new[]
var a = new[] { "bla" };
var string = "new"
foo(); // new
/* new */

How can I improve that regex?


2 Answers


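Description

It is easier to capture all of the undesirable matches along with all of the good ones. Then, later in your program logic, test whether a given capture group is populated; if the keyword group is populated, that match is one you want.

This expression will:

  • avoid all single- and double-quoted blocks of text, like "new" or 'new'
  • avoid all block comment sections, like /* new */
  • avoid all single-line comments, like // new
  • capture any keyword that is neither quoted nor commented, like new, var, and foo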

(\/\*(?:(?!\*\/).)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|'[^']*')|\b(new|var|foo)\b|(\w+)
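Since the question is about C#, here is a rough sketch of how the same capture-group test might look there. It is an adaptation, not a verified port: the sample string, the variable names, and the use of top-level statements (C# 9 or later) are assumptions, and the line-comment part is written as `//[^\r\n]*` so that a comment on the last line without a trailing newline is still caught.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Group 1: comments, group 2: quoted text, group 3: keywords, group 4: other words.
// Singleline lets the dot cross line breaks inside a block comment.
var pattern = new Regex(
    @"(/\*(?:(?!\*/).)*\*/|//[^\r\n]*)|(""[^""]*""|'[^']*')|\b(new|var|foo)\b|(\w+)",
    RegexOptions.Singleline);

// Hypothetical sample input modeled on the question.
string source = "var a = new[] { \"bla\" };\nvar s = \"new\"; // new\n/* new */";

var keywords = new List<(int Index, string Value)>();
foreach (Match m in pattern.Matches(source))
{
    // Only matches that populated group 3 are keywords outside comments and strings.
    if (m.Groups[3].Success)
        keywords.Add((m.Groups[3].Index, m.Groups[3].Value));
}

foreach (var k in keywords)
    Console.WriteLine($"keyword at {k.Index}: {k.Value}");
```

The comments and strings are still matched, they are simply ignored because they land in groups 1 and 2 rather than group 3.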


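Example

I don't know C#, so I offer a PowerShell example to demonstrate how I would accomplish this. I made the expression case-insensitive and turned on "dot matches new line" by using (?is), and I had to escape all the single quotes in the expression as ''.

Code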

$String = 'NEW[]
var a = NEw[] { "bla" };
var string = "new"
foo(); // new
/*
new
*/
'
clear

[regex]$Regex = '(?is)(\/\*(?:(?!\*\/).)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|''[^'']*'')|\b(new|var|foo)\b|(\w+)'

# cycle through all matches
$Regex.matches($String) | foreach {

    # Capture group 1 collects the comments, if populated then this match is a comment
    if ($_.Groups[1].Value) {
        Write-Host "comment at " $_.Groups[1].index " with a value => " $_.Groups[1].Value
        } # end if

    # capture group 2 collects the quoted strings, if populated then this match is a quoted string
    if ($_.Groups[2].Value) {
        Write-Host "quoted string at " $_.Groups[2].index " with a value => " $_.Groups[2].Value
        } # end if

    # capture group 3 collects keywords like new, var, and foo, if populated then this match is a keyword
    if ($_.Groups[3].Value) {
        Write-Host "keyword at " $_.Groups[3].index " with a value => " $_.Groups[3].Value
        } # end if

    # capture group 4 collects all the other word character chunks, so these might be variable names
    if ($_.Groups[4].Value) {
        Write-Host "possible variable name at " $_.Groups[4].index " with a value => " $_.Groups[4].Value
        } # end if

    } # next match

Output

keyword at  0  with a value =>  NEW
keyword at  7  with a value =>  var
possible variable name at  11  with a value =>  a
keyword at  15  with a value =>  NEw
quoted string at  23  with a value =>  "bla"
keyword at  33  with a value =>  var
possible variable name at  37  with a value =>  string
quoted string at  46  with a value =>  "new"
keyword at  53  with a value =>  foo
comment at  60  with a value =>  // new

comment at  68  with a value =>  /*
new
*/
answered 2013-07-12T01:09:47.513

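Simple: use a lexer. A lexer finds groups of text in a string and generates tokens from those groups. Each token is then given a "type" (something that defines what it is).

A C# keyword is one of the defined C# keywords. A simple regex for this is a word boundary, one of the possible C# keywords, and another word boundary. ( "\b(new|var|string|...)\b")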
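As a quick illustration, here is a small sketch of that keyword regex used on its own. The shortened keyword list and the sample line are assumptions made for the example. It shows that the regex alone also reports the occurrences inside a string and a comment.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical, shortened keyword list; a real one would enumerate every C# keyword.
var keywordRegex = new Regex(@"\b(?:new|var|string)\b");

string line = "var s = \"new\"; // new";

// Every occurrence is reported, including the ones inside the string and the comment.
var hits = keywordRegex.Matches(line).Cast<Match>()
    .Select(m => (m.Index, m.Value))
    .ToList();

foreach (var hit in hits)
    Console.WriteLine($"{hit.Value} at {hit.Index}");
```

Three hits come back for this line even though only the leading var is a real keyword.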

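Your lexer will find all of the keyword matches in a given string, create a token for each match, and say that the token's "type" is "keyword".

However, as you say, you do not want to find keywords inside quotes or comments. This is where the lexer really earns its points.

To resolve this case, a (regex-based) lexer uses two methods:

  1. Remove every match that is contained by another match.
  2. Remove a match that occupies the same space as another match but has a lower priority.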
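The two rules can be sketched on hand-written tokens like this. The tuple shape, the Priority field (lower number wins), and the "Identifier" catch-all type are all assumptions made for illustration, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical tokens: (start index, length, type, priority). Lower priority number wins.
var tokens = new List<(int Index, int Length, string Type, int Priority)>
{
    (0, 5, "String", 0),      // a string literal covering [0, 5)
    (1, 3, "Keyword", 2),     // a keyword match contained by that string
    (6, 3, "Identifier", 3),  // a catch-all match covering [6, 9)
    (6, 3, "Keyword", 2),     // same span as the identifier, but higher priority
};

// Sort by position first, then by priority, so the preferred token of a tie comes first.
var ordered = tokens.OrderBy(t => t.Index).ThenBy(t => t.Priority).ToList();

var kept = new List<(int Index, int Length, string Type, int Priority)>();
foreach (var t in ordered)
{
    if (kept.Count > 0)
    {
        var last = kept[kept.Count - 1];
        // Drop a token that starts before the previously kept token ends.
        // This covers both rules: contained matches and same-span, lower-priority matches.
        if (t.Index < last.Index + last.Length) continue;
    }
    kept.Add(t);
}

foreach (var t in kept)
    Console.WriteLine($"{t.Type} at {t.Index}");
```

Only the string at 0 and the keyword at 6 survive: the keyword at 1 is contained by the string, and the identifier at 6 loses the tie on priority.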

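A lexer works in the following steps:

  1. Find all of the matches from the regexes.
  2. Convert them to tokens.
  3. Order the tokens by index.
  4. Loop through the tokens, comparing the current match with the next match; if the next match is partially contained by the current one (or if they both occupy the same space), remove it.

Spoiler alert: below is a fully functional lexer that demonstrates how these steps work.

For example:

Given regexes for strings, comments, and keywords, show how a lexer can resolve the conflicts between them.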

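using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

//Simple Regex for strings
string StringRegex = "\"(?:[^\"\\\\]|\\\\.)*\"";

//Simple Regex for comments (lazy match so a block comment ends at the first */)
string CommentRegex = @"//.*|/\*[\s\S]*?\*/";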

//Simple Regex for keywords
string KeywordRegex = @"\b(?:new|var|string)\b";

//Create a dictionary relating token types to regexes
Dictionary<string, string> Regexes = new Dictionary<string, string>()
{
    {"String", StringRegex},
    {"Comment", CommentRegex},
    {"Keyword", KeywordRegex}
};

//Define a string to tokenize
string input = "string myString = \"Hi! this is my new string!\"//Defines a new string.";


//Lexer steps:
//1). Find all of the matches from the regexes
//2). Convert them to tokens
//3). Order the tokens by index then priority
//4). Loop through each of the tokens comparing
//    the current match with the next match,
//    if the next match is partially contained by this match
//    (or if they both occupy the same space) remove it.


//** Sorry for the complex LINQ expression (not really) **

//Match each regex to the input string(Step 1)
var matches = Regexes.SelectMany(a => Regex.Matches(input, a.Value)
//Cast each match because MatchCollection does not implement IEnumerable<T>
.Cast<Match>()
//Select a new token for each match(Step 2)
.Select(b => 
        new
        {
            Index = b.Index,
            Value = b.Value,
            Type = a.Key //Type is based on the current regex.
        }))
//Order each token by the index (Step 3)
.OrderBy(a => a.Index).ToList();

//Loop through the tokens(Step 4)
for (int i = 0; i < matches.Count; i++)
{
    //Compare the current token with the next token to see if it is contained
    if (i + 1 < matches.Count)
    {
        int firstEndPos = (matches[i].Index + matches[i].Value.Length);
        if (firstEndPos > matches[(i + 1)].Index)
        {
            //Remove the next token from the list and stay at
            //the current match
            matches.RemoveAt(i + 1);
            i--;
        }
    }
}

//Now matches contains all of the right matches
//Filter the matches by the Type to single out keywords from comments and
//string literals.
foreach(var match in matches)
{
    Console.WriteLine(match);
}
Console.ReadLine();

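This is a working (I tested it), almost complete lexer. (Feel free to use it or write your own.) It will find all of the keywords that you define in the regex and not confuse them with string literals or comments.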
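To single out just the keywords, the token list can be filtered by type once the overlaps are resolved. Below is a condensed sketch of that idea. FindKeywords is a hypothetical helper name, the block-comment pattern is written lazily as [\s\S]*?, and top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// The same three token types as in the lexer above.
var regexes = new Dictionary<string, string>
{
    { "String", "\"(?:[^\"\\\\]|\\\\.)*\"" },
    { "Comment", @"//.*|/\*[\s\S]*?\*/" },
    { "Keyword", @"\b(?:new|var|string)\b" },
};

// Hypothetical helper: tokenize, drop overlapped tokens, then keep only keywords.
List<(int Index, string Value)> FindKeywords(string input)
{
    var tokens = regexes
        .SelectMany(r => Regex.Matches(input, r.Value).Cast<Match>()
            .Select(m => (m.Index, m.Value, Type: r.Key)))
        .OrderBy(t => t.Index)
        .ToList();

    var kept = new List<(int Index, string Value, string Type)>();
    foreach (var t in tokens)
    {
        if (kept.Count > 0)
        {
            var last = kept[kept.Count - 1];
            // Skip any token that begins inside the previously kept token.
            if (t.Index < last.Index + last.Value.Length) continue;
        }
        kept.Add(t);
    }

    return kept.Where(t => t.Type == "Keyword")
               .Select(t => (t.Index, t.Value))
               .ToList();
}

var found = FindKeywords(
    "string myString = \"Hi! this is my new string!\"//Defines a new string.");
foreach (var k in found)
    Console.WriteLine($"{k.Value} at {k.Index}");
```

For this input only the leading string keyword remains; the occurrences of new and string inside the literal and the comment are discarded with their enclosing tokens.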

answered 2013-07-12T03:32:15.053