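I want to match C# keywords in source code with a regular expression. Say I have the keyword "new". I want to match every "new" keyword that is not inside "" (a string literal), // (a line comment), or /* */ (a block comment).
So far I have written: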
\b[^\w@]new\b
But it does not behave correctly for:
new[]
var a = new[] { "bla" };
var string = "new"
foo(); // new
/* new */
How can I improve that regex?
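It is easier to capture all of the undesirable matches together with all of the good ones. Later, in your program logic, test which capture group was populated; if the keyword group is populated, that match is the one you want.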
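The expression matches each of the following and sorts it into its own capture group:
group 1: comments, both /* new */ and // new
group 2: quoted values such as "new" or 'new'
group 3: keywords such as new, var, and foo
group 4: any other run of word characters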
(\/\*(?:(?!\*\/).)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|'[^']*')|\b(new|var|foo)\b|(\w+)
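I don't know C#, so I'm offering a PowerShell example to show how I would do this. I made the expression case-insensitive and turned on "dot matches newline" with (?is), and had to escape every single quote in the expression as ''.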
Code
$String = 'NEW[]
var a = NEw[] { "bla" };
var string = "new"
foo(); // new
/*
new
*/
'
clear
[regex]$Regex = '(?is)(\/\*(?:(?!\*\/).)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|''[^'']*'')|\b(new|var|foo)\b|(\w+)'
# cycle through all matches
$Regex.matches($String) | foreach {
    # Capture group 1 collects the comments, if populated then this match is a comment
    if ($_.Groups[1].Value) {
        Write-Host "comment at " $_.Groups[1].index " with a value => " $_.Groups[1].Value
    } # end if
    # capture group 2 collects the quoted strings, if populated then this match is a quoted string
    if ($_.Groups[2].Value) {
        Write-Host "quoted string at " $_.Groups[2].index " with a value => " $_.Groups[2].Value
    } # end if
    # capture group 3 collects keywords like new, var, and foo, if populated then this match is a keyword
    if ($_.Groups[3].Value) {
        Write-Host "keyword at " $_.Groups[3].index " with a value => " $_.Groups[3].Value
    } # end if
    # capture group 4 collects all the other word character chunks, so these might be variable names
    if ($_.Groups[4].Value) {
        Write-Host "possible variable name at " $_.Groups[4].index " with a value => " $_.Groups[4].Value
    } # end if
} # next match
Output
keyword at 0 with a value => NEW
keyword at 7 with a value => var
possible variable name at 11 with a value => a
keyword at 15 with a value => NEw
quoted string at 23 with a value => "bla"
keyword at 33 with a value => var
possible variable name at 37 with a value => string
quoted string at 46 with a value => "new"
keyword at 53 with a value => foo
comment at 60 with a value => // new
comment at 68 with a value => /*
new
*/
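For readers who want to try the capture-group idea outside PowerShell, here is a minimal sketch in Python. The names TOKEN_RE and find_keywords and the sample text are made up for illustration. The pattern is a simplified variant of the expression in this answer (named groups, word boundaries around the keywords, no handling of escaped quotes), so treat it as a starting point and not as a drop-in replacement.

```python
import re

# Alternation order matters: comments and strings are tried before keywords,
# so a keyword inside them is consumed as part of the comment or string.
TOKEN_RE = re.compile(
    r"""
    (?P<comment>/\*.*?\*/|//[^\r\n]*)      # block comment or line comment
  | (?P<string>"[^"]*"|'[^']*')            # quoted value (escapes not handled)
  | (?P<keyword>\b(?:new|var|foo)\b)       # the keywords of interest
  | (?P<word>\w+)                          # any other run of word characters
    """,
    re.VERBOSE | re.DOTALL,
)


def find_keywords(source):
    """Return (offset, text) for each keyword outside comments and strings."""
    return [
        (m.start(), m.group())
        for m in TOKEN_RE.finditer(source)
        if m.lastgroup == "keyword"  # name of the group that produced this match
    ]


sample = (
    'new[]\n'
    'var a = new[] { "bla" };\n'
    'var string = "new"\n'
    'foo(); // new\n'
    '/* new */\n'
)
print(find_keywords(sample))
```

The occurrences of new inside "new", // new, and /* new */ are matched by the comment and string groups, so they never reach the keyword group.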
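Simple: use a lexer. A lexer finds groups of text in a string and generates tokens from those groups. Each token is then given a "type" (something that defines what it is).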
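A C# keyword is one of the defined C# keywords. A simple regex for this is a word boundary, followed by one of the possible C# keywords, followed by another word boundary ("\b(new|var|string|...)\b").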
Your lexer will find all matches for keywords in a given string, create a token for each match, and say that the token's "type" is "keyword".
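As a rough illustration of this tokenizing step alone, the sketch below is written in Python and turns keyword matches into typed tokens. The Token class, the keyword_tokens function, and the shortened keyword list are assumptions made for the example; they are not part of any library.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str   # the token "type", e.g. "keyword"
    start: int  # offset of the match in the input
    text: str   # the matched text


# Hypothetical, shortened keyword list; extend it with the keywords you need.
KEYWORD_RE = re.compile(r"\b(?:new|var|string)\b")


def keyword_tokens(source):
    """Create one 'keyword' token per match of KEYWORD_RE."""
    return [Token("keyword", m.start(), m.group()) for m in KEYWORD_RE.finditer(source)]


tokens = keyword_tokens("var renewed = new string('x', 3);")
for token in tokens:
    print(token)
```

The word boundaries keep the "new" inside "renewed" from being reported, but nothing here knows about strings or comments yet.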
However, as you said, you don't want to find keywords inside quotes or comments. This is where the lexer really earns its points.
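To resolve this case, a (regex-based) lexer uses two rules: a match that is contained in, or overlaps, an earlier match is removed, and when two matches occupy the same space only the higher-priority one is kept.
The lexer works in the following steps: find all of the matches from the regexes, convert them to tokens, order the tokens by index, then loop through the tokens and remove any token that overlaps the one before it.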
Spoiler alert: below is a fully functional lexer, which demonstrates how a lexer works.
For example:
Given regexes for strings, comments, and keywords, show how a lexer resolves conflicts between them.
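using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

//Simple Regex for strings
string StringRegex = "\"(?:[^\"\\\\]|\\\\.)*\"";
//Simple Regex for comments (lazy quantifier, so a block comment ends at the first */)
string CommentRegex = @"//.*|/\*[\s\S]*?\*/";
//Simple Regex for keywords
string KeywordRegex = @"\b(?:new|var|string)\b";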
//Create a dictionary relating token types to regexes
Dictionary<string, string> Regexes = new Dictionary<string, string>()
{
    {"String", StringRegex},
    {"Comment", CommentRegex},
    {"Keyword", KeywordRegex}
};
//Define a string to tokenize
string input = "string myString = \"Hi! this is my new string!\"//Defines a new string.";
//Lexer steps:
//1). Find all of the matches from the regexes
//2). Convert them to tokens
//3). Order the tokens by index then priority
//4). Loop through each of the tokens comparing
// the current match with the next match,
// if the next match is partially contained by this match
// (or if they both occupy the same space) remove it.
//** Sorry for the complex LINQ expression (not really) **
//Match each regex to the input string(Step 1)
var matches = Regexes.SelectMany(a => Regex.Matches(input, a.Value)
    //Cast each match because MatchCollection does not implement IEnumerable<T>
    .Cast<Match>()
    //Select a new token for each match(Step 2)
    .Select(b =>
        new
        {
            Index = b.Index,
            Value = b.Value,
            Type = a.Key //Type is based on the current regex.
        }))
    //Order each token by the index (Step 3). OrderBy is a stable sort, so
    //tokens at the same index keep the order in which they were found.
    .OrderBy(a => a.Index).ToList();
//Loop through the tokens(Step 4)
for (int i = 0; i < matches.Count; i++)
{
    //Compare the current token with the next token to see if it is contained
    if (i + 1 < matches.Count)
    {
        int firstEndPos = (matches[i].Index + matches[i].Value.Length);
        if (firstEndPos > matches[(i + 1)].Index)
        {
            //Remove the next token from the list and stay at
            //the current match
            matches.RemoveAt(i + 1);
            i--;
        }
    }
}
//Now matches contains all of the right matches
//Filter the matches by the Type to single out keywords from comments and
//string literals.
foreach (var match in matches)
{
    Console.WriteLine(match);
}
Console.ReadLine();
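This is a working (I tested it), nearly complete lexer. (Feel free to use it or write your own.) It will find all of the keywords you define in the regex without confusing them with string literals or comments.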
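For comparison, the same four steps can be sketched in Python with the priority made explicit. This is an illustration under stated assumptions and not a port of the code above: PATTERNS and tokenize are hypothetical names, the position of a pattern in the list is taken as its priority, and the comment pattern uses a lazy quantifier so a block comment stops at the first */.

```python
import re

# List position doubles as priority: earlier entries win ties at the same index.
PATTERNS = [
    ("String", r'"(?:[^"\\]|\\.)*"'),
    ("Comment", r"//.*|/\*[\s\S]*?\*/"),
    ("Keyword", r"\b(?:new|var|string)\b"),
]


def tokenize(text):
    # Steps 1 and 2: collect every match of every pattern as a tuple.
    found = [
        (m.start(), priority, m.end(), name, m.group())
        for priority, (name, pattern) in enumerate(PATTERNS)
        for m in re.finditer(pattern, text)
    ]
    # Step 3: order by start index, then by priority.
    found.sort()
    # Step 4: keep a token only if it starts at or after the end of the
    # last kept token; anything overlapping an earlier token is dropped.
    tokens, last_end = [], 0
    for start, _priority, end, name, value in found:
        if start >= last_end:
            tokens.append((name, start, value))
            last_end = end
    return tokens


source = 'string myString = "Hi! this is my new string!"//Defines a new string.'
for token in tokenize(source):
    print(token)
```

One limitation of scanning each pattern independently, in this sketch and in the approach generally: an unbalanced quote inside a comment can throw the string matches out of step with the real string literals. A single-pass alternation, as in the first answer, does not have that problem.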