c# - How can a regex find a "non-contiguous suffix" within a suffix

Question

I have a document containing many lines of text that mix two languages, like this (note the words עשמ and טקסט):

<a href="http://www.example.co.il/search/index.aspx?sQuery=ID:עשמ@111/13&CaseType=טקסט" />

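Goal:
I want to replace the "other language" parts of the text with their URL-encoded equivalents.

Problem:
I only get the first letter of the "other language" text.

I am using this regex pattern: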

((href=\"http://.+?sQuery=[^\"]*)([א-ת]+)([^\"]*\"))+?

Here is the full code of the method:

string[] files = Directory.GetFiles(@"C:\Test", "*.html", SearchOption.AllDirectories);
foreach (string file in files)
{
   string fileContent = File.ReadAllText(file, Encoding.GetEncoding(1255)); 
   fileContent = fileContent.Replace("windows-1255", "utf-8");      
   Regex hrefRegex = new Regex("((href=\"http://.+?sQuery=[^\"]*)([א-ת]+)([^\"]*\"))+?");

   fileContent = Regex.Replace(fileContent,hrefRegex.ToString(), delegate(Match match)
   {
       string textToEncode = match.Groups[3].Value;
       string encodedText = HttpUtility.UrlEncode(textToEncode, new UTF8 Encoding(false)).ToUpper();
       return match.Groups[2].Value + encodedText + match.Groups[4].Value;
   });          

   File.WriteAllText(file + "_fix.html", fileContent, new UTF8Encoding(false));
}

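What exactly am I doing wrong?

And how should I update my regex pattern so that it finds all the "other language" parts inside the href? Right now I only get the first one.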

score 1 · Accepted Answer

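The pattern produces only one match, which is the entire string. If you want to encode character by character, use this regex: ([א-ת]). If you want to encode word by word, use this one: ([א-ת]+).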
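The difference can be checked with a small, self-contained sketch. It runs the pattern from the question against the sample link and then runs the bare `[א-ת]+` pattern against the same text. The class name `MatchDemo` and its constants are hypothetical names introduced only for this illustration. With the question's pattern, the greedy `[^\"]*` in group 2 consumes nearly everything before the closing quote, so group 3 is left with a single letter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchDemo
{
    // Sample link from the question.
    public const string Sample =
        "<a href=\"http://www.example.co.il/search/index.aspx?sQuery=ID:עשמ@111/13&CaseType=טקסט\" />";

    // Pattern from the question, unchanged.
    public const string HrefPattern =
        "((href=\"http://.+?sQuery=[^\"]*)([א-ת]+)([^\"]*\"))+?";

    public static void Main()
    {
        MatchCollection hrefMatches = Regex.Matches(Sample, HrefPattern);

        // One match for the whole href attribute.
        Console.WriteLine(hrefMatches.Count);

        // Group 3 holds only one letter, because group 2 is greedy.
        Console.WriteLine(hrefMatches[0].Groups[3].Value.Length);

        // The bare pattern finds each Hebrew word separately.
        foreach (Match word in Regex.Matches(Sample, "[א-ת]+"))
        {
            Console.WriteLine(word.Value);
        }
    }
}
```

Note that the bare pattern on its own would also match Hebrew text outside of any href, which is why the edit that follows first restricts the search to the href match.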

Edit: to encode these characters only inside the href part, do the following:

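            // hrefRegex is the Regex object from the question, so call its instance Replace method.
            fileContent = hrefRegex.Replace(fileContent, delegate(Match match)
            {
                // The match is the whole href attribute; re-scan it for Hebrew letters.
                string textToEncode = match.ToString();
                textToEncode = Regex.Replace(textToEncode, "[א-ת]", delegate(Match smallMatch)
                {
                    // Percent-encode each letter as UTF-8 and upper-case the hex digits.
                    return HttpUtility.UrlEncode(smallMatch.ToString(), new UTF8Encoding(false)).ToUpper();
                });
                return textToEncode;
            });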
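For experimenting outside a web project, the same two-step idea (match the href first, then encode the Hebrew inside it) can be sketched without `System.Web`. This is only an illustration under stated assumptions: `HrefEncoder` and `EncodeHebrewInHrefs` are hypothetical names, the outer pattern is a simplified stand-in for the one in the question, and `Uri.EscapeDataString` is swapped in for `HttpUtility.UrlEncode` because it already emits upper-case UTF-8 percent escapes. It encodes whole words (`[א-ת]+`) rather than single letters, which gives the same output here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefEncoder
{
    // Assumption: a simplified href pattern that never crosses a double quote.
    private static readonly Regex HrefRegex =
        new Regex("href=\"http://[^\"]*?sQuery=[^\"]*\"");

    // One or more consecutive Hebrew letters (U+05D0 to U+05EA).
    private static readonly Regex HebrewRun = new Regex("[א-ת]+");

    // Percent-encodes every Hebrew run, but only inside matching href attributes.
    public static string EncodeHebrewInHrefs(string html)
    {
        return HrefRegex.Replace(html, hrefMatch =>
            HebrewRun.Replace(hrefMatch.Value, run => Uri.EscapeDataString(run.Value)));
    }
}
```

Calling `HrefEncoder.EncodeHebrewInHrefs(fileContent)` on the sample link should turn `ID:עשמ@111/13` into `ID:%D7%A2%D7%A9%D7%9E@111/13` and leave Hebrew text outside of href attributes untouched.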
