c# - Transforming a string using regular expressions

Question

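I have some HTML content that I need to modify using C#. It's conceptually simple, but I'm not sure how to do it efficiently. The content contains multiple occurrences of a delimited number followed by an empty anchor tag. I need to take the delimited number and insert it into a JavaScript function call inside the anchor tag. For example: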

The source string would contain something like this:

%%1%%<a href="#"></a> 
<p>A bunch of HTML markup</p>

%%2%%<a href="#"></a>
<p>Some more HTML markup</p>

I need to transform it into this:

<a href="#" onclick="DoSomething('1')"></a>
<p>A bunch of HTML markup</p>

<a href="#" onclick="DoSomething('2')"></a>
<p>Some more HTML markup</p>

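There is no limit on the number of times %%\d+%% can occur. I tried writing a regular expression in the hope that I could use the Replace method, but I'm not sure whether that can work with multiple instances of each group. Here is what I have: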

%%(?<LinkID>\d+)%%(?<LinkStart><a[\s\S]*?)(?:(?<LinkEnd>>[\s\S]*?)(?=%%\d+|$))

// %%(?<LinkID>\d+)%%        Match a number surrounded by %% and put the number in a group named LinkID
// (?<LinkStart><a[\s\S]*?)  Match <a followed by any characters until next match (non greedy), in a group named LinkStart
// (?:                       Logical grouping that does not get captured
// (?<LinkEnd>>[\s\S]*?)     Match > followed by any characters until next match, in a group named LinkEnd
// (?=%%\d+%%|$)             Where the former LinkEnd group is followed by another instance of a delimited number or the end of the string. (I don't think this is working as I intended.)

Perhaps this could be done with some combination of regex operations and String.Format. I'm not an expert in regular expressions.

score 1 · Accepted Answer

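I'd say your regex is almost what you want; I've altered it slightly. This will work as long as $ matches only at the end of the string (that is, without RegexOptions.Multiline):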

%%(\d+)%%(<a[^>]*)(></a>)(.*?)(?=%%\d|$)

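If you decide to use it, you can access the groups for each match, which lets you construct a new string; that may be easier than replacing content in the existing string.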
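The idea of building a new string from each match's groups could be sketched roughly as follows. This is a minimal sketch, not the answerer's own code: the class and method names (MatchGroupRewriter, Rewrite) are hypothetical, and RegexOptions.Singleline is an assumption added so that .*? can cross the line breaks in the sample input.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class MatchGroupRewriter
{
    // Pattern from the answer above. RegexOptions.Singleline is an assumption:
    // it lets '.' match newlines so group 4 can span several lines.
    private static readonly Regex LinkPattern = new Regex(
        @"%%(\d+)%%(<a[^>]*)(></a>)(.*?)(?=%%\d|$)",
        RegexOptions.Singleline);

    public static string Rewrite(string source)
    {
        var sb = new StringBuilder();
        int lastEnd = 0;

        foreach (Match m in LinkPattern.Matches(source))
        {
            // Copy any text between the previous match and this one unchanged.
            sb.Append(source, lastEnd, m.Index - lastEnd);

            sb.Append(m.Groups[2].Value)              // opening of the anchor, e.g. <a href="#"
              .Append(" onclick=\"DoSomething('")
              .Append(m.Groups[1].Value)              // the delimited number
              .Append("')\"")
              .Append(m.Groups[3].Value)              // ></a>
              .Append(m.Groups[4].Value);             // markup up to the next %%n%% or the end

            lastEnd = m.Index + m.Length;
        }

        // Copy whatever follows the last match.
        sb.Append(source, lastEnd, source.Length - lastEnd);
        return sb.ToString();
    }
}
```

Calling MatchGroupRewriter.Rewrite on the question's sample should drop each %%n%% marker and add the onclick attribute to the anchor that follows it, leaving the rest of the markup as it was.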

score 1 · Accepted Answer

Parsing HTML with regular expressions has been covered extensively on SO. The consensus is that it shouldn't be done.

If you need to parse HTML, I suggest using something like the HTML Agility Pack. It lets you use something similar to XPath to identify the HTML you want to process.

score 0 · Accepted Answer

I would use string.Split for this.

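// GetData() and GetDataFromID() are placeholders: the first returns the source HTML,
// the second presumably returns the replacement anchor markup for a given ID.
string emptyAnchor = @"<a href=""#""></a>";
string src = GetData();
string[] splits = src.Split(new string[] { "%%" }, StringSplitOptions.None);
StringBuilder sb = new StringBuilder();

// Entry 0 is the text before the first %% (blank in the sample), so start at 1.
int i = 1;
// Entries come in pairs (ID, then markup); stop if the last pair is incomplete.
while (i + 1 < splits.Length)
{
    string id = splits[i];
    // Advance to the markup that follows the ID.
    i++;
    // Perhaps use a StringReplaceFirstOccurrence function instead.
    sb.Append(splits[i].Replace(emptyAnchor, GetDataFromID(id)));
    i++;
}
string output = sb.ToString();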

score 0 · Accepted Answer

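It turns out that Regex.Replace is already smart enough to handle multiple matches. I just modified my regex so that it doesn't use a lookahead. The idea is to find the number inside the %% delimiters and capture it in a group, find the content of the next anchor tag and capture it in a second group, then replace the entire match with a new anchor tag that has the two groups inserted into it. The Replace method appears to handle subsequent matches correctly on its own, without any extra help.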

string originalText = "<h3>%%1%%<a href=\"#\">First Spot</a></h3><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>" +
                            "<h3>%%2%%<a href=\"#\">Second Spot</a></h3><p>Ut vulputate lobortis feugiat.</p>" +
                            "<p>Ut nunc diam, malesuada iaculis viverra nec, auctor eget velit.</p>";

Regex regex = new Regex(@"%%(\d+)%%[\s]*<a[\s\S]*?>([\s\S]*?)</a>");
string result = regex.Replace(originalText, "<a href=\"#\" onclick=\"DoSomething($1)\">$2</a>");
Debug.WriteLine("Original Text: \"" + originalText + "\"");
Debug.WriteLine("Result Text: \"" + result + "\"");

Output:

Original Text: "<h3>%%1%%<a href="#">First Spot</a></h3><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p><h3>%%2%%<a href="#">Second Spot</a></h3><p>Ut vulputate lobortis feugiat.</p><p>Ut nunc diam, malesuada iaculis viverra nec, auctor eget velit.</p>"

Result Text: "<h3><a href="#" onclick="DoSomething(1)">First Spot</a></h3><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p><h3><a href="#" onclick="DoSomething(2)">Second Spot</a></h3><p>Ut vulputate lobortis feugiat.</p><p>Ut nunc diam, malesuada iaculis viverra nec, auctor eget velit.</p>"
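For the exact form asked for in the question (the ID wrapped in single quotes, existing anchor attributes kept), a variant using named groups might look like the sketch below. The class name AnchorRewriter and the group names id and attrs are hypothetical, and the pattern assumes no attribute value contains a > character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorRewriter
{
    // id    = the number between the %% delimiters
    // attrs = everything between "<a" and the closing ">" of the opening tag
    private static readonly Regex Pattern = new Regex(
        @"%%(?<id>\d+)%%\s*<a(?<attrs>[^>]*)>");

    public static string Rewrite(string html)
    {
        // ${name} inserts the text of a named group. Only the marker and the
        // opening tag are replaced, so link text and later markup are untouched.
        return Pattern.Replace(
            html,
            "<a${attrs} onclick=\"DoSomething('${id}')\">");
    }
}
```

Because only the opening tag is matched, the same call should work for both the empty anchors in the question and the anchors with link text used in this answer.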
