c# - RegEx for a Glossary Function

Question

I'm working on a web-based help system that will auto-insert links into the explanatory text, taking users to other topics in help. I have hundreds of terms that should be linked, e.g.:

"Manuals and labels" (describes these concepts in general)
"Delete Manuals and Labels" (describes this specific action)
"Learn more about adding manuals and labels" (again, a more specific action)

I have a RegEx to find / replace whole words (good ol' \b), which works great, except for linked terms found inside other linked terms. Instead of:

<a href="#">Learn more about manuals and labels</a>

I end up with

<a href="#">Learn more about <a href="#">manuals and labels</a></a>

Which makes everyone cry a little. Changing the order in which the terms are replaced (going shortest to longest) means that I'd get:

Learn more about <a href="#">manuals and labels</a>

Without the outer link I really need.

The further complication is that the capitalization of the search terms can vary, and I need to retain the original capitalization. If I could do something like this, I'd be all set:

Regex _regex = new Regex("\\b" + termToFind + "(|s)" + "\\b", RegexOptions.IgnoreCase);

string resultingText = _regex.Replace(textThatNeedsLinksInserted, "<a>" + "$&".Replace(" ", "_") + "</a>");

And then after all the terms are done, remove the "_", that would be perfect. "Learn_more_about_manuals_and_labels" wouldn't match "manuals and labels," and all is well.

It would be hard to have the help authors delimit the terms that need to be replaced when writing the text -- they're not used to coding. Also, this would limit the flexibility to add new terms later, since we'd have to go back and add delimiters to all the previously written text.

Is there a RegEx that would let me replace whitespace with "_" in the original match? Or is there a different solution that's eluding me?

score 1 · Accepted Answer

I would use an ordered dictionary like this, making sure the smallest term is last:

using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;

public class Test
{
    public static void Main()
    {
        OrderedDictionary Links = new OrderedDictionary();
        Links.Add("Learn more about adding manuals and labels", "2");
        Links.Add("Delete Manuals and Labels", "3");
        Links.Add("manuals and labels", "1");

        string text = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";

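        foreach (string termToFind in Links.Keys)
        {
            // Escape the term so any regex metacharacters in it are matched literally.
            // The lookahead (?![^<>]*</) skips matches that already sit inside an inserted <a>...</a>.
            Regex _regex = new Regex(@"\b" + Regex.Escape(termToFind) + @"s?\b(?![^<>]*</)", RegexOptions.IgnoreCase);
            text = _regex.Replace(text, @"<a href=""" + Links[termToFind] + @".html"">$&</a>");
        }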
        Console.WriteLine(text);
    }
}

The negative lookahead (?![^<>]*</) that I added stops a term from being replaced again when it already sits between anchor tags inserted by an earlier replacement.
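To see the lookahead in isolation, here is a minimal sketch that runs a single pass over text that already contains one inserted link. The class and method names (LookaheadDemo, LinkTerm) are made up for illustration, and the sample text is an assumption; only the pattern comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Hypothetical helper: links `term` (with an optional trailing "s") unless the
    // match is followed by a closing tag with no other tag in between, which is
    // the case when the match already sits inside <a>...</a>.
    public static string LinkTerm(string text, string term, string href)
    {
        var regex = new Regex(
            @"\b" + Regex.Escape(term) + @"s?\b(?![^<>]*</)",
            RegexOptions.IgnoreCase);
        // m.Value keeps the capitalization found in the text.
        return regex.Replace(text, m => "<a href=\"" + href + "\">" + m.Value + "</a>");
    }

    public static void Main()
    {
        // Assumed sample: the longer term was linked by an earlier pass.
        string text = "<a href=\"2.html\">Learn more about adding manuals and labels</a>. Manuals and labels matter.";
        Console.WriteLine(LinkTerm(text, "manuals and labels", "1.html"));
    }
}
```

Only the second occurrence gets a new link; the one inside the existing anchor is left alone. One caveat: the lookahead cannot tell an anchor from any other element, so a term that is directly followed by some other closing tag (for example inside <p>...</p> with no tag in between) would be skipped as well.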

score 1 · Accepted Answer

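From your example with the nested links, it sounds like you're making individual passes over the terms and performing multiple Regex.Replace calls. Since you're using a regex, you should let it do the heavy lifting and put together a single pattern that makes use of alternation.

In other words, you probably want a pattern like this: \b(term1|term2|termN)\b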

var input = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";
var terms = new[] 
{
    "Learn more about adding manuals and labels",
    "Delete Manuals and Labels",
    "manuals and labels"
};

var pattern = @"\b(" + String.Join("|", terms) + @")\b";
var replacement = @"<a href=""#"">$1</a>";
var result = Regex.Replace(input, pattern, replacement, RegexOptions.IgnoreCase);
Console.WriteLine(result);

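Now, to address the issue of a corresponding href value for each term, you can use a dictionary and change the regex to use a MatchEvaluator that returns a custom format and looks up the value from the dictionary. The dictionary also ignores case, by passing in StringComparer.OrdinalIgnoreCase. I tweaked the pattern slightly by adding ?: at the start of the group to make it a non-capturing group, since I no longer refer to the captured item as I did in the first example.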

var terms = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    { "Learn more about adding manuals and labels", "2.html" },
    { "Delete Manuals and Labels", "3.html" },
    { "manuals and labels", "1.html" }
};

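// Needs System.Linq for Select, plus System.Collections.Generic and System.Text.RegularExpressions.
var pattern = @"\b(?:" + String.Join("|", terms.Select(t => t.Key)) + @")\b";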
var result = Regex.Replace(input, pattern,
    m => String.Format(@"<a href=""{0}"">{1}</a>", terms[m.Value], m.Value),
    RegexOptions.IgnoreCase);

Console.WriteLine(result);
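One thing to watch with a single alternation: the .NET engine tries the alternatives left to right, so if one term is a prefix of another, the shorter one wins when it is listed first. The sketch below sorts the terms longest-first and escapes them before joining. GlossaryLinker, Link, and the extra term "manuals" (with 4.html) are hypothetical and only there to show the prefix case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GlossaryLinker
{
    // Hypothetical helper: builds one alternation pattern from the terms,
    // longest first, and links every match in a single pass.
    public static string Link(string input, IDictionary<string, string> terms)
    {
        // Case-insensitive copy so m.Value finds its entry whatever its capitalization.
        var lookup = new Dictionary<string, string>(terms, StringComparer.OrdinalIgnoreCase);

        // Longest terms first, each escaped so metacharacters are matched literally.
        var alternatives = lookup.Keys
            .OrderByDescending(t => t.Length)
            .Select(t => Regex.Escape(t));

        var pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";
        return Regex.Replace(input, pattern,
            m => string.Format("<a href=\"{0}\">{1}</a>", lookup[m.Value], m.Value),
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        var terms = new Dictionary<string, string>
        {
            { "manuals", "4.html" },            // hypothetical term, a prefix of the next one
            { "manuals and labels", "1.html" }
        };
        Console.WriteLine(Link("Read about manuals and labels, or just manuals.", terms));
    }
}
```

Without the sort, "manuals" would match first inside "manuals and labels" and the longer term would never be linked.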

score 0 · Accepted Answer

First, you can use a lookbehind to prevent your regex for manuals and labels from finding Learn more about manuals and labels. Modify your regex as follows:

(?<!Learn more about )(manuals and labels)
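As a rough illustration of what that lookbehind does, the sketch below applies it to a sample sentence. The class name, method name, sample text, and the "#" href are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Hypothetical helper: links "manuals and labels" only when it is not
    // directly preceded by "Learn more about ".
    public static string LinkInnerTerm(string text)
    {
        var regex = new Regex(
            @"(?<!Learn more about )(manuals and labels)",
            RegexOptions.IgnoreCase);
        return regex.Replace(text, "<a href=\"#\">$1</a>");
    }

    public static void Main()
    {
        Console.WriteLine(LinkInnerTerm(
            "Learn more about manuals and labels. Store your manuals and labels safely."));
    }
}
```

Only the second occurrence is linked. The drawback is that every longer term containing the short one needs its own lookbehind, which gets unwieldy with hundreds of terms.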

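But for your specific requirement, I would suggest a different solution. You should define a rule or a priority list for your regexes, or both. One possible rule could be "always search first for the regex that matches the most characters". However, this requires your regexes to always be fixed length. It also does not prevent one regex from consuming and replacing characters that should have been matched by a different regex (possibly even one of the same size).

Of course, you will need to add extra lookbehinds and lookaheads to each regex to prevent it from replacing strings inside elements that have already been replaced.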
