
I am looking for a way to remove duplicates in a document, using Regex or something similar. For example, given the following:

First Line

<Important text /><Important text />Other random words

I need to remove the duplicates of <some text/> and leave everything else as it is. The text may or may not span multiple lines.

It needs to work for several different words, all of which use the < > tags.

EDIT:

I do not know what the words will be. Some will be nested inside < > tags and some will not be. I need to remove all duplicates that repeat one after another, something like:

<text/><text/><words/><words/><words/>

And the output should be:

<text/><words/>

4 Answers


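This regex matches any tag that is immediately followed by an identical copy of itself: (<.+?\/>)(?=\1). Replacing each match with an empty string leaves only the last copy of every run. There is a Regex 101 demo to prove it.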
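
A minimal C# sketch of applying this pattern, assuming the goal is to delete every tag that is directly followed by an identical copy. The class and method names are made up for illustration, and `[^<>]+` is swapped in for `.+?` (my assumption, not part of the answer) so that a single match cannot run across several tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDedupe
{
    // Hypothetical helper: deletes each self-closing tag that is directly
    // followed by an identical tag, so only the last copy of a run survives.
    public static string Collapse(string input)
    {
        // (<[^<>]+/>) captures one tag; (?=\1) requires the same text to follow
        // without consuming it, so overlapping runs are handled one tag at a time.
        return Regex.Replace(input, @"(<[^<>]+/>)(?=\1)", "");
    }

    public static void Main()
    {
        Console.WriteLine(Collapse("<text/><text/><words/><words/><words/>"));
        // prints: <text/><words/>
    }
}
```

Because the lookahead does not consume the following tag, a run of three copies is reduced in two separate matches rather than one.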

answered 2013-08-19T17:32:51.593

Personally, I don't like using regular expressions with tags.

Split the text on each tag, remove the duplicates with Distinct, join the results, and voilà.

string input1 = "<Important text /><Important text />Other random words";
string input2 = "<text/><text/><words/><words/><words/>";

string result1 = RemoveDuplicateTags(input1); // "<Important text />Other random words"
string result2 = RemoveDuplicateTags(input2); // "<text/><words/>"

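// Requires: using System.Collections.Generic; using System.Linq;
private string RemoveDuplicateTags(string input)
{
    // Split on '>' so each piece is a tag (minus its closing '>') or plain text.
    IEnumerable<string> tagsOrRandomWords = input.Split('>');
    // Note: Distinct drops every later repeat, not only adjacent ones.
    tagsOrRandomWords = tagsOrRandomWords.Distinct();

    return string.Join(">", tagsOrRandomWords);
}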

Or, if you prefer a less readable one-liner:

private string RemoveDuplicateTags(string input)
{
    return string.Join(">", input.Split('>').Distinct());
}
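
One caveat: `Distinct` drops every later repeat, not only back-to-back ones, so `<a/><b/><a/>` would lose its second `<a/>`. If only consecutive duplicates should go, one possible variation (a sketch, with hypothetical class and method names) keeps the split/join idea but compares each piece with the one right before it:

```csharp
using System;
using System.Collections.Generic;

public static class AdjacentDedupe
{
    // Hypothetical variant of RemoveDuplicateTags: removes a piece only when
    // it is identical to the piece immediately before it.
    public static string RemoveAdjacentDuplicateTags(string input)
    {
        var kept = new List<string>();
        string previous = null;

        // Each piece is a tag without its closing '>' or a stretch of plain text.
        foreach (string piece in input.Split('>'))
        {
            if (piece != previous)
            {
                kept.Add(piece);
            }
            previous = piece;
        }

        return string.Join(">", kept);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveAdjacentDuplicateTags("<a/><b/><a/><a/>"));
        // prints: <a/><b/><a/>
    }
}
```

Like the original, this compares plain-text pieces as well as tags, so text that ends in `>` and repeats back to back would also be collapsed.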
answered 2013-08-19T17:35:49.307

You can use this:

Regex.Replace(input, "(<Important text />)+", "<Important text />");

This replaces any run of one or more consecutive <Important text /> with a single instance of <Important text />.

Or, more simply:

Regex.Replace(input, "(<Important text />)+", "$1");

For example:

var input = "<Important text /><Important text />Other random words";
var output = Regex.Replace(input, "(<Important text />)+", "$1");

Console.WriteLine(output); // <Important text />Other random words

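If you want to handle several such patterns at once, use an alternation (|) that lists each word you want to handle, plus a backreference (\1) to find the repeats: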

Regex.Replace(input, @"(<(?:Important text|Other text) />)\1+", "$1");

For example:

var input = "<text/><text/><words/><words/><words/>";
var output = Regex.Replace(input, @"(<(?:text|words)\s*/>)\1+", "$1");

Console.WriteLine(output); // <text/><words/>
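
The question's edit says the words are not known in advance, and that the text may span several lines. A possible generalization, sketched under those assumptions, replaces the alternation with `[^<>]+` so that any self-closing tag qualifies, and allows optional whitespace (including line breaks) between the repeats. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GenericTagDedupe
{
    // Hypothetical helper: collapses a run of identical self-closing tags,
    // optionally separated by whitespace or line breaks, into a single tag.
    public static string Collapse(string input)
    {
        // (<[^<>]+/>) captures any one tag; (?:\s*\1)+ consumes its repeats
        // together with any whitespace that sits between them.
        return Regex.Replace(input, @"(<[^<>]+/>)(?:\s*\1)+", "$1");
    }

    public static void Main()
    {
        Console.WriteLine(Collapse("<text/>\n<text/><words/> <words/><words/>"));
        // prints: <text/><words/>
    }
}
```

Whitespace between the duplicates is removed along with them; whitespace between different tags is left alone.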
answered 2013-08-19T17:19:51.430

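You should build a dictionary of all the tags, i.e. all the text between < and />, including the brackets, along with their counts (this can be done with a regular expression). Then iterate again, and either remove the duplicates or skip them when writing to a new string/data structure.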
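
The two passes described above could be sketched roughly as follows. This is only one reading of the idea: the names `TagCounter`, `CountTags` and `RemoveRepeatedTags` are hypothetical, the tag pattern `<[^<>]+/>` is an assumption, and the second pass removes every repeat of a tag anywhere in the input, not only adjacent ones.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagCounter
{
    // Assumed tag shape: '<', anything except angle brackets, then '/>'.
    private static readonly Regex TagPattern = new Regex(@"<[^<>]+/>");

    // First pass: count how often each tag (brackets included) occurs.
    public static Dictionary<string, int> CountTags(string input)
    {
        var counts = new Dictionary<string, int>();
        foreach (Match m in TagPattern.Matches(input))
        {
            int n;
            counts.TryGetValue(m.Value, out n); // n stays 0 for a new tag
            counts[m.Value] = n + 1;
        }
        return counts;
    }

    // Second pass: write each tag on its first occurrence and skip the rest.
    public static string RemoveRepeatedTags(string input)
    {
        Dictionary<string, int> counts = CountTags(input);
        var seen = new HashSet<string>();
        return TagPattern.Replace(input, m =>
        {
            if (counts[m.Value] == 1) return m.Value; // unique tag, keep as is
            return seen.Add(m.Value) ? m.Value : "";  // Add is false on repeats
        });
    }

    public static void Main()
    {
        Console.WriteLine(RemoveRepeatedTags("<text/><text/><words/><words/><words/>"));
        // prints: <text/><words/>
    }
}
```

Text outside the tags is never matched by the pattern, so it passes through unchanged.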

answered 2013-08-19T17:22:45.457