c# - Getting colored text from HTML code

Question

I have some HTML that I want to convert to plain text while keeping only the tags that color text. For example, given the following HTML:

<html>
<body>

This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample</p>
....
and some other tags...
</body>
</html>

I want the output to be:

this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...

score 1 · Accepted Answer

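You can do this with a regular expression, but... you should not parse (X)HTML with regular expressions.

The first regular expression I came up with to solve the problem is: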

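<p(\w|\s|[="])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))">((\w|\s)+)</p>

Group 2 will be the color (# followed by 3 or 6 hex digits), and group 4 will be the text inside the tag. The text is wrapped in an extra pair of parentheses because a repeated group such as (\w|\s)+ only retains its last iteration, that is, a single character.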

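This is clearly not the best solution, since I am no regex guru, and it needs some testing and probably some generalization... but it is still a good starting point.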
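
As a rough illustration of the regex approach above, here is a minimal sketch that uses only System.Text.RegularExpressions. It applies the pattern with the text part wrapped in one extra pair of parentheses, so that group 4 holds the whole text. The class name ColoredParagraphDemo and the method Mark are made up for this example, and the pattern still only handles a p tag whose style attribute ends with a hex color.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColoredParagraphDemo
{
    // Group 2 = color including '#', group 4 = the whole text of the paragraph.
    private static readonly Regex ColoredP = new Regex(
        @"<p(\w|\s|[=""])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))"">((\w|\s)+)</p>",
        RegexOptions.IgnoreCase);

    // Rewrites every matching <p> as <#color>text<>, the format asked for in the question.
    // Anything the pattern does not match is left untouched.
    public static string Mark(string html)
    {
        return ColoredP.Replace(
            html,
            m => "<" + m.Groups[2].Value + ">" + m.Groups[4].Value + "<>");
    }

    public static void Main()
    {
        var html = "<p align=\"center\" style=\"color:#ff9999\">this is only a sample</p>";
        Console.WriteLine(Mark(html)); // prints: <#ff9999>this is only a sample<>
    }
}
```

Stripping the remaining tags (such as b) is a separate step that this sketch does not attempt.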

score 1 · Accepted Answer

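I would use a parser such as HtmlAgilityPack to parse the HTML, and a regular expression to find the color value inside the style attribute.

First, use XPath to find all nodes that have a style attribute with color defined in it: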

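// Needs: using System.Linq; using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectNodes returns null (not an empty collection) when nothing matches
var nodes = doc.DocumentNode
    .SelectNodes("//*[contains(@style, 'color')]")
    ?.ToArray() ?? new HtmlNode[0];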
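
To get a feel for what that XPath expression selects, the following sketch runs the same expression through System.Xml from the .NET base library instead of HtmlAgilityPack. That substitution is an assumption made to keep the example free of external packages, and it only works on well-formed markup. The class name XPathStyleDemo is hypothetical.

```csharp
using System;
using System.Xml;

public static class XPathStyleDemo
{
    // Counts the elements whose style attribute contains the substring "color".
    public static int CountColorStyled(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectNodes("//*[contains(@style, 'color')]").Count;
    }

    public static void Main()
    {
        var xml =
            "<body>" +
            "<p style=\"color:#ff9999\">a</p>" +
            "<p style=\"font-weight:bold\">b</p>" +
            "<span style=\"background-color:red\">c</span>" +
            "</body>";
        // contains() is a plain substring test, so background-color is selected too.
        Console.WriteLine(CountColorStyled(xml)); // prints: 2
    }
}
```

Because the XPath test is only a substring check, the regex applied afterwards is what decides whether a node really carries a text color.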

Next comes the simplest regular expression that matches the color value: (?<=color:\s*)#?\w+

var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
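
The behavior of this pattern on a few style strings can be checked with a small sketch. Simple is the pattern from the answer. Strict is a hypothetical variant, not part of the original answer, that refuses to match when color is preceded by a word character or a hyphen (as in background-color). The names ColorValueDemo, Simple, Strict and First are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColorValueDemo
{
    // Pattern from the answer: whatever follows "color:" (optional '#', then word characters).
    public static readonly Regex Simple =
        new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);

    // Hypothetical stricter variant: "color" must not be the tail of a longer property name.
    public static readonly Regex Strict =
        new Regex(@"(?<=(?<![\w-])color:\s*)#?\w+", RegexOptions.IgnoreCase);

    // Returns the first match, or an empty string when there is none.
    public static string First(Regex regex, string style)
    {
        var m = regex.Match(style);
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        Console.WriteLine(First(Simple, "color:#ff9999"));                      // #ff9999
        Console.WriteLine(First(Simple, "background-color: #fff; color: red")); // #fff
        Console.WriteLine(First(Strict, "background-color: #fff; color: red")); // red
    }
}
```

Whether the stricter variant is worth the extra complexity depends on how often the input mixes background-color and color in one style attribute.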

Then loop over those nodes and, where the regex matches, wrap the node's inner HTML in HTML-encoded marker tags (you will see why in a moment):

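foreach (var node in nodes)
{
    var style = node.Attributes["style"].Value;
    // Run the regex once and reuse the result
    var match = colorRegex.Match(style);
    if (match.Success)
    {
        var color = match.Value;
        // Encode the markers so InnerText does not strip them as tags later
        // (HttpUtility lives in System.Web)
        node.InnerHtml =
            HttpUtility.HtmlEncode("<" + color + ">") +
            node.InnerHtml +
            HttpUtility.HtmlEncode("</" + color + ">");
    }
}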

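Finally, take the document's inner text and HTML-decode it (decoding is needed because InnerText strips all tags, which is why the markers were encoded):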

var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);

This should return something like:

This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...
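
The reason for encoding the markers before reading the text can be seen in a small sketch that needs no external package. It uses a crude regex tag stripper as a stand-in for InnerText (an assumption for illustration only, not how HtmlAgilityPack works internally) and System.Net.WebUtility in place of HttpUtility. The names EncodeMarkersDemo, StripTags and Wrap are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EncodeMarkersDemo
{
    // Crude stand-in for InnerText: drops anything that looks like a tag.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]+>", "");
    }

    // Wraps the text in color markers inside a <p>, optionally HTML-encoding the markers.
    public static string Wrap(string text, string color, bool encode)
    {
        string open = "<" + color + ">";
        string close = "</" + color + ">";
        if (encode)
        {
            open = WebUtility.HtmlEncode(open);
            close = WebUtility.HtmlEncode(close);
        }
        return "<p>" + open + text + close + "</p>";
    }

    public static void Main()
    {
        // Raw markers look like tags and are stripped along with everything else.
        Console.WriteLine(StripTags(Wrap("this is only a sample", "#ff9999", false)));
        // prints: this is only a sample

        // Encoded markers survive the stripping and reappear after decoding.
        Console.WriteLine(WebUtility.HtmlDecode(
            StripTags(Wrap("this is only a sample", "#ff9999", true))));
        // prints: <#ff9999>this is only a sample</#ff9999>
    }
}
```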

Of course, you can refine this to suit your own needs.
