c# - KeyNotFoundException when using the HtmlEntity.DeEntitize() method

Question

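I'm currently working on a scraper written in C# 4.0. I use a variety of tools, including .NET's built-in WebClient and RegEx functionality. For one part of the scraper I parse an HTML document with HtmlAgilityPack. I had everything working the way I wanted and then did some code cleanup.

I'm using the HtmlEntity.DeEntitize() method to clean up the HTML. I ran a few tests and the method seemed to work well. But once I used it in my actual code, I kept getting a KeyNotFoundException. The exception carries no further details, so I'm lost. My code looks like this: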

WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

The downloaded HTML is UTF-8 encoded. How can I get around the KeyNotFoundException?

score 3 · Accepted Answer

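As I understand it, the problem is caused by non-standard characters, for example Chinese or Japanese ones.

Once you have found out which character causes the problem, you may be able to search for a suitable htmlagilitypack patch here.

If you want to modify the htmlagilitypack source code yourself, this may help you.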
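
One way to find out which entities cause the problem is to feed every distinct entity-shaped token to the decoder on its own and record the ones that throw. The following is only a sketch: the class name EntityDiagnostics, the method name FindFailingEntities, and the regular expression are illustrative assumptions, not part of HtmlAgilityPack. The decoder is passed in as a delegate so the helper itself has no dependency on the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class EntityDiagnostics
{
    // Matches tokens shaped like an entity: &name; &#123; &#x1F;
    private static readonly Regex EntityPattern =
        new Regex(@"&#?[A-Za-z0-9]+;", RegexOptions.Compiled);

    // 'decode' is the decoder under test, e.g. HtmlEntity.DeEntitize.
    // Returns each distinct entity for which the decoder threw KeyNotFoundException.
    public static List<string> FindFailingEntities(string html, Func<string, string> decode)
    {
        var failing = new List<string>();
        var seen = new HashSet<string>();
        foreach (Match m in EntityPattern.Matches(html))
        {
            if (!seen.Add(m.Value)) continue; // test each distinct entity only once
            try
            {
                decode(m.Value);
            }
            catch (KeyNotFoundException)
            {
                failing.Add(m.Value);
            }
        }
        return failing;
    }
}
```

With HtmlAgilityPack referenced, it could be called as EntityDiagnostics.FindFailingEntities(html, HtmlEntity.DeEntitize). Note that the pattern only covers well-formed entity tokens; a bare ampersand followed by other text and a semicolon would not be reported by it.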

score 3 · Accepted Answer

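Four years later, I have the same problem with some encoded characters (version 1.4.9.5). In my case only a limited set of characters can cause the problem, so I just wrote a function that performs the replacements: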

// to be called before HtmlEntity.DeEntitize
public static string ReplaceProblematicHtmlEntities(string str)
{
    var sb = new StringBuilder(str);
    //TODO: add other replacements, as needed
    return sb.Replace("&period;", ".")
        .Replace("&abreve;", "ă")
        .Replace("&acirc;", "â")
        .ToString();
}

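In my case the string contains both HTML-encoded characters and UTF-8 characters, but the problem only affects certain encoded characters.

This is not an elegant solution, but it is a quick fix for any text with a limited (and known) number of problematic encoded characters.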

score 3 · Accepted Answer

My HTML had a block of text like this:

... found in sections: 233.9 & 517.3; ...

Despite the space and the decimal point, "& 517.3;" was being interpreted as a Unicode character.

Simply HTML-encoding the raw text fixed the problem for me.

string raw = "sections: 233.9 & 517.3;";
// turn '&' into '&amp;', etc, before DeEntitizing
string encoded = System.Web.HttpUtility.HtmlEncode(raw);
string deEntitized = HtmlEntity.DeEntitize(encoded);
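
HtmlEncode also escapes angle brackets and re-encodes the ampersand of any entity that is already present, so it fits raw text better than full markup. A narrower variant, sketched below under that assumption, escapes only ampersands that do not start a well-formed entity. The class name AmpersandEscaper, the method name EscapeBareAmpersands, and the regular expression are illustrative and not taken from HtmlAgilityPack or from the answer above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class AmpersandEscaper
{
    // An '&' that is NOT followed by a named (&amp;), decimal (&#169;)
    // or hexadecimal (&#xA9;) entity body ending in ';'.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)",
        RegexOptions.Compiled);

    // Turns only the bare ampersands into "&amp;" and leaves existing entities alone.
    public static string EscapeBareAmpersands(string text)
    {
        return BareAmpersand.Replace(text, "&amp;");
    }
}
```

For the text block above, EscapeBareAmpersands("sections: 233.9 & 517.3;") yields "sections: 233.9 &amp; 517.3;", which can then be passed to HtmlEntity.DeEntitize.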

score 0 · Accepted Answer

In my case, I solved this by updating HtmlAgilityPack to version 1.5.0.

answered 2018-08-22T04:49:55.247
