c# - Loading an HtmlDocument from HTML that contains specific symbols (<, >)

Question

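I am creating an HtmlDocument and populating it with LoadHtml(string). My input HTML string sometimes contains the symbols < and > inside the text, so the HTML is parsed incorrectly. For example:

My HTML is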
<p>Value < 20 A B C</p>

In this case, my document's OutputHtml is
<p>Value < 20="" a="" b=""></p>

Maybe I have to set some flag on HtmlDocument, but I have not found anything useful.

P.S. HtmlNode shows the same behavior.

score 0 · Accepted Answer

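The best way to solve the problem is to replace the character < with &lt; (there is no need to change the character >).

To tell when a < starts a tag and when it means "less than", you can use the if condition in this code: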

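using System.Text.RegularExpressions;

public static string CreateCorrectHtmlDoc(string htmlDoc)
{
    int i = 0;
    // Look for every '<' in the input.
    while ((i = htmlDoc.IndexOf('<', i)) != -1)
    {
        i += 1; // i now points at the character right after '<'
        if (i >= htmlDoc.Length)
        {
            // A trailing '<' cannot start a tag, so escape it and stop.
            htmlDoc = htmlDoc.Substring(0, i - 1) + "&lt;";
            break;
        }
        string next = htmlDoc[i].ToString();
        string afterNext = i + 1 < htmlDoc.Length ? htmlDoc[i + 1].ToString() : "";
        // Treat '<' as "less than" when it is followed by a digit or '-',
        // or by a non-letter (other than '!') and then a digit or '-'.
        if (Regex.IsMatch(next, "\\d|-") || (Regex.IsMatch(next, "[^a-zA-Z!]") && Regex.IsMatch(afterNext, "\\d|-")))
        {
            // Replace only the '<' itself; keep the character that follows it.
            htmlDoc = htmlDoc.Substring(0, i - 1) + "&lt;" + htmlDoc.Substring(i);
        }
    }
    return htmlDoc;
}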

I am using it and it works well.
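
For comparison, here is a minimal sketch of a narrower variant of the same idea, written as a single regex replace. It is not the code from the answer: the class name HtmlLessThanEscaper and the method EscapeBareLessThan are hypothetical, and it assumes that a "less than" sign is always followed by optional whitespace and then a digit or a minus sign. Tag names cannot start with a digit or '-', so real tags and comments should be left alone under that assumption.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class HtmlLessThanEscaper
{
    // Matches a '<' followed by optional whitespace and then a digit or '-'.
    // The lookahead means only the '<' itself is consumed and replaced.
    private static readonly Regex BareLessThan =
        new Regex(@"<(?=\s*[\d-])", RegexOptions.Compiled);

    // Returns the input with every such '<' replaced by the entity "&lt;".
    // Null or empty input is returned as an empty string.
    public static string EscapeBareLessThan(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }
        return BareLessThan.Replace(html, "&lt;");
    }
}
```

The escaped string would then be passed to LoadHtml in place of the raw input. Cases such as "a < b" with letters on both sides are not covered by this assumption and would still need separate handling.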
