1

What libraries can you recommend besides HtmlAgilityPack and Tidy?

To apply XPath queries to HTML content, I use either Tidy (as a console program, with some tricks to get a C# XmlDocument) or Html Agility Pack. Both libraries are outdated: HAP has not changed since May 2010 and Tidy since 2008. I had a bad experience with HAP because it did not fix the document structure by closing tags, even after applying the following trick:

using System.IO;
using System.Text;
using HtmlAgilityPack;

// Creates a document configured to auto-close tags and write its output as XML.
public static HtmlDocument MakeEmptyDocument()
{
    HtmlDocument doc = new HtmlDocument();
    doc.OptionAutoCloseOnEnd = true;
    doc.OptionFixNestedTags = true;
    doc.OptionOutputAsXml = true;
    doc.OptionWriteEmptyNodes = true;
    return doc;
}

// Parses the HTML, serializes it as XML, then reloads the serialized text.
public static HtmlDocument LoadHtmlDocumentFromString(string content)
{
    HtmlDocument doc = MakeEmptyDocument();
    doc.LoadHtml(content);
    StringBuilder sb = new StringBuilder();
    using (StringWriter sw = new StringWriter(sb))
        doc.Save(sw);

    using (StringReader sr = new StringReader(sb.ToString()))
        doc.Load(sr);
    return doc;
}

Generally I preferred Tidy, but now I have a case where it completely breaks a quite simple document and removes a BIG part of its content. So it looks like we need alternatives that can be used from .NET.
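
For context, the XPath step itself is simple once some tool has produced well-formed XML. The following is only a minimal sketch of that step, not of any repair library: the class name, the helper name, and the sample markup are hypothetical, and it assumes the repaired output declares the XHTML namespace (as XHTML output typically does), which is why a prefix is registered before querying.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathOverXhtmlSketch
{
    // Hypothetical helper: returns the text of every <p> element
    // found in an already well-formed XHTML string.
    public static List<string> SelectParagraphTexts(string xhtml)
    {
        XmlDocument xml = new XmlDocument();
        xml.LoadXml(xhtml);

        // Elements in the XHTML namespace are not matched by an
        // unprefixed XPath name, so bind a prefix ("x" is arbitrary).
        XmlNamespaceManager ns = new XmlNamespaceManager(xml.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        List<string> texts = new List<string>();
        foreach (XmlNode p in xml.SelectNodes("//x:p", ns))
            texts.Add(p.InnerText);
        return texts;
    }

    public static void Main()
    {
        // Assumed sample input standing in for a repair tool's output.
        string xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
            "<body><p>one</p><p>two</p></body></html>";

        foreach (string text in SelectParagraphTexts(xhtml))
            Console.WriteLine(text);
    }
}
```

If the repaired markup carries no namespace declaration, the prefix can be dropped and a plain `//p` query used instead.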

4

1 Answer

0

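The Tidy project has been taken over by HTACG (the HTML Tidy Advocacy Community Group), which has now released tidy5 together with the libtidy library (as of late 2015). The library provides a C interface that "can be called from a vast number of programming languages". See the following: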

HTML Tidy project (developer section)

Answered 2016-01-09T20:52:10.307