What libraries can you recommend other than HtmlAgilityPack and Tidy?
To apply XPath queries to HTML content, I use either Tidy (run as a console program, with some tricks to get a C# XmlDocument out of it) or Html Agility Pack. Both libraries are outdated: HAP has not changed since May 2010, and Tidy not since 2008. I had a bad experience with HAP because it did not fix the document structure by closing unclosed tags, even after I applied the following trick:
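// Requires: using System.IO; using System.Text; using HtmlAgilityPack;

// Creates a document configured to repair markup and emit XML.
public static HtmlDocument MakeEmptyDocument()
{
    HtmlDocument doc = new HtmlDocument();
    doc.OptionAutoCloseOnEnd = true;
    doc.OptionFixNestedTags = true;
    doc.OptionOutputAsXml = true;
    doc.OptionWriteEmptyNodes = true;
    return doc;
}

// Parses the HTML, saves it as XML, then reloads that output into the same document.
public static HtmlDocument LoadHtmlDocumentFromString(string content)
{
    HtmlDocument doc = MakeEmptyDocument();
    doc.LoadHtml(content);

    // First pass: write the parsed document out as XML.
    StringBuilder sb = new StringBuilder();
    using (StringWriter sw = new StringWriter(sb))
    {
        doc.Save(sw);
    }

    // Second pass: reload the XML output.
    using (StringReader sr = new StringReader(sb.ToString()))
    {
        doc.Load(sr);
    }
    return doc;
}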
In general I preferred Tidy, but I now have a case where it completely breaks a fairly simple document and removes a large part of its content. So it looks like we need alternatives that can be used from .NET.