I have some .NET code that ingests HTML files and extracts text from them. I am using HtmlAgilityPack to do the extraction. Previously I wanted to extract most of the text in the document, so it worked fine. Now the requirements have changed and I need to extract text only from the main body of the document. Suppose I scraped the HTML from a news webpage: I just want the text of the article, not the ads, the titles of other (albeit related) articles, headers, footers, etc.
Is it possible to modify my calls to HtmlAgilityPack to extract only the main text? Or is there an alternative way to do the extraction?
Here's the current block of code that gets text from HTML:

    using System.IO;
    using HtmlAgilityPack;

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;
            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;
                // get text
                html = ((HtmlTextNode) node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;
                // check the text is meaningful and not just whitespace
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;
            case HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }
                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

So, ideally, what I want is to let HtmlAgilityPack determine which parts of the input HTML constitute the "main" text block and output only those elements. I do not know what the structure of the input HTML will be, but I do know that it will vary a lot (previously it was much more static).
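
To make the goal concrete, here is a rough, untested sketch of the kind of heuristic I have in mind. As far as I know HtmlAgilityPack has no built-in "main content" detection, so this assumes the article body is the element whose direct `<p>` children hold the most text outside of links. `MainTextSketch` and `ExtractMainText` are hypothetical names of my own, not part of the library, and the list of tags to discard is a guess.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class MainTextSketch
{
    // Assumption: these elements never hold article text.
    private static readonly string[] JunkTags =
        { "script", "style", "nav", "header", "footer", "aside" };

    // Hypothetical helper: returns the paragraphs of the element whose direct
    // <p> children contain the most non-link text, joined with CRLF.
    public static string ExtractMainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect first, then remove, so the tree is not modified while enumerating.
        var junk = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Comment || JunkTags.Contains(n.Name))
            .ToList();
        foreach (var n in junk)
        {
            n.Remove();
        }

        HtmlNode best = null;
        int bestScore = 0;
        foreach (var candidate in doc.DocumentNode.Descendants()
                     .Where(n => n.NodeType == HtmlNodeType.Element))
        {
            // Score = characters in direct <p> children, not counting link text,
            // so blocks made up of links to other articles score close to zero.
            int score = 0;
            foreach (var p in candidate.ChildNodes.Where(c => c.Name == "p"))
            {
                int linkChars = p.Descendants("a").Sum(a => a.InnerText.Length);
                score += p.InnerText.Length - linkChars;
            }
            if (score > bestScore)
            {
                bestScore = score;
                best = candidate;
            }
        }

        if (best == null)
        {
            return string.Empty; // no paragraph text found at all
        }

        var paragraphs = best.ChildNodes
            .Where(c => c.Name == "p")
            .Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim())
            .Where(t => t.Length > 0);
        return string.Join("\r\n", paragraphs);
    }
}
```

Something like `string text = MainTextSketch.ExtractMainText(html);` would then replace the call to `ConvertHtml`. This would obviously miss pages that do not wrap the article in `<p>` tags, which is why I am asking whether there is a more robust way.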