c# - HtmlAgilityPack: parsing blocks of text

Question

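I am building a small web analysis tool and need some way to extract every block of text at a given URL that contains more than X words.

The method I currently use looks like this: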

    public string getAllText(string _html)
    {
        string _allText = "";
        try
        {
            HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
            document.LoadHtml(_html);

            var root = document.DocumentNode;
            var sb = new StringBuilder();
            foreach (var node in root.DescendantNodesAndSelf())
            {
                // Only leaf nodes carry text of their own.
                if (!node.HasChildNodes)
                {
                    string text = node.InnerText;
                    if (!string.IsNullOrEmpty(text))
                        sb.AppendLine(text.Trim());
                }
            }

            _allText = sb.ToString();
        }
        catch (Exception)
        {
            // Parsing errors are ignored; an empty string is returned.
        }

        _allText = System.Web.HttpUtility.HtmlDecode(_allText);

        return _allText;
    }

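The problem here is that I get all of the text back, whether it is a long passage, a footer text with 3 words, and so on.

I want to analyze the actual content of the page, so my idea is to somehow parse only the text that is likely to be content (i.e. blocks of text with more than X words).

Any ideas on how this could be achieved?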

score 1 · Accepted Answer

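Well, a first approach could be a simple word-count analysis on each node.InnerText value using the string.Split function:

    // A null separator array makes Split treat any whitespace as a delimiter.
    string[] words = text.Split((string[]) null, StringSplitOptions.RemoveEmptyEntries);

and append only the text for which words.Length is greater than 3 (or whatever threshold you choose for X).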

See also this question's answer for more tricks on collecting the raw text.
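As a rough illustration of the word-count filter described above, here is a minimal sketch that uses only the .NET base library. The names TextBlockFilter, CountWords, FilterBlocks and the minWords parameter are made up for this example and are not part of HtmlAgilityPack; the input is assumed to be the already-extracted text of each node.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class TextBlockFilter
{
    // Counts whitespace-separated words. Passing a null separator array
    // makes string.Split treat any whitespace character as a delimiter.
    public static int CountWords(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return 0;

        return text.Split((string[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Returns the trimmed blocks that contain more than minWords words.
    // Null entries are treated as empty and therefore dropped.
    public static List<string> FilterBlocks(IEnumerable<string> blocks, int minWords)
    {
        return blocks
            .Select(b => (b ?? string.Empty).Trim())
            .Where(b => CountWords(b) > minWords)
            .ToList();
    }
}
```

In the loop from the question, the same check could replace the IsNullOrEmpty test, for example `if (TextBlockFilter.CountWords(text) > minWords) sb.AppendLine(text.Trim());`. A plain word count is only a heuristic: a long navigation bar can still pass it, so the threshold probably needs tuning per site.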
