c# - Parsing the useful content of a web page in C#

Question

Possible duplicate:
Parsing a web page

I am trying to parse the content of a web page in C#. This is the code I am using:

// Requires: using System.IO; using System.Net;
string html;
WebRequest request = WebRequest.Create("URL");
using (WebResponse response = request.GetResponse())
using (Stream data = response.GetResponseStream())
using (StreamReader sr = new StreamReader(data))
{
    // Read the whole response body into a string.
    html = sr.ReadToEnd();
}

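The problem is that this gives me everything the HTML contains.

Do you have any suggestions for getting at the useful data in a "clean" way, or do I have to build my own parser? For example: posts that consist of a title and the text that goes with it, in a blog-like format.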

score 5 · Accepted Answer

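If you really are trying to parse blog posts out of a web page, don't do it that way, and don't even consider using HTML Agility Pack.

Instead, you should use SyndicationFeed and its related classes, which have been built into the .NET Framework since v3.5. They are tailor-made for consuming and breaking apart RSS feeds.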
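As a rough illustration of that suggestion, here is a minimal sketch that loads a feed with SyndicationFeed and lists each item's title and summary. The class name FeedExample, the method ReadPosts, and the sample feed are made up for this example. It assumes the project references System.ServiceModel (or, on newer .NET, the System.ServiceModel.Syndication NuGet package) and that the blog actually publishes an RSS or Atom feed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.ServiceModel.Syndication;
using System.Xml;

public static class FeedExample
{
    // Hypothetical helper: returns "title: summary" for every item in the feed XML.
    public static List<string> ReadPosts(string feedXml)
    {
        var posts = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(feedXml)))
        {
            // Load detects RSS 2.0 and Atom 1.0 documents.
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                // Title and Summary are optional in a feed, so guard against null.
                string title = item.Title != null ? item.Title.Text : "";
                string summary = item.Summary != null ? item.Summary.Text : "";
                posts.Add(title + ": " + summary);
            }
        }
        return posts;
    }

    public static void Main()
    {
        // Made-up sample feed; a real program would read the blog's feed instead.
        const string sample =
            "<rss version='2.0'><channel>" +
            "<title>Demo blog</title><link>http://example.com/</link>" +
            "<description>Sample</description>" +
            "<item><title>First post</title><description>Hello</description></item>" +
            "</channel></rss>";

        foreach (string post in ReadPosts(sample))
        {
            Console.WriteLine(post);
        }
    }
}
```

To read a live feed rather than a string, XmlReader.Create also accepts a URI, so the reader could be created directly from the feed's address.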

score 4 · Accepted Answer

Just use Html Agility Pack. It is very powerful!

You can find many tutorials on the Internet, for example http://runtingsproper.blogspot.fr/2009/09/htmlagilitypack-article-series.html
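For a sense of what that looks like in practice, here is a minimal sketch using Html Agility Pack's XPath support. It assumes the HtmlAgilityPack NuGet package is installed, and the markup (div elements with class "post", "title" and "content") is invented for the example. A real page will need its own XPath expressions.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of the framework

public static class AgilityExample
{
    // Hypothetical helper: returns "title: text" for each post container in the page.
    public static List<string> ExtractPosts(string html)
    {
        var results = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection posts = doc.DocumentNode.SelectNodes("//div[@class='post']");
        if (posts == null)
        {
            return results;
        }

        foreach (HtmlNode post in posts)
        {
            // The leading dot keeps the search relative to the current post node.
            HtmlNode title = post.SelectSingleNode(".//div[@class='title']");
            HtmlNode body = post.SelectSingleNode(".//div[@class='content']");
            string titleText = title != null ? title.InnerText.Trim() : "";
            string bodyText = body != null ? body.InnerText.Trim() : "";
            results.Add(titleText + ": " + bodyText);
        }
        return results;
    }

    public static void Main()
    {
        // Made-up markup standing in for the downloaded page.
        string html =
            "<html><body>" +
            "<div class='post'><div class='title'>First post</div>" +
            "<div class='content'>Hello world.</div></div>" +
            "<div class='post'><div class='title'>Second post</div>" +
            "<div class='content'>More text.</div></div>" +
            "</body></html>";

        foreach (string post in ExtractPosts(html))
        {
            Console.WriteLine(post);
        }
    }
}
```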

score 1 · Accepted Answer

Use Regex. To parse the data between two tags (which I assume is what you want), you can do the following:

string match = Regex.Match(html, "<a>(?<inbetween>.+?)</a>").Groups["inbetween"].Value;

Unlike the Agility Pack, using a Regex requires no external dependency, which is very useful for portable, standalone applications.
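The one-liner above only returns the first match and only matches a tag written without attributes. A slightly more general sketch, using only the framework's own Regex class, might look like the following. The class name RegexExample, the method Between, and the sample string are hypothetical, and the pattern is an assumption about what "between two tags" should mean. It does not handle nested tags of the same name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexExample
{
    // Hypothetical helper: returns the raw text between every <tag ...> and </tag> pair.
    public static List<string> Between(string html, string tag)
    {
        string name = Regex.Escape(tag);
        // [^>]* allows attributes on the opening tag; .*? is lazy so each
        // match stops at the first closing tag that follows.
        string pattern = "<" + name + @"\b[^>]*>(?<inbetween>.*?)</" + name + @"\s*>";

        // Singleline lets '.' match newlines; IgnoreCase accepts <H2> as well as <h2>.
        RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, options))
        {
            results.Add(m.Groups["inbetween"].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Made-up markup standing in for the downloaded page.
        string html = "<h2 class=\"title\">First post</h2><p>text</p><h2>Second post</h2>";
        foreach (string title in Between(html, "h2"))
        {
            Console.WriteLine(title);
        }
    }
}
```

This stays dependency-free, but it treats the page as plain text, so unusual or malformed markup can produce wrong or missing matches.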
