c# - Pulling data from a webpage, parsing it for specific pieces, and displaying it

Question

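I have been using this site for a long time to find answers to my questions, but I was not able to find the answer to this one.

I am working with a small group on a project. We are building a small "game trading" website that lets people register, enter a game they want to trade, and accept trades from others or request a trade.

Our site is working well ahead of schedule, so we are trying to add more to it. One thing I want to do myself is link the games that are entered to Metacritic.

Here is what I need to do. I need to (using ASP and C# in Visual Studio 2012) get the correct game page on Metacritic, pull its data, parse it for specific parts, and then display the data on our page.

Essentially, when you choose a game you want to trade, we want a small div to appear with the game's information and rating. I want to do it this way to learn more and to get something out of this project that I did not have to start with.

I was wondering if anyone could tell me where to start. I don't know how to pull data from a page. I am still trying to figure out whether I need to write something that automatically searches for the game's title and finds the page that way, or whether I can find some way to go straight to the game's page. And once I have the data, I don't know how to pull the specific information I need out of it.

One thing that doesn't make this any easier is that I am learning C++ along with C# and ASP, so I keep getting my wires crossed. If anyone could point me in the right direction, it would be a big help. Thanks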

score 55 · Accepted Answer

This small example uses HtmlAgilityPack and XPath selectors to get the desired elements.

// Requires the HtmlAgilityPack NuGet package and: using HtmlAgilityPack;
protected void Page_Load(object sender, EventArgs e)
{
    string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
    var web = new HtmlAgilityPack.HtmlWeb();
    HtmlDocument doc = web.Load(url);   // downloads and parses the page

    // XPath expressions copied from the browser's developer tools.
    // SelectNodes returns null when nothing matches, so these lines throw
    // if the page layout changes; guard them in real code.
    string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
    string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
    string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}

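An easy way to obtain the XPath for a given element is to use your web browser's developer tools (I use Chrome):

Open the developer tools (F12 or Ctrl + Shift + C on Windows, or Command + Shift + C on Mac).
Select the element in the page that you want the XPath for.
Right-click the element in the "Elements" tab.
Click "Copy as XPath".

You can paste it exactly like that into C# (as shown in my code), but make sure to escape the quotes.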

Make sure you use some error handling, because web scraping code can start failing as soon as the site changes the HTML structure of its pages.
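One way to apply that advice is to treat every XPath lookup as something that may find nothing. The sketch below assumes the HtmlAgilityPack NuGet package is installed; `ScoreReader` and `TryGetText` are hypothetical names made up for illustration and are not part of the library.

```csharp
using System;
using HtmlAgilityPack;

public static class ScoreReader
{
    // Returns the trimmed inner text of the first node matching the XPath,
    // or null when nothing matches (for example after a site redesign).
    public static string TryGetText(HtmlDocument doc, string xpath)
    {
        // SelectSingleNode returns null instead of throwing when there is no match.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null;
        }
        // Decode entities such as &amp; and strip surrounding whitespace.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

A caller can then write `string metascore = ScoreReader.TryGetText(doc, xpath) ?? "n/a";` and show a fallback in the div instead of letting the page crash with a NullReferenceException.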

Edit

Per @knocte's suggestion, here is the link to the NuGet package for HtmlAgilityPack:

https://www.nuget.org/packages/HtmlAgilityPack/

score 10 · Accepted Answer

I looked, and Metacritic.com doesn't have an API.

You can use an HttpWebRequest to get the contents of a website as a string.

using System;
using System.Net;
using System.IO;
using System.Text;              // Encoding
using System.Windows.Forms;     // MessageBox (desktop only; log the error instead in ASP.NET)

string result = null;
string url = "http://www.stackoverflow.com";
WebResponse response = null;
StreamReader reader = null;

try
{
    // Issue a plain GET request and read the whole response body as UTF-8 text.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "GET";
    response = request.GetResponse();
    reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
    result = reader.ReadToEnd();
}
catch (Exception ex)
{
    // handle error
    MessageBox.Show(ex.Message);
}
finally
{
    // Release the stream and the connection whether or not the request succeeded.
    if (reader != null)
        reader.Close();
    if (response != null)
        response.Close();
}

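Then you can parse the string for the data you want by taking advantage of Metacritic's use of meta tags. Here is the information they make available in meta tags:

og:title
og:type
og:url
og:image
og:site_name
og:description

The format of each tag is: <meta name="og:title" content="In a World...">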
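Pulling those tags out of the downloaded string can be sketched with a regular expression from the standard library. This is a minimal illustration, not a robust HTML parser: `OpenGraphReader` is a hypothetical helper name, the pattern assumes the name attribute comes before content and that both use double quotes, and accepting `property=` in addition to `name=` is an extra allowance that the answer does not mention.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class OpenGraphReader
{
    // Matches <meta name="og:..." content="..."> (or property="og:...").
    private static readonly Regex MetaTag = new Regex(
        "<meta\\s+(?:name|property)\\s*=\\s*\"(og:[^\"]+)\"\\s+content\\s*=\\s*\"([^\"]*)\"",
        RegexOptions.IgnoreCase);

    // Returns a map from tag name (e.g. "og:title") to its content value.
    public static Dictionary<string, string> Parse(string html)
    {
        var tags = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in MetaTag.Matches(html))
        {
            // Decode entities such as &amp; in the attribute value.
            tags[m.Groups[1].Value] = WebUtility.HtmlDecode(m.Groups[2].Value);
        }
        return tags;
    }
}
```

With the `result` string from the request above, `OpenGraphReader.Parse(result)` gives a dictionary you can query with `TryGetValue("og:description", out description)`. If the markup turns out to be less regular than assumed here, an HTML parser such as HtmlAgilityPack is the safer choice.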

score 10 · Accepted Answer

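I recommend Dcsoup. There is a NuGet package for it, and it uses CSS selectors, so it will feel familiar if you use jQuery. I have tried others, but it is the best and easiest to use that I have found. There is not much documentation, but it is open source and a port of the Java jsoup library, which is well documented. (There is also documentation for the .NET API.) I absolutely love it.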

var timeoutInMilliseconds = 5000;
var uri = new Uri("http://www.metacritic.com/game/pc/fallout-4");
var doc = Supremes.Dcsoup.Parse(uri, timeoutInMilliseconds);

// <span itemprop="ratingValue">86</span>
var ratingSpan = doc.Select("span[itemprop=ratingValue]");
int ratingValue = int.Parse(ratingSpan.Text);

// selectors match both critic and user scores
var scoreDiv = doc.Select("div.score_summary");
var scoreAnchor = scoreDiv.Select("a.metascore_anchor");
int criticRating = int.Parse(scoreAnchor[0].Text);
float userRating = float.Parse(scoreAnchor[1].Text);
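Note that `int.Parse` and `float.Parse` throw if the selected text is not a number, and `float.Parse` uses the server's current culture, so a value like "7.9" can be misread on a machine that uses a comma as the decimal separator. A more defensive variant might look like the sketch below; `ScoreParser` and its method names are hypothetical, and only standard library calls are used.

```csharp
using System.Globalization;

public static class ScoreParser
{
    // Returns the critic score, or null when the text is not a whole number.
    public static int? ParseCriticScore(string text)
    {
        int value;
        bool ok = int.TryParse(text, NumberStyles.Integer, CultureInfo.InvariantCulture, out value);
        return ok ? value : (int?)null;
    }

    // Returns the user score, or null when the text is not a number.
    // The invariant culture always treats '.' as the decimal separator.
    public static float? ParseUserScore(string text)
    {
        float value;
        bool ok = float.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
        return ok ? value : (float?)null;
    }
}
```

For example, `ScoreParser.ParseUserScore(scoreAnchor[1].Text)` yields null rather than an exception when the element holds placeholder text instead of a rating.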

score 1 · Accepted Answer

I recommend WebsiteParser. It is based on HtmlAgilityPack (mentioned by Hanlet Escaño), but it makes web scraping easier by using attributes and CSS selectors:

class PersonModel
{
    [Selector("#BirdthDate")]
    [Converter(typeof(DateTimeConverter))]
    public DateTime BirdthDate { get; set; }
}

// ...

PersonModel person = WebContentParser.Parse<PersonModel>(html);

NuGet link
