c# - How do I extract information from HTML in C#?

Question

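Can anyone show me how to extract information from HTML in C#? I am using the WinRT class libraries from C#.

I want to extract the main content and the images from http://lifehacker.com/5923026/remains-of-the-day-google-image-search-gets-knowledge-graph-integration.

Here is part of the page source: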

<html xmlns="http://www.w3.org/1999/xhtml" class="feature_chompcommentimages feature_s3upload feature_switch feature_powwowtest" xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>

  **<title>Remains of the Day: Google Image Search Gets Knowledge Graph Integration</title>**
          <meta http-equiv="content-type" content="text/html; charset=utf-8" />
  <meta http-equiv="content-language" content="en" />
  <meta http-equiv="refresh" content="86400" />
  <meta name="robots" content="all" />
                      <meta name="keywords" content="For What It&#039;s Worth, remainders, in brief, Lifehacker" />
                  <meta property="fb:page_id" content="7568536355" />
                              <meta name="title" content="Remains of the Day: Google Image Search Gets Knowledge Graph Integration" />
      **<meta name="description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS. " />**
                      <link rel="image_src" href="http://img.gawkerassets.com/img/17rm77tdcfd31jpg/original.jpg" />
          <meta property="og:image" content="http://img.gawkerassets.com/img/17rm77tdcfd31jpg/xlarge.jpg" />
                  <meta property="og:site_name" content="Lifehacker"/>
      <meta property="og:title" content="Remains of the Day: Google Image Search Gets Knowledge Graph Integration" />
      <meta property="og:description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS." />
      <meta property="og:type" content="article" />

I can already use SyndicationFeed.Title.Text (using Windows.Web.Syndication;) to extract "Remains of the Day: Google Image Search Gets Knowledge Graph Integration".

Please help me extract the content of this tag:

<meta name="description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS. " />
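For a single, well-known tag like this one, a minimal sketch using only the base class library might look like the following. It assumes the `name` attribute comes before `content` and that both use double quotes, as in the snippet above; `MetaExtractor` and `GetDescription` are hypothetical names, and a regular expression is not a general HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class MetaExtractor
{
    // Assumption: name="description" precedes content="..." and both are
    // double-quoted, matching the page source quoted in the question.
    private static readonly Regex DescriptionPattern = new Regex(
        @"<meta\s+name\s*=\s*""description""\s+content\s*=\s*""([^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns the decoded description text, or null when no tag matches.
    public static string GetDescription(string html)
    {
        Match match = DescriptionPattern.Match(html);
        if (!match.Success)
        {
            return null;
        }

        // Decode entities such as &#039; and drop the trailing space.
        return WebUtility.HtmlDecode(match.Groups[1].Value).Trim();
    }
}
```

Calling `MetaExtractor.GetDescription(html)` on the downloaded page source would then return the description as a plain string, provided the markup keeps this shape.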

I also need to extract the main content inside

<div id="container"> <script type="text/javascript">

<!-- %JUMP:More &raquo;% --><\/p>\n<ul>\n<li><a href=\"http:\/\/insidesearch.blogspot.com\/2012\/07\/find-smarter-more-comprehensive-search.html\">Find Smarter, More Comprehensive Search by Image Results<\/a> <i>Google updated its Image Search with a couple of new features. One being an expanded view that lets searchers see the text around matching images, and the other being added support for Knowledge Graph to image search results, which means Google will attempt to identity any photo that you upload or link to and provide more information about the subject.<\/i> [Google Blog]<\/li>\n<li>

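The content I want is: "Find Smarter, More Comprehensive Search by Image Results" "Google updated its Image Search with a couple of new features. One being an expanded view that lets searchers see the text around matching images, and the other being added support for Knowledge Graph to image search results, which means Google will attempt to identify any photo that you upload or link to and provide more information about the subject. [Google Blog]"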
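Because this fragment sits inside a script block as a JavaScript-escaped string, one possible approach is to undo the escapes first and then strip the tags. The sketch below assumes the fragment has already been cut out of the script as a string; `ScriptTextCleaner` and `ToPlainText` are hypothetical names, and the unescaping is deliberately naive (it only handles `\/`, `\"` and `\n`).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class ScriptTextCleaner
{
    // Converts a JavaScript-escaped HTML fragment into plain text.
    public static string ToPlainText(string escapedFragment)
    {
        // Undo the three escapes seen in the quoted fragment.
        // Assumption: no other escape sequences (such as \\ or \uXXXX) occur.
        string html = escapedFragment
            .Replace("\\/", "/")
            .Replace("\\\"", "\"")
            .Replace("\\n", "\n");

        // Replace tags and comments with a space so adjacent words stay apart.
        string text = Regex.Replace(html, "<[^>]+>", " ");

        // Decode entities after stripping, so a decoded "<" is not treated as a tag.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, "\\s+", " ").Trim();
    }
}
```

The title and body come back as one string here; splitting them apart would need the `<a>` and `<i>` elements to be read separately, which is where an HTML parser becomes the better tool.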

Thank you very much!

[7/4/12]
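Sorry, to clarify: I am trying to extract text (as a string) and images (as a link or a BitmapImage) from the HTML, either by parsing the HTML directly or by converting it to XML first.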

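I am using HtmlAgilityPack from htmlagilitypack.codeplex.com together with the tutorial at 4guysfromrolla.com/articles/011211-1.aspx. I would still like to know whether there is a better solution for Metro-style apps, because HtmlAgilityPack lacks some support for them. For example, it has a method for converting HTML to XML, but WinRT no longer supports XmlTextReader from .NET.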
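If HtmlAgilityPack loads at all in the app, one way to sidestep the XML conversion is to query the parsed tree directly with LINQ. This is only a sketch: it assumes the build in use exposes `HtmlDocument.LoadHtml`, `HtmlNode.Descendants` and `GetAttributeValue`, and the `PageScraper` class and its method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class PageScraper
{
    // Finds the first <meta> whose given attribute equals the given value
    // and returns its content attribute, or null if there is no such tag.
    public static string GetMetaContent(HtmlDocument doc, string attribute, string value)
    {
        HtmlNode node = doc.DocumentNode.Descendants("meta")
            .FirstOrDefault(m => m.GetAttributeValue(attribute, "") == value);

        // DeEntitize turns entities such as &#039; back into characters.
        return node == null
            ? null
            : HtmlEntity.DeEntitize(node.GetAttributeValue("content", ""));
    }

    // Collects the src attribute of every <img> element that has one.
    public static List<string> GetImageUrls(HtmlDocument doc)
    {
        return doc.DocumentNode.Descendants("img")
            .Select(img => img.GetAttributeValue("src", ""))
            .Where(src => src.Length > 0)
            .ToList();
    }
}
```

After `var doc = new HtmlDocument(); doc.LoadHtml(html);`, calling `PageScraper.GetMetaContent(doc, "name", "description")` should give the description text and `PageScraper.GetMetaContent(doc, "property", "og:image")` the image link, which could then be wrapped in a `Uri` for a BitmapImage.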

Thanks again.

score 0 · Accepted Answer

Jerry, I suggest you use an RSS library rather than parsing this XML yourself. Take a look at RssToolkit.

Answered 2012-07-03T03:04:44.403
