c# - GetElementById from HttpWebResponse

Question

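Recently I've been having "fun" with web scraping. The site I want to use has no API, so I have no choice but to scrape it.

One of the problems I've run into is reading elements of the HTML tree (I mean tags, inner text, and similar things). I use HttpWebRequest and HttpWebResponse to send GET/POST requests to the server.

Given webResponse, I can read the HTML source this way: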

StreamReader sr = new StreamReader(webResponse.GetResponseStream(), Encoding.UTF8);
string sourceCode = sr.ReadToEnd();

What I need is the value of this input tag:

<form action="/file.php" method="post">
    <input name="abc" id="abc" type="hidden" value="some_random_value" />
</form>

How can I do that?

score 2 · Accepted Answer

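One approach is to parse the HTML with an HTML parser and then simply select the required element using XPath.

This is much cleaner than trying to pull the relevant markup out of an HTML string with regular expressions.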

http://htmlagilitypack.codeplex.com/
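
To tie this back to the question's setup, here is a minimal sketch of feeding the response stream straight into the parser instead of reading it into a string first. It assumes the HtmlAgilityPack NuGet package is referenced and that the page is UTF-8, as in the question. The class and method names (ScrapeSketch, Parse, LoadDocument) are placeholders for illustration, not part of the original post.

```csharp
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

static class ScrapeSketch
{
    // Hypothetical helper: parses an HTML stream into an HtmlDocument.
    public static HtmlDocument Parse(Stream stream)
    {
        var doc = new HtmlDocument();
        // Assumes UTF-8, matching the StreamReader in the question.
        doc.Load(stream, Encoding.UTF8);
        return doc;
    }

    // Hypothetical helper: sends a GET request and parses the response body.
    public static HtmlDocument LoadDocument(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";

        // Dispose the response and its stream once parsing is done.
        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            return Parse(stream);
        }
    }
}
```

The returned document can then be queried with XPath through doc.DocumentNode.SelectSingleNode.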

score 1 · Accepted Answer

I would use HtmlAgilityPack:

string html = @"<form action=""/file.php"" method=""post"">
                <input name=""abc"" id=""abc"" type=""hidden"" value=""some_random_value"" />
                </form>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// XPath: select the input whose id is "abc" and read its value attribute
var value1 = doc.DocumentNode.SelectSingleNode("//input[@id='abc']")
                             .Attributes["value"].Value;

// LINQ: same lookup without XPath (requires "using System.Linq;")
var value2 = doc.DocumentNode.Descendants("input")
                .First(i => i.Attributes["id"] != null && 
                            i.Attributes["id"].Value == "abc")
                .Attributes["value"].Value;
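
Both lookups above throw if the element or its attribute is missing: SelectSingleNode returns null when nothing matches, and First throws when the sequence has no match. Since the question asks for a GetElementById equivalent, a minimal sketch using HtmlAgilityPack's GetElementbyId (note the lowercase "b") together with GetAttributeValue may be a safer starting point. The class name HiddenFieldSketch and the helper GetValueById are made up for illustration.

```csharp
using HtmlAgilityPack;

static class HiddenFieldSketch
{
    // Hypothetical helper: returns the value attribute of the element with
    // the given id, or null when no such element exists in the document.
    public static string GetValueById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.GetElementbyId(id);
        if (node == null)
        {
            return null;
        }

        // Falls back to an empty string if the element has no value attribute.
        return node.GetAttributeValue("value", "");
    }
}
```

With the question's variables, this could be called as HiddenFieldSketch.GetValueById(sourceCode, "abc"), where sourceCode is the string read from the response.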
