我需要使用 HtmlAgilityPack 和 C# 解析这个 html 代码。我可以获得 div class="patent_bibdata" 节点,但我不知道如何循环通过子节点。
在这个示例中有 6 个 href,但我需要将它们分成两组;发明家,分类。我对最后两个不感兴趣。此 div 中可以有任意数量的 href。
如您所见,在两组之前有一段文字说明了 href 是什么。
代码片段
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = m_hw.Load("http://www.google.com/patents/US3748943");
string xpath = "/html/body/table[@id='viewport_table']/tr/td[@id='viewport_td']/div[@class='vertical_module_list_row'][1]/div[@id='overview']/div[@id='overview_v']/table[@id='summarytable']/tr/td/div[@class='patent_bibdata']";
HtmlNode node = m_doc.DocumentNode.SelectSingleNode(xpath);
那么你会怎么做呢?
<div class="patent_bibdata">
<b>Inventors</b>:
<a href="http://www.google.com/search?tbo=p&tbm=pts&hl=en&q=ininventor:%22Ronald+T.+Lashley%22">
Ronald T. Lashley
</a>,
<a href="http://www.google.com/search?tbo=p&tbm=pts&hl=en&q=ininventor:%22Ronald+T.+Lashley%22">
Ronald T. Lashley
</a><br>
<b>Current U.S. Classification</b>:
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200P">84/312.00P</a>;
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200R">84/312.00R</a><br>
<br>
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://patft.uspto.gov/netacgi/nph-Parser%3FSect2%3DPTO1%26Sect2%3DHITOFF%26p%3D1%26u%3D/netahtml/PTO/search-bool.html%26r%3D1%26f%3DG%26l%3D50%26d%3DPALL%26RefSrch%3Dyes%26Query%3DPN/3748943&usg=AFQjCNGKUic_9BaMHWdCZtCghtG5SYog-A">
View patent at USPTO</a><br>
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://assignments.uspto.gov/assignments/q%3Fdb%3Dpat%26pat%3D3748943&usg=AFQjCNGbD7fvsJjOib3GgdU1gCXKiVjQsw">
Search USPTO Assignment Database
</a><br>
</div>
想要的结果 InventorGroup =
<a href="http://www.google.com/search?tbo=p&tbm=pts&hl=en&q=ininventor:%22Ronald+T.+Lashley%22">
Ronald T. Lashley
</a>
<a href="http://www.google.com/search?tbo=p&tbm=pts&hl=en&q=ininventor:%22Ronald+T.+Lashley%22">
Thomas R. Lashley
</a>
分类组
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200P">84/312.00P</a>;
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200R">84/312.00R</a>
我试图抓取的页面:http ://www.google.com/patents/US3748943
// 安德斯
PS!我知道在这个页面中,发明者的名字是相同的,但在大多数情况下它们是不同的!