4

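I'm currently trying to parse an HTML document to retrieve all of the footnotes inside it; the document contains dozens of them. I can't figure out the expression to use to extract all of the content I want. The problem is that the classes (e.g. "calibre34") are randomized in every document. The only way to locate a footnote is to search for "hide": the footnote text always comes right after it and is closed with a </td> tag. Below is an example of one of the footnotes in the HTML document; all I want is the text. Any ideas? Thanks, guys!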

<td class="calibre33">1.<span><a class="x-xref" href="javascript:void(0);">
[hide]</a></span></td>
<td class="calibre34">
Among the other factors on which the premium would be based are the
average size of the losses experienced, a margin for contingencies,
a loading to cover the insurer's expenses, a margin for profit or
addition to the insurer's surplus, and perhaps the investment
earnings the insurer could realize from the time the premiums are
collected until the losses must be paid.</td>

2 Answers

4

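Load the HTML document with HtmlAgilityPack, then use this XPath to extract the footnotes:

//td[contains(., '[hide]')]/following-sibling::td[1]

Basically, it first selects every td node that contains [hide], and then selects the sibling that immediately follows it, i.e. the next td. The test uses contains(., '[hide]') rather than text()='[hide]' because in the sample the [hide] text sits inside a nested span and a element, not directly in the td. Once you have this collection of nodes, you can extract their inner text (in C#, HtmlAgilityPack supports this).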
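To make the steps above concrete, here is a minimal sketch of how this might look in C#. It assumes the HtmlAgilityPack NuGet package is installed; the class name FootnoteExtractor, the method name Extract, and the shortened sample markup are made up for illustration. The XPath matches on the cell's string value, so it also works when [hide] is nested inside other elements.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

static class FootnoteExtractor
{
    // Hypothetical helper: returns the text of the cell that follows each "[hide]" cell.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var footnotes = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(
            "//td[contains(., '[hide]')]/following-sibling::td[1]");
        if (cells == null)
        {
            return footnotes;
        }

        foreach (HtmlNode cell in cells)
        {
            // Decode entities such as &amp; and trim the surrounding whitespace.
            footnotes.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
        }
        return footnotes;
    }

    static void Main()
    {
        // Shortened stand-in for the markup in the question (assumption).
        const string html =
            "<table><tr>" +
            "<td class=\"calibre33\">1.<span><a class=\"x-xref\" href=\"javascript:void(0);\">\n[hide]</a></span></td>" +
            "<td class=\"calibre34\">\nSample footnote text.</td>" +
            "</tr></table>";

        foreach (string note in Extract(html))
        {
            Console.WriteLine(note);
        }
    }
}
```

Because the match does not depend on the class attribute, the randomized class names should not matter. If a document nests tables inside the footnote cells, the expression would likely need tightening.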

answered 2012-06-28T19:13:10.353
3

How about using MSHTML to parse the HTML source? Here is some demo code. Enjoy.

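using System;
using System.Collections.Generic;
using System.Net;
using mshtml; // requires a reference to the Microsoft.mshtml interop assembly

public class CHtmlPraseDemo
{
    private string strHtmlSource;
    public IHTMLDocument2 oHtmlDoc;

    public CHtmlPraseDemo(string url)
    {
        GetWebContent(url);
        // Create an in-memory MSHTML document and feed it the downloaded source.
        oHtmlDoc = (IHTMLDocument2)new HTMLDocument();
        oHtmlDoc.write(strHtmlSource);
        oHtmlDoc.close(); // finish writing so the document can be queried
    }

    // Returns the inner HTML of every <td> whose class equals TdClassName.
    public List<string> GetTdNodes(string TdClassName)
    {
        List<string> listOut = new List<string>();
        IHTMLElement2 ie = (IHTMLElement2)oHtmlDoc.body;
        IHTMLElementCollection iec = ie.getElementsByTagName("td");
        foreach (IHTMLElement item in iec)
        {
            if (item.className == TdClassName)
            {
                listOut.Add(item.innerHTML);
            }
        }
        return listOut;
    }

    // Downloads the page source into strHtmlSource.
    private void GetWebContent(string strUrl)
    {
        using (WebClient wc = new WebClient())
        {
            strHtmlSource = wc.DownloadString(strUrl);
        }
    }
}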

class Program
{
    static void Main(string[] args)
    {
        CHtmlPraseDemo oH = new CHtmlPraseDemo("http://stackoverflow.com/faq");

        // Print the page title, then the inner HTML of each matching <td>.
        Console.WriteLine(oH.oHtmlDoc.title);
        List<string> l = oH.GetTdNodes("x");
        foreach (string n in l)
        {
            Console.WriteLine("new td");
            Console.WriteLine(n);
        }

        Console.Read();
    }
}
answered 2012-06-28T19:44:16.033