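I'm building this parser for a scraped HTML file. The parser is supposed to extract the thread title, the posting user, and the total views. I managed to get the HTML tags, but the problem is that it fails to retrieve all the thread titles and only gets some of them.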

The HTML code (sorry for the poor alignment, I copied it from the site's source):

<tbody id="threadbits_forum_2">

<tr>
<td class="alt1" id="td_threadstatusicon_3396832">

    <img src="http://www.hardwarezone.com.sg/img/forums/hwz/statusicon/thread_hot.gif" id="thread_statusicon_3396832" alt="" border="" />
</td>

    <td class="alt2">&nbsp;</td>


<td class="alt1" id="td_threadtitle_3396832" title="Updated on 3 October 2011  

Please check Price Guides for latest prices 

 A PC Buyer&#8217;s Guide that is everything to everyone is simply not possible. This     is a simple guide to putting together a PC with a local flavour. Be sure to read PC Buyer&#8217;s Guide from other media.  

If you have any...">


    <div>

            <span style="float:right">






                 <img class="inlineimg" src="http://www.hardwarezone.com.sg/img/forums/hwz/misc/sticky.gif" alt="Sticky Thread" /> 
            </span>



        <font color=red><b>Sticky: </b></font>


        <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832" id="thread_title_3396832">Buyer's Guide II: Extreme, High-End, Mid-Range, Budget, and Entry Level Systems - Part 2</a>
        <span class="smallfont" style="white-space:nowrap">(<img class="inlineimg" src="http://www.hardwarezone.com.sg/img/forums/hwz/misc/multipage.gif" alt="Multi-page thread" border="0" />  <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832">1</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832&amp;page=2">2</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832&amp;page=3">3</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832&amp;page=4">4</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832&amp;page=5">5</a> ... <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;t=3396832&amp;page=17">Last Page</a>)</span>
    </div>



    <div class="smallfont">


            <span style="cursor:pointer" onclick="window.open('member.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&amp;u=39963', '_self')">adrianlee</span>

    </div>

My code so far:

    try (BufferedReader br = new BufferedReader(new FileReader(pageThread)))
    {
        String html = "";

        while (br.readLine() != null)
        {
            html += br.readLine() + "\n";
        }

        Document doc = Jsoup.parse(html);
        //To get the thread list

        Elements threadsList = doc.select("tbody[id^=threadbits_forum]").select("tr");

        for (Element e : threadsList)
        {
            //To get the title
            System.out.println("Title: " + e.select("a[id^=thread_title]").text());
        }

        System.exit(0);

    } catch (Exception e)
    {
        e.printStackTrace();
    }

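Result:

  • Title:
  • Title: Want to be part of the HardwareZone editorial team?
  • Title:
  • Title: pa9797 back to PC wa new Rig !
  • Title: [EPIC] Another first from Andyson, Platinum modular PSU
  • Title:
  • Title: Which shop in SLS is good for buying a new CPU? . . . soon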

Do you have a solution to this problem?

Thanks.

1 Answer

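When parsing a web page with Jsoup, you should first obtain the web document in the right way. Reading the file yourself is not wrong in principle, but you are making it harder for yourself than it has to be.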

To create a Document object for a web page, start with

String url = "http://www.google.com"; // Jsoup.connect needs the protocol (http:// or https://)
Document doc = Jsoup.connect(url).get();

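From this document you can make selections, for example the thread titles of a forum. Another example, taken straight from the cookbook, is selecting href links.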

Elements links = doc.select("a[href]"); //a with href

If you don't get the elements you want, then your selection is incorrect.
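When the input is a saved file, as in the question, it is also worth checking that the whole file reaches the parser. The loop in the question calls `readLine()` once in the `while` condition and again in the body, so every other line is discarded before Jsoup sees it, which may by itself make some titles disappear. A minimal sketch of a loop that reads each line exactly once, using only the standard library (the class name `ReadAllLines` and the sample input are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class, named for illustration only.
class ReadAllLines {

    // Reads every line exactly once: readLine() is called a single time
    // per iteration, and that same result is tested and then appended.
    static String readAll(Reader source) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                html.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the saved page; a FileReader would be passed instead.
        String sample = "<tr>first</tr>\n<tr>second</tr>\n<tr>third</tr>";
        System.out.print(readAll(new StringReader(sample)));
    }
}
```

Alternatively, `Jsoup.parse(File in, String charsetName)` reads a saved file directly, so no manual loop is needed at all.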

Here, you select all <tr> elements inside the <tbody> elements whose ID starts with threadbits_forum.

Elements threadsList = doc.select("tbody[id^=threadbits_forum]").select("tr");

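Since I don't know which forum you are trying to parse, I could only look at other threadbits forums that are likely to have a similar HTML layout.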

If you look at this site, http://forums.hardwarezone.com.sg/corbell-ecustomer-service-center-166/, you will see that all the threads are in a <td> with the class alt1.

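If you select only this class, you will also get the usernames and other content that use the same class. Since you only want the thread titles, you also have to select the <a> tag.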

This leads to the following select query:

Elements titles = doc.select("td.alt1  a[id^=thread_title]");
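The selector can be tried offline on a small inline fragment before it is pointed at the live forum. The sketch below assumes the jsoup library is on the classpath. The class name `ThreadTitleSelector` and the HTML fragment are made up, reduced from the markup in the question:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; requires the jsoup library on the classpath.
class ThreadTitleSelector {

    // Returns the text of every thread-title link inside a td.alt1 cell.
    static List<String> titles(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("td.alt1 a[id^=thread_title]");
        List<String> result = new ArrayList<>();
        for (Element link : links) {
            result.add(link.text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up fragment that mirrors the structure of the forum markup.
        String html =
            "<table><tbody id=\"threadbits_forum_2\">"
          + "<tr><td class=\"alt1\" id=\"td_threadtitle_1\">"
          + "<a href=\"showthread.php?t=1\" id=\"thread_title_1\">First thread</a>"
          + "<a href=\"showthread.php?t=1&amp;page=2\">2</a>"
          + "</td></tr>"
          + "<tr><td class=\"alt1\" id=\"td_threadtitle_2\">"
          + "<a href=\"showthread.php?t=2\" id=\"thread_title_2\">Second thread</a>"
          + "</td></tr>"
          + "</tbody></table>";
        for (String title : titles(html)) {
            System.out.println(title);
        }
    }
}
```

The page-number link in the first row has no `id`, so the `[id^=thread_title]` part of the selector filters it out and only the two title links are printed.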

To mimic your original problem on this forum, you can do this:

    String html = "http://forums.hardwarezone.com.sg/corbell-ecustomer-service-center-166/";
    Document doc = Jsoup.connect(html).get();
    Elements titles = doc.select("td.alt1  a[id^=thread_title]");
    for (Element e : titles) {
        System.out.println(e.text());
    }

This results in the titles:

Corbell Product Warranty Policy
=MSI TwinFrozr user OC database=
Notification : Change in MSI notebook service center
RE : Forums contact window and sales & RMA reserved items @ service center :
Corbell office location. (Thanks to tayts1)
[Corbell] 'Like' Our Facebook Page and Get A chance To Win Attractive Prizes
MSI R7970 lightning problem (1 fan not spinning)
...
...

Hope this helps you get your selection right!

answered 2013-08-05T21:34:03.820