html - Scraping a forum with Jsoup

Question

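I am using Jsoup to scrape an online forum. How can I scrape only the main post, without the text quoted from other commenters?

What I managed to scrape: carey wrote: Yup, CC usually got discounts, especially for petrol and makan... The black DBS debit card when used at petrol kiosk can get discount ? I always pay cash because no cc .

What I want: The black DBS debit card when used at petrol kiosk can get discount ? I always pay cash because no cc .

Here is the HTML: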

<div id="post_message_63989045">
  <div class="quote"> 
    <span class="byline"> <a href="/eat-drink-man-woman-16/life-without-credit-cards-3601620-post63982949.html#post63982949" rel="nofollow"><img class="inlineimg" src="http://www.hardwarezone.com.sg/img/forums/hwz/buttons/viewpost.gif" border="0" alt="View Post" /></a> <strong>carey</strong> wrote: </span> 
     <blockquote cite="showthread.php?p=63982949#post63982949">
        Yup, CC usually got discounts, especially for petrol and makan...
        <br /> 
        <br /> So those without a CC are being penalized 
        <img src="http://www.hardwarezone.com.sg/img/forums/hwz/smilies/eek.gif" border="0" alt="" title="EEK!" class="inlineimg" /> 
     </blockquote> 
  </div>The black DBS debit card when used at petrol kiosk can get discount ?
  <br /> 
  <br /> I always pay cash because no cc . 
  <img src="http://www.hardwarezone.com.sg/img/forums/hwz/smilies/frown.gif" border="0" alt="" title="Frown" class="inlineimg" />
</div>

score 1 · Accepted Answer

You can simply filter out the <div>s that have the class "quote", provided whatever you use for scraping actually parses the HTML markup.
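With Jsoup, that filtering might look roughly like the sketch below. The class name QuoteFilter, the method postWithoutQuotes, and the shortened sample markup are made up for illustration; only the "quote" class and the post id come from the question's HTML.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class QuoteFilter {
    // Hypothetical helper: returns the post's text after dropping quoted blocks.
    static String postWithoutQuotes(String html, String postId) {
        Document doc = Jsoup.parse(html);
        Element post = doc.getElementById(postId);
        if (post == null) {
            return "";
        }
        // Remove every <div class="quote"> inside the post from the parsed tree.
        post.select("div.quote").remove();
        // text() now only sees what the poster wrote.
        return post.text();
    }

    public static void main(String[] args) {
        // Shortened version of the markup in the question (an assumption).
        String html = "<div id=\"post_message_63989045\">"
                + "<div class=\"quote\"><strong>carey</strong> wrote: "
                + "<blockquote>Yup, CC usually got discounts</blockquote></div>"
                + "The black DBS debit card when used at petrol kiosk can get discount ?"
                + "<br /><br /> I always pay cash because no cc . "
                + "</div>";
        System.out.println(postWithoutQuotes(html, "post_message_63989045"));
    }
}
```

Note that remove() changes the parsed document, so if the quotes are needed later, parse the page again or clone the element first.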

score 0 · Accepted Answer

If you can use XPath, you can simply query for all text nodes that are direct children of the post:

//div[@id="post_message_63989045"]/text()

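The quote is ignored because its text is a child of the quote div, not of the post itself. (The same probably goes for the contents of any code tags someone posts.)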
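To try the query without extra libraries, here is a minimal sketch using the JDK's own XML parser and javax.xml.xpath instead of Jsoup. It assumes the input is well-formed XML, which the shortened sample below happens to be; a real forum page usually is not and would first have to be converted to a W3C DOM by an HTML parser. The names XPathOwnText and directText are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathOwnText {
    // Hypothetical helper: joins the text nodes matched by the XPath expression.
    static String directText(String xhtml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getNodeValue());
            }
            // Collapse runs of whitespace left over from the markup layout.
            return sb.toString().replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        // Shortened, well-formed version of the markup in the question (an assumption).
        String xhtml = "<div id=\"post_message_63989045\">"
                + "<div class=\"quote\"><strong>carey</strong> wrote: "
                + "<blockquote>Yup, CC usually got discounts</blockquote></div>"
                + "The black DBS debit card when used at petrol kiosk can get discount ?"
                + "<br /><br /> I always pay cash because no cc . "
                + "</div>";
        System.out.println(
                directText(xhtml, "//div[@id=\"post_message_63989045\"]/text()"));
    }
}
```

Only the two text nodes that sit directly under the post div are matched; everything inside the quote div, including the blockquote, is skipped.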

score 0 · Accepted Answer

comment.ownText()

Gets the text owned by this element only; it does not combine the text of all its children.
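As a rough sketch of how this could be applied to a whole thread page: the selector div[id^=post_message_] is an assumption based on the id pattern in the question's HTML, and OwnTextDemo and ownTexts are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class OwnTextDemo {
    // Hypothetical helper: one entry per post, containing only the post's own text.
    static List<String> ownTexts(String html) {
        List<String> posts = new ArrayList<>();
        // Assumed id pattern: every post body is a div whose id starts with "post_message_".
        for (Element comment : Jsoup.parse(html).select("div[id^=post_message_]")) {
            // ownText() skips text inside child elements such as the quote div.
            posts.add(comment.ownText());
        }
        return posts;
    }

    public static void main(String[] args) {
        // Shortened sample markup (an assumption), with one quoting post and one plain post.
        String html = "<div id=\"post_message_1\">"
                + "<div class=\"quote\"><strong>carey</strong> wrote: "
                + "<blockquote>Yup, CC usually got discounts</blockquote></div>"
                + "The black DBS debit card when used at petrol kiosk can get discount ?"
                + "<br /><br /> I always pay cash because no cc . "
                + "</div>"
                + "<div id=\"post_message_2\">Second post</div>";
        for (String post : ownTexts(html)) {
            System.out.println(post);
        }
    }
}
```

Unlike removing the quote div, this leaves the parsed document untouched, but it also drops text that the poster wrapped in inline tags such as <b> or <a>.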

Answered 2012-07-18T00:58:34.473
