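I am trying to crawl the Apache mailing list archives with Crawler4j to fetch all archived messages. I provide a seed URL and try to collect the links to other messages. However, it does not seem to extract all of the links.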

Here is the HTML of my seed page ( http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E ):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Re: some healthy broker disappear from zookeeper</title>
  <link rel="stylesheet" type="text/css" href="/archives/style.css" />
 </head>

 <body id="archives">
  <h1>kafka-users mailing list archives</h1>

  <h5>
<a href="http://mail-archives.apache.org/mod_mbox/" title="Back to the archives depot">Site index</a> &middot; <a href="/mod_mbox/kafka-users" title="Back to the list index">List index</a></h5>  <table class="static" id="msgview">
   <thead>
    <tr>
    <th class="title">Message view</th>
    <th class="nav"><a href="/mod_mbox/kafka-users/201211.mbox/%3cCAFbh0Q1uJGxQiH15a7xS+pCwq+Jft9yKhb66t_C78UrMX338mQ@mail.gmail.com%3e" title="Previous by date">&laquo;</a> <a href="/mod_mbox/kafka-users/201211.mbox/date" title="View messages sorted by date">Date</a> <a href="/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczTexvtqSK+nmauvj37vhTF31awzeegpWdk6eZ-+LaGTVw@mail.gmail.com%3e" title="Next by date">&raquo;</a> &middot; <a href="/mod_mbox/kafka-users/201211.mbox/%3cCAPkvrnkTEtfFhYnCMj=xMs58pFU1sy-9sJuJ6e19mGVVipRg0A@mail.gmail.com%3e" title="Previous by thread">&laquo;</a> <a href="/mod_mbox/kafka-users/201211.mbox/thread" title="View messages sorted by thread">Thread</a> <a href="/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczS5JOCA+QpgLU+tXeG=Ke_MXxiG_PinMt0YDxGBtz5nPg@mail.gmail.com%3e" title="Next by thread">&raquo;</a></th>
   </tr>
   </thead>

   <tfoot>
    <tr>
    <th class="title"><a href="#archives">Top</a></th>
    <th class="nav"><a href="/mod_mbox/kafka-users/201211.mbox/%3cCAFbh0Q1uJGxQiH15a7xS+pCwq+Jft9yKhb66t_C78UrMX338mQ@mail.gmail.com%3e" title="Previous by date">&laquo;</a> <a href="/mod_mbox/kafka-users/201211.mbox/date" title="View messages sorted by date">Date</a> <a href="/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczTexvtqSK+nmauvj37vhTF31awzeegpWdk6eZ-+LaGTVw@mail.gmail.com%3e" title="Next by date">&raquo;</a> &middot; <a href="/mod_mbox/kafka-users/201211.mbox/%3cCAPkvrnkTEtfFhYnCMj=xMs58pFU1sy-9sJuJ6e19mGVVipRg0A@mail.gmail.com%3e" title="Previous by thread">&laquo;</a> <a href="/mod_mbox/kafka-users/201211.mbox/thread" title="View messages sorted by thread">Thread</a> <a href="/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczS5JOCA+QpgLU+tXeG=Ke_MXxiG_PinMt0YDxGBtz5nPg@mail.gmail.com%3e" title="Next by thread">&raquo;</a></th>
   </tr>
   </tfoot>

   <tbody>
   <tr class="from">
    <td class="left">From</td>
    <td class="right">Neha Narkhede &lt;neha.narkh...@gmail.com&gt;</td>
   </tr>
   <tr class="subject">
    <td class="left">Subject</td>
    <td class="right">Re: some healthy broker disappear from zookeeper</td>
   </tr>
   <tr class="date">
    <td class="left">Date</td>
    <td class="right">Tue, 20 Nov 2012 19:01:56 GMT</td>
   </tr>
   <tr class="contents"><td colspan="2"><pre>
zookeeper server version is 3.3.3 is pretty buggy and has known
session expiration and unexpected ephemeral node deletion bugs.
Please upgrade to 3.3.4 and retry.

Thanks,
Neha

On Tue, Nov 20, 2012 at 10:42 AM, Xiaoyu Wang &lt;xwang@rocketfuel.com&gt; wrote:
&gt; Hello everybody,
&gt;
&gt; We have run into this problem a few times in the past week. The symptom is
&gt; some broker disappear from zookeeper. The broker appears to be healthy.
&gt; After that, producers start producing lots of ZK producer cache stale log
&gt; and stop making any progress.
&gt;  "logger.info("Try #" + numRetries + " ZK producer cache is stale.
&gt; Refreshing it by reading from ZK again")"
&gt;
&gt; We are running kafka 0.7.1 and the zookeeper server version is 3.3.3.
&gt;
&gt; The missing broker will show up in zookeeper after we restart it. My
&gt; question is
&gt;
&gt;    1. Did anyone encounter the same problem? how did you fix it?
&gt;    2. Why producer is not making any progress? Can we make the producer
&gt;    work with those brokers that are listed in zookeeper.
&gt;
&gt;
&gt; Thanks,
&gt;
&gt; -Xiaoyu

</pre></td></tr>
   <tr class="mime">
    <td class="left">Mime</td>
    <td class="right">
<ul>
<li><a rel="nofollow" href="/mod_mbox/kafka-users/201211.mbox/raw/%3cCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3=Ao_J8Linhpnc+6y7tOcxg@mail.gmail.com%3e/">Unnamed text/plain</a> (inline, None, 1037 bytes)</li>
</ul>
</td>
</tr>
   <tr class="raw">
    <td class="left"></td>
    <td class="right"><a href="/mod_mbox/kafka-users/201211.mbox/raw/%3cCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3=Ao_J8Linhpnc+6y7tOcxg@mail.gmail.com%3e" rel="nofollow">View raw message</a></td>
   </tr>
   </tbody>
  </table>
 </body>
</html>

These are the outgoing URLs that Crawler4j identifies:

http://mail-archives.apache.org/archives/style.css
http://mail-archives.apache.org/mod_mbox/
http://mail-archives.apache.org/mod_mbox/kafka-users
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/date
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/date
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread

However, the URLs I am interested in are missing:

http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAFbh0Q1uJGxQiH15a7xS+pCwq+Jft9yKhb66t_C78UrMX338mQ@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczTexvtqSK+nmauvj37vhTF31awzeegpWdk6eZ-+LaGTVw@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAPkvrnkTEtfFhYnCMj=xMs58pFU1sy-9sJuJ6e19mGVVipRg0A@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczS5JOCA+QpgLU+tXeG=Ke_MXxiG_PinMt0YDxGBtz5nPg@mail.gmail.com%3e

What am I doing wrong? How can I get Crawler4j to extract the URLs I need?

2 Answers

Note that the archive offers a direct link to the mbox file for downloading the mailing list. In your case, just wget this file; no crawler is needed:

http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox
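
If you would rather stay in Java than call wget, the same download could be sketched roughly as follows. The names `MboxDownloader`, `mboxUrl`, and `download` are hypothetical, and the `<list>/<yyyyMM>.mbox` URL pattern is inferred only from the single link above, so treat it as an assumption rather than a documented scheme. The sketch also does not handle redirects across protocols or check that the archive still serves this address.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; not part of Crawler4j or of the archive site.
class MboxDownloader {
    // Archive root taken from the link in this answer.
    static final String ARCHIVE_ROOT = "http://mail-archives.apache.org/mod_mbox/";

    // Builds the monthly mbox URL (assumed pattern: <root><list>/<yyyyMM>.mbox).
    static String mboxUrl(String list, String yearMonth) {
        if (!yearMonth.matches("\\d{6}")) {
            throw new IllegalArgumentException("expected yyyyMM, got: " + yearMonth);
        }
        return ARCHIVE_ROOT + list + "/" + yearMonth + ".mbox";
    }

    // Streams the remote file to disk, replacing any existing file.
    static void download(String url, Path target) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        String url = mboxUrl("kafka-users", "201211");
        download(url, Paths.get("kafka-users-201211.mbox"));
    }
}
```

Looping `mboxUrl` over the months you care about would fetch the whole list one file at a time, without parsing any HTML.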

answered 2014-12-17T11:46:11.933

You may have provided the wrong seed page. I think your seed page should be:

http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread

Then use:

public boolean shouldVisit(WebURL url) {
    // href is lower-cased, so the literal it is compared with must be lower-case too.
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && href.contains("http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cca");
}
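
One detail to watch in a filter like this: when the URL is lower-cased before the comparison, the text it is compared with has to be lower-case as well, or nothing will ever match. Below is a minimal, library-free sketch of that check so it can be tried without Crawler4j. `MessageUrlFilter`, `isMessageUrl`, and the extension list in `FILTERS` are assumptions made for illustration. The prefix here stops at `%3c`, so it accepts any message ID in that month rather than only IDs starting with `CA`.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stand-alone version of the shouldVisit check; not Crawler4j API.
class MessageUrlFilter {
    // Assumed extension filter, standing in for the FILTERS field used above.
    static final Pattern FILTERS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip|gz)$");

    // Lower-case prefix shared by message pages of the 2012-11 archive.
    static final String MESSAGE_PREFIX =
            "http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3c";

    // True when the URL looks like a single-message page of that month.
    static boolean isMessageUrl(String url) {
        // Normalise case first so "%3C" and "%3c" are treated alike.
        String href = url.toLowerCase(Locale.ROOT);
        return !FILTERS.matcher(href).matches() && href.startsWith(MESSAGE_PREFIX);
    }

    public static void main(String[] args) {
        System.out.println(isMessageUrl(
                "http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/date"));
    }
}
```

With this check, the `date`, `thread`, and `raw/` pages are rejected because they do not start with the message prefix, while both the `%3c` and `%3C` spellings of a message URL are accepted.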

I hope this helps.

answered 2014-04-14T14:33:34.990