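I have a website that runs on moderngov.co.uk (you send them a template, and they upload it). I am trying to crawl this site so that it can be indexed by Solr and searched through a Drupal site. I can crawl the vast majority of websites, but for some reason I cannot crawl this one: http://scambs.moderngov.co.uk/uuCoverPage.aspx?bcr=1
The specific output I get is: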
Injector: starting at 2013-10-17 13:32:47
Injector: crawlDb: X-X/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-10-17 13:32:50, elapsed: 00:00:02
Thu, Oct 17, 2013 1:32:50 PM : Iteration 1 of 2
Generating a new segment
Generator: starting at 2013-10-17 13:32:51
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
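I am not sure whether this has to do with the regex patterns Nutch uses to parse HTML, whether a redirect is causing the problem, or whether it is something else entirely. Below are some of the Nutch configuration files:
Here are the URL filters: http://pastebin.com/ZqeZUJa1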
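The log above reports "total number of urls rejected by filters: 1", so one quick check is to run the seed URL through the filter rules by hand. Below is a minimal Python sketch of that check. It assumes rules modeled on the stock regex-urlfilter.txt that ships with Nutch 1.x (the image-suffix rule is left out for brevity); the actual file on pastebin may differ. The names ASSUMED_RULES, first_matching_rule and is_accepted are made up for illustration and are not part of Nutch. Rules are tried top to bottom, and the first pattern found anywhere in the URL decides the outcome.

```python
import re

# ASSUMPTION: these rules mirror the default regex-urlfilter.txt of Nutch 1.x.
# The poster's real filter file (on pastebin) may contain different rules.
ASSUMED_RULES = [
    r"-^(file|ftp|mailto):",            # skip file:, ftp: and mailto: URLs
    r"-[?*!@=]",                        # skip URLs that look like queries
    r"-.*(/[^/]+)/[^/]+\1/[^/]+\1/",    # skip URLs with a repeating path segment
    r"+.",                              # accept anything else
]


def first_matching_rule(url, rules=ASSUMED_RULES):
    """Return the first rule whose pattern is found in the URL, or None."""
    for rule in rules:
        pattern = rule[1:]  # drop the leading '+' or '-' sign
        if re.search(pattern, url):
            return rule
    return None


def is_accepted(url, rules=ASSUMED_RULES):
    """A URL is accepted only if the first matching rule starts with '+'."""
    rule = first_matching_rule(url, rules)
    return rule is not None and rule.startswith("+")


SEED = "http://scambs.moderngov.co.uk/uuCoverPage.aspx?bcr=1"
print("first matching rule:", first_matching_rule(SEED))
print("accepted:", is_accepted(SEED))
```

If the real filter file still contains a rule like the second one, the "?" and "=" in the seed URL would be enough to reject it at injection time, which would match the "0 records selected for fetching" line in the log.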
System info: Windows 7 (64-bit), Solr 3.6.2, Apache Nutch 1.7
If anyone has run into this before, or knows why it might be happening, any help would be greatly appreciated.
Thanks