
I have a website running on moderngov.co.uk (you send them a template and they upload it). I am trying to crawl this site so that it can be indexed by Solr and searched through a Drupal site. I can crawl the vast majority of websites, but for some reason I cannot crawl this one: http://scambs.moderngov.co.uk/uuCoverPage.aspx?bcr=1

The specific error I am getting is:

Injector: starting at 2013-10-17 13:32:47
Injector: crawlDb: X-X/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-10-17 13:32:50, elapsed: 00:00:02
Thu, Oct 17, 2013 1:32:50 PM : Iteration 1 of 2
Generating a new segment
Generator: starting at 2013-10-17 13:32:51
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...

I am not sure whether it is related to the regex patterns Nutch uses to parse the HTML, whether there is a redirect causing the problem, or whether it is something else entirely. Here are some of the Nutch config files:

Here are the urlfilters: http://pastebin.com/ZqeZUJa1
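(For reference, the stock regex-urlfilter.txt that ships with Nutch contains a rule rejecting any URL with query-string characters, which would explain the injector rejecting a seed URL ending in ?bcr=1. A sketch of the relevant default lines, not the contents of the pastebin above:

# Default rule: skip URLs containing these characters, which rejects
# any URL with a query string such as ?bcr=1. Comment it out, or
# narrow it, to let such URLs through.
-[?*!@=]

# Accept anything else
+.
)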

System info: Windows 7 (64-bit), Solr 3.6.2, Apache Nutch 1.7

If anyone has run into this before, or might know why this is happening, any help would be greatly appreciated.

Thanks


2 Answers


I tried that seed URL and got this error:

Denied by robots.txt: http://scambs.moderngov.co.uk/uuCoverPage.aspx?bcr=1

Looking at the site's robots.txt file:

# Disallow all webbot searching 
User-agent: *
Disallow: /

You will have to set a specific user agent in Nutch and have the website modified to accept crawls from your user agent.

The property to change in Nutch is in conf/nutch-site.xml:

<property>
  <name>http.agent.name</name>
  <value>nutch</value>
</property>
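On the website side, the site's robots.txt would then need an explicit allow rule for that agent. A minimal sketch, assuming the agent name "nutch" set above:

# Let the Nutch crawler through while still blocking everyone else;
# an empty Disallow means nothing is disallowed for that agent.
User-agent: nutch
Disallow:

User-agent: *
Disallow: /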
Answered 2013-10-20T21:35:54.107

Try this:

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>10</value>
  <description>The default number of seconds between re-fetches of a page
  (set very low here instead of the usual 30 days).
  </description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <!-- for now always re-fetch everything -->
  <value>100</value>
  <description>The maximum number of seconds between re-fetches of a page.
  After this period every page in the db will be re-tried, no matter
  what its status is.
  </description>
</property>
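These go in conf/nutch-site.xml as well. After editing the configuration, the crawl has to be re-run for the new settings to take effect. A minimal sketch using the Nutch 1.x one-step crawl command (the seed directory "urls" and output directory "crawl" are assumptions matching the log above):

# Re-inject the seed and crawl again with the updated configuration
bin/nutch crawl urls -dir crawl -depth 2 -topN 50000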
Answered 2014-11-25T08:17:49.773