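I have configured Nutch/Solr 1.6 to crawl and index, every 12 hours, an intranet with about 4000 documents and HTML pages.
If I run the crawler against an empty database, the process takes about 30 minutes. After the crawl has been running for a few days, it becomes very slow. Looking at the log file, it seems that tonight the last step (SolrIndexer) started 1 hour and 20 minutes into the run and then took a little over 1 hour itself.
Since the number of indexed documents is not growing, I am wondering why it is so slow now.
Nutch is run with the following command: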
bin/nutch crawl -urlDir urls -solr http://localhost:8983/solr -dir nutchdb -depth 15 -topN 3000
nutch-site.xml contains:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>Internet Site Agent</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata|more|http-header)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<!-- Used only if plugin parse-metatags is enabled. -->
<property>
<name>metatags.names</name>
<value>description;keywords;published;modified</value>
<description> Names of the metatags to extract, separated by ';'.
Use '*' to extract all metatags. Prefixes the names with 'metatag.'
in the parse-metadata. For instance to index description and keywords,
you need to activate the plugin index-metadata and set the value of the
parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
</description>
</property>
<property>
<name>index.parse.md</name>
<value>metatag.description,metatag.keywords,metatag.published,metatag.modified</value>
<description> Comma-separated list of keys to be taken from the parse metadata to generate fields.
Can be used e.g. for 'description' or 'keywords' provided that these values are generated
by a parser (see parse-metatags plugin)
</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>Set this to false if you start crawling your website from
for example http://www.example.com but you would like to crawl
xyz.example.com. Set it to true otherwise if you want to exclude external links
</description>
</property>
<property>
<name>http.content.limit</name>
<value>10000000</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<property>
<name>fetcher.max.crawl.delay</name>
<value>1</value>
<description>
If the Crawl-Delay in robots.txt is set to greater than this value (in
seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the
amount of time retrieved from robots.txt Crawl-Delay, however long that
might be.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>1.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>http.redirect.max</name>
<value>0</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, fetcher won't immediately
follow redirected URLs, instead it will record them for later fetching.
</description>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>2</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time. Replaces
deprecated parameter 'fetcher.threads.per.host'.
</description>
</property>
<property>
<name>link.delete.gone</name>
<value>true</value>
<description>Whether to delete gone pages from the web graph.</description>
</property>
<property>
<name>link.loops.depth</name>
<value>20</value>
<description>The depth for the loops algorithm.</description>
</property>
<!-- moreindexingfilter plugin properties -->
<property>
<name>moreIndexingFilter.indexMimeTypeParts</name>
<value>false</value>
<description>Determines whether the index-more plugin will split the mime-type
in sub parts, this requires the type field to be multi valued. Set to true for backward
compatibility. False will not split the mime-type.
</description>
</property>
<property>
<name>moreIndexingFilter.mapMimeTypes</name>
<value>false</value>
<description>Determines whether MIME-type mapping is enabled. It takes a
plain text file with mapped MIME-types. With it the user can map both
application/xhtml+xml and text/html to the same target MIME-type so it
can be treated equally in an index. See conf/contenttype-mapping.txt.
</description>
</property>
<!-- Fetch Schedule Configuration -->
<property>
<name>db.fetch.interval.default</name>
<!-- for now always re-fetch everything -->
<value>10</value>
<description>The default number of seconds between re-fetches of a page (less than 1 day).
</description>
</property>
<property>
<name>db.fetch.interval.max</name>
<!-- for now always re-fetch everything -->
<value>10</value>
<description>The maximum number of seconds between re-fetches of a page
(less than one day). After this period every page in the db will be re-tried, no
matter what is its status.
</description>
</property>
<!--property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
<description>The implementation of fetch schedule. DefaultFetchSchedule simply
adds the original fetchInterval to the last fetch time, regardless of
page changes.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.inc_rate</name>
<value>0.4</value>
<description>If a page is unmodified, its fetchInterval will be
increased by this rate. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.dec_rate</name>
<value>0.2</value>
<description>If a page is modified, its fetchInterval will be
decreased by this rate. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.min_interval</name>
<value>60.0</value>
<description>Minimum fetchInterval, in seconds.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.max_interval</name>
<value>31536000.0</value>
<description>Maximum fetchInterval, in seconds (365 days).
NOTE: this is limited by db.fetch.interval.max. Pages with
fetchInterval larger than db.fetch.interval.max
will be fetched anyway.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.sync_delta</name>
<value>true</value>
<description>If true, try to synchronize with the time of page change.
by shifting the next fetchTime by a fraction (sync_rate) of the difference
between the last modification time, and the last fetch time.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.sync_delta_rate</name>
<value>0.3</value>
<description>See sync_delta for description. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.sync_delta_rate</name>
<value>0.3</value>
<description>See sync_delta for description. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property-->
<property>
<name>fetcher.threads.fetch</name>
<value>1</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/apache-nutch/tmp/</value>
</property>
<!-- Boilerpipe -->
<property>
<name>tika.boilerpipe</name>
<value>true</value>
</property>
<property>
<name>tika.boilerpipe.extractor</name>
<value>ArticleExtractor</value>
</property>
</configuration>
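As you can see, I have configured Nutch to always re-fetch all documents. Because the site is small, re-fetching everything should be fine for now (the first run takes only 30 minutes...).
I noticed that roughly 40 new segments are created every day in the crawldb/segments folder. The on-disk size of the database is, of course, growing very fast.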
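To put numbers on this, here is a minimal sketch of how the segments could be counted per day and the disk usage checked. It rests on two assumptions: that segment directories are named with a yyyyMMddHHmmss timestamp, and that they live under nutchdb/segments (derived from the -dir nutchdb option in the command above). The function name count_segments_per_day is just a hypothetical helper, and the path may need adjusting to the actual layout.

```shell
#!/bin/sh
# Sketch: count Nutch segment directories per day and report disk usage.
# Assumption: each segment directory is named yyyyMMddHHmmss.

count_segments_per_day() {
    seg_dir="$1"
    for d in "$seg_dir"/*/; do
        # Skip the unexpanded glob when the directory has no subdirectories.
        [ -d "$d" ] || continue
        # Keep only the date part (first 8 characters) of the segment name.
        basename "$d" | cut -c1-8
    done | sort | uniq -c | awk '{ print $2, $1 }'
}

# Assumed default location; pass another path as the first argument.
SEG_DIR="${1:-nutchdb/segments}"
if [ -d "$SEG_DIR" ]; then
    count_segments_per_day "$SEG_DIR"   # one line per day: "<date> <count>"
    du -sh "$SEG_DIR"                   # total size of all segments
fi
```

Running it once a day for a few days should show whether the number of segments per day is stable and how quickly the total size grows.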
Is this the expected behaviour? Is there something wrong with the configuration?