I am a beginner programmer, designing a spider that crawls pages. The logic goes like this (a rough sketch of the loop follows the list):

  • get $url with curl
  • create a DOM document
  • parse out the href tags using XPath
  • store the href attributes in $totalurls (if they aren't already there)
  • update $url from $totalurls
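
Simplified, the loop looks something like this ($url and $totalurls are the variables from my real code; everything else below is just an illustrative sketch, not my exact code):

$url       = 'http://example.com/';   // hypothetical start page
$totalurls = array($url);
$crawled   = 0;

while ($url !== null) {
    // 1. get $url with curl
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // 2. create a DOM document (suppress warnings from malformed HTML)
    $dom = new DOMDocument();
    if ($html !== false && @$dom->loadHTML($html)) {
        // 3. parse out the href attributes using XPath
        $xpath = new DOMXPath($dom);
        foreach ($xpath->query('//a[@href]') as $a) {
            $href = $a->getAttribute('href');
            // 4. store hrefs in $totalurls that aren't already there
            if (!in_array($href, $totalurls)) {
                $totalurls[] = $href;
            }
        }
    }

    // 5. update $url to the next uncrawled entry in $totalurls
    $crawled++;
    $url = isset($totalurls[$crawled]) ? $totalurls[$crawled] : null;
}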

The problem is that after the 10th crawled page the spider says it does not find ANY links on the page, not a single one on the next, and so on.

But if I start with the page that was 10th in the previous run, it finds all links with no problem, and then breaks again after 10 URLs have been crawled.

Any idea what might cause this? My guess is something with DOMDocument, maybe; I am not 100% familiar with it. Or can storing too much data cause trouble? It could be a really basic beginner issue, because I am brand new - AND clueless. Please give me some advice on where to look for the problem.

1 Answer

My guess is that your script is timing out after 30 or 60 seconds (the PHP default). That can be overridden with set_time_limit($num_of_seconds); or you can change max_execution_time in your php.ini, or, if you are on a hosted server, you may be able to change those values through your host's PHP settings panel (or whatever it is called).
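
For example (a minimal sketch; 0 means "no limit", which is handy while debugging a long crawl but should be replaced with a sensible cap later):

// Lift the time limit for this script only (0 = no limit).
set_time_limit(0);

// Or override the php.ini value at runtime instead:
ini_set('max_execution_time', 300); // 5 minutes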

Also, you may want to add this to the top of the page:

error_reporting(E_ALL);
ini_set("display_errors", 1);

and check your error log to see whether there are any messages related to your spider.
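
If you are not sure where that log lives, you can also point PHP at a file of your own while you debug (the path below is only an example):

// Route PHP errors to a file the script can write to, so they are easy to find.
ini_set('log_errors', '1');
ini_set('error_log', __DIR__ . '/spider_errors.log'); // example path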

answered 2013-02-01 at 02:38:45