I am a beginner programmer, designing a spider that crawls pages. The logic goes like this (a simplified sketch follows the list):
- fetch $url with curl
- create a DOMDocument from the response
- parse out href attributes with XPath
- store hrefs in $totalurls (only ones that aren't already there)
- update $url with the next entry from $totalurls
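
To make the flow concrete, here is a minimal sketch of what the loop looks like. This is simplified, and the `fetchPage()` helper and the example.com seed URL are just placeholders, not my exact code:

```php
<?php
// Simplified sketch of the crawl loop described above.
// fetchPage() and the seed URL are placeholders, not my exact code.

function fetchPage($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; // HTML string, or false on failure
}

$totalurls = ['http://example.com/']; // placeholder seed URL

for ($i = 0; $i < count($totalurls); $i++) {
    $url  = $totalurls[$i];
    $html = fetchPage($url);
    if ($html === false) {
        continue; // fetch failed, move on to the next URL
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        if (!in_array($href, $totalurls, true)) {
            $totalurls[] = $href; // store only hrefs that aren't already there
        }
    }
}
```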
The problem is that after the 10th crawled page, the spider stops finding ANY links: none on that page, none on the next, and so on.
But if I start from the page that was 10th in the previous run, it finds all links with no problem, then breaks again after 10 URLs crawled.
Any idea what might cause this? My guess is something with DOMDocument, which I am not 100% familiar with. Or can storing too much data cause trouble? It could be a really basic beginner issue, because I am brand new, and clueless. Please give me some advice on where to look for the problem.