
I need to crawl a website, which basically has links like this:

www.website.com/link/page_1.html
www.website.com/link/page_2.html
www.website.com/link/page_3.html
...

The scraped content is going directly into the database through pipelines.

It is easy to tell Django something like:

if item exists do not insert it, otherwise insert it
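That check can be sketched as a Scrapy item pipeline. This is a minimal sketch, not the asker's actual code: the item field name `url` is an assumption, the in-memory set stands in for a real database lookup (e.g. a Django ORM `.exists()` or `get_or_create()` call), and `DropItem` is stubbed locally where a real project would import it from `scrapy.exceptions`:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class DedupPipeline:
    """Insert an item only if it has not been stored before."""

    def __init__(self):
        # Stands in for the database's uniqueness check; in a real
        # project this would be a query against the Django model.
        self.seen = set()

    def process_item(self, item, spider):
        key = item["url"]  # assumed unique key field
        if key in self.seen:
            # Already in the database: skip insertion.
            raise DropItem("already stored: %s" % key)
        self.seen.add(key)
        # ... here the item would be inserted via the Django ORM ...
        return item
```

Note this only prevents duplicate inserts; it does not stop Scrapy from re-downloading pages it has already crawled.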

But is there any way to scrape only the links that have been added since the last scrape?

For example, after website.com inserts new items:

the content at /link/page_1.html shifts to /link/page_2.html
new items populate /link/page_1.html

At this point, how do I tell Scrapy to scrape only the items added since the last scrape?


1 Answer


The latest Scrapy supports serializing requests to disk [1], and there is also Rolando's Redis integration [2].
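Request serialization is enabled by giving the crawl a job directory. A minimal sketch, assuming a `settings.py` in your project and a spider named `myspider` (both names are placeholders):

```python
# settings.py fragment (project and path names are assumptions).
# With JOBDIR set, Scrapy persists the pending-request queue and the
# dupefilter's seen-request fingerprints to disk, so a stopped crawl
# can be resumed without re-issuing requests it already made.
JOBDIR = "crawls/myspider-1"
```

The same setting can be passed on the command line instead, e.g. `scrapy crawl myspider -s JOBDIR=crawls/myspider-1`; re-running the same command resumes where the previous run stopped.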

answered 2012-07-03T22:08:30.717