I am new to Scrapy, and I have never used regular expressions before.

Below is my spider.py code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]
    start_urls = [
        "http://www.example.com/bookstore/new/1?filter=bookstore",
        "http://www.example.com/bookstore/new/2?filter=bookstore",
        "http://www.example.com/bookstore/new/3?filter=bookstore",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
Now, if we look at the start_urls, all three URLs are identical except for the integer value (2?, 3?, and so on; the number of such URLs on the site is not limited). So we can use CrawlSpider instead, and construct a regular expression for the URLs, like this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        "http://www.example.com/bookstore/new/1?filter=bookstore",
        "http://www.example.com/bookstore/new/2?filter=bookstore",
        "http://www.example.com/bookstore/new/3?filter=bookstore",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(........))),  # what pattern goes here?
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
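To make my question concrete, here is a quick standalone check (my own sketch, outside Scrapy) showing that a pattern with \d+ in place of the changing integer matches all three start URLs. I am not sure whether this is the right way to write the pattern for SgmlLinkExtractor's allow parameter:

import re

# Candidate pattern: only the integer segment varies, so \d+ stands in
# for it; the "?" is escaped because it is a regex metacharacter.
pattern = re.compile(r'/bookstore/new/\d+\?filter=bookstore')

urls = [
    "http://www.example.com/bookstore/new/1?filter=bookstore",
    "http://www.example.com/bookstore/new/2?filter=bookstore",
    "http://www.example.com/bookstore/new/3?filter=bookstore",
]

for url in urls:
    assert pattern.search(url), url  # all three URLs match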
Can you guide me on how to construct a crawl rule for the URL list above?
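From reading the CrawlSpider documentation, I imagine the finished spider would look roughly like the sketch below. The allow pattern, the follow=True flag, and the parse_item callback name are my guesses; I do know the callback should not be named parse, since CrawlSpider uses the parse method internally:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    # One entry point should be enough; the rule discovers the other pages.
    start_urls = ["http://www.example.com/bookstore/new/1?filter=bookstore"]

    rules = (
        # Guess: \d+ matches the changing integer, and "?" is escaped.
        Rule(SgmlLinkExtractor(allow=(r'/bookstore/new/\d+\?filter=bookstore',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # ... extraction with hxs.select(...) would go here ...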