python - Scrapy LxmlLinkExtractor and relative URLs

Question

The correct URL that my rule should end up with is: http://www.lecture-en-ligne.com/towerofgod/168/0/0/1.html

Scrapy picks up the relative URL from the page source just fine:

<a class="table" href="../../towerofgod/168/0/0/1.html">Lire en ligne</a>

But then it crawls it badly, treating the leading ../../ as a literal part of the next URL to fetch...

Should I use a custom process_value to convert the doubly relative URLs obtained from LxmlLinkExtractor?
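For reference, a minimal sketch of what such a process_value hook could look like. It assumes the value handed to the hook is already an absolute URL that still carries leftover ".." segments, as in the retry log in this question. The names remove_dot_segments and clean_link are hypothetical helpers, not part of Scrapy.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path):
    """Collapse "." and ".." segments, never climbing above the root."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == "..":
            # Keep the leading empty segment that stands for the root "/".
            if len(output) > 1:
                output.pop()
        elif segment != ".":
            output.append(segment)
    # A path ending in "." or ".." refers to a directory, so keep the trailing slash.
    if segments and segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)


def clean_link(value):
    """Hypothetical process_value hook: normalise the path of an extracted URL."""
    parts = urlsplit(value)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))


# Assumed wiring, using the extractor from this question:
#   LxmlLinkExtractor(..., process_value=clean_link)
print(clean_link("http://www.lecture-en-ligne.com/../../towerofgod/160/0/0/1.html"))
```

On Python 3.5 and later, urljoin already drops ".." segments that point above the root, so a hook like this may only matter on older interpreters.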

Does Scrapy handle relative URLs correctly? I mean, is this the expected behavior?

2014-12-06 17:20:05+0100 [togspider] DEBUG: Crawled (200) <GET http://www.lecture-en-ligne.com/manga/towerofgod/> (referer: None)

2014-12-06 17:20:05+0100 [togspider] DEBUG: Retrying <GET http://www.lecture-en-ligne.com/../../towerofgod/160/0/0/1.html> (failed 1 times): 400 Bad Request

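# Import paths for Scrapy 1.0+; older releases used scrapy.contrib.*
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class TogSpider(CrawlSpider):
    name = "togspider"
    allowed_domains = ["lecture-en-ligne.com"]
    start_urls = ["http://www.lecture-en-ligne.com/manga/towerofgod/"]

    rules = (
        # Follow only the link matched by the XPath; parse_chapter is not shown here
        Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                               restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a'),
             callback='parse_chapter'),
    )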

score 1 · Accepted Answer

The problem is that the page's HTML contains an incorrect base element, which is supposed to specify the base URL for all relative links in the page:

<base href="http://www.lecture-en-ligne.com/"/>

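Scrapy respects this: it resolves ../../towerofgod/... against the site root given by the base element rather than against the page URL, and the .. segments that point above the root are left in the request URL, as seen in the log. That is why the links are formed this way.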
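To illustrate the effect of the base element, here is a stdlib-only sketch of how the effective base URL for a page can be determined. It is not Scrapy's actual code. BaseHrefFinder and effective_base_url are hypothetical names, and the html/head/body wrapper around the two fragments quoted in this post is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Record the href of the first <base> element in a document."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <base href="..."/>.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def effective_base_url(html, page_url):
    """Return the <base href> if the page declares one, else the page URL."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href:
        return urljoin(page_url, finder.base_href)
    return page_url


PAGE_URL = "http://www.lecture-en-ligne.com/manga/towerofgod/"
HTML = (
    '<html><head><base href="http://www.lecture-en-ligne.com/"/></head>'
    '<body><a class="table" href="../../towerofgod/168/0/0/1.html">Lire en ligne</a>'
    "</body></html>"
)

base = effective_base_url(HTML, PAGE_URL)
print("links resolve against:", base)
print("resolved link:", urljoin(base, "../../towerofgod/168/0/0/1.html"))
```

With the base element present, the link is resolved against the site root, so both ".." segments point above it. Python 2's urljoin kept such segments in the result, which matches the URL in the 2014 log; urljoin on Python 3.5 and later discards them.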
