python-2.7 - How to extract a background URL from a CSS definition with Scrapy

Question

How can I extract image_1.png (with Scrapy) from something like this:

<html><body>
<style type="text/css">
img.article_image[class] 
{
    background-image:url('/article_images/image_1.png');
}
</style>    
<img class="article_image">
</body></html>

The only idea I have come up with is running a regular expression over the HTML source. Is there anything more elegant?

score 2 · Accepted Answer

Item loaders are a great tool; they have XPath and regex support built in.

XPathItemLoader(response=response).get_xpath(xpath, re=regex)

http://doc.scrapy.org/en/latest/topics/loaders.html

>>> from scrapy.contrib.loader import XPathItemLoader
>>> response.body
'<html><body>\n<style type="text/css">\nimg.article_image[class] \n{\n...'
>>> xl = XPathItemLoader(response=response, item={'image': ''})
>>> xl
<scrapy.contrib.loader.XPathItemLoader object at 0x7f5830079f50>
>>> xl.get_xpath('//style', re=r"background-image.*/([^/]+)'")
[u'image_1.png']
>>> xl.add_xpath('image', '//style', re=r"background-image.*/([^/]+)'")
>>> xl.load_item()
{'image': [u'image_1.png']}
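
To see what the regular expression in the session above captures without a running crawler, here is a minimal sketch using only the standard library's re module. The style_text variable is a hypothetical stand-in for the text of the <style> element from the question; note that the pattern assumes the URL is wrapped in single quotes.

```python
import re

# Hypothetical stand-in for the text of the <style> element in the question.
style_text = """
img.article_image[class]
{
    background-image:url('/article_images/image_1.png');
}
"""

# Same pattern as in the session above. The greedy ".*" runs to the end of
# the line and then backtracks to the last "/" that is followed by
# non-slash characters and a closing single quote.
pattern = re.compile(r"background-image.*/([^/]+)'")

filenames = pattern.findall(style_text)
print(filenames)
```

Because the capture group excludes slashes, only the last path segment (the file name) is returned, not the full URL.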

score 1 · Accepted Answer

You can locate the CSS with an XPath query, but you still have to use a regular expression to extract the image path from it.

So I think running a regular expression over the whole body is a fine solution.
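
A regex over the whole body could look like the sketch below. It is only an illustration: body is a hypothetical copy of the page from the question (in a spider it would come from the response), and background_image_names and BACKGROUND_URL are names made up for this example. The pattern is an assumption about what the CSS looks like; it accepts single-quoted, double-quoted, or unquoted url(...) values but is not a full CSS parser.

```python
import posixpath
import re

# Hypothetical copy of the page source from the question.
body = """<html><body>
<style type="text/css">
img.article_image[class]
{
    background-image:url('/article_images/image_1.png');
}
</style>
<img class="article_image">
</body></html>"""

# Capture whatever sits inside url(...), with optional quotes around it.
BACKGROUND_URL = re.compile(
    r"""background-image\s*:\s*url\(\s*['"]?([^'")]+?)['"]?\s*\)""",
    re.IGNORECASE,
)


def background_image_names(html):
    """Return the file name of every background-image URL found in html."""
    return [posixpath.basename(url) for url in BACKGROUND_URL.findall(html)]


print(background_image_names(body))
```

Capturing the full URL first and taking the base name afterwards keeps the path available in case the image has to be downloaded later.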
