python - Using Scrapy for XML pages

Question

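I'm trying to scrape multiple pages from an API to practise and develop my XML scraping. One problem that has come up is that when I try to scrape a document formatted like this: http://i.imgur.com/zJqeYvG.png and store it as XML, it fails to do so.

In CMD, it fetches the URLs and creates the XML file on my computer, but the file has nothing in it.

How would I fix it so that it outputs the whole document, or even part of it?

I have put the code below: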

from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector
from doitapi.items import DoIt
import random

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["do-it.org.uk"]
    start_urls = []
    number = []
    for count in range(100):
        number.append(random.randint(2000000,2500000))


    for i in number:
        start_urls.append("http://www.do-it.org.uk/syndication/opportunities/%d?apiKey=XXXXX-XXXX-XXX-XXX-XXXXX" %i)



    def parse(self, response):
        xxs = XmlXPathSelector(response)
        titles = xxs.register_namespace("d", "http://www.do-it.org.uk/volunteering-opportunity")
        items = []
        for titles in titles:
            item = DoIt()
            item["url"] = response.url
            item["name"] = titles.select("//d:title").extract()
            item["description"] = titles.select("//d:description").extract()
            item["username"] = titles.select("//d:info-provider/name").extract()
            item["location"] = titles.select("//d:info-provider/address").extract()
            items.append(item)
        return items

score 4 · Accepted Answer

Your XML file uses the namespace "http://www.do-it.org.uk/volunteering-opportunity", so to select title, name, etc. you have 2 choices:

Either call xxs.remove_namespaces() once, then use .select("./title"), .select("./description"), and so on.
Or register the namespace once under a prefix such as "doit", with xxs.register_namespace("doit", "http://www.do-it.org.uk/volunteering-opportunity"), then use .select("./doit:title"), .select("./doit:description"), and so on.
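
The difference between unprefixed and prefixed lookups can be tried outside Scrapy. The following is a minimal sketch that uses Python's standard-library xml.etree.ElementTree instead of XmlXPathSelector. The sample document, its opportunity root element and its text values are made up for illustration, since the real feed appears only as an image in the question. The last step strips namespaces by hand, which is only a rough stand-in for the remove_namespaces() option.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the namespace URI comes from the question.
SAMPLE = """
<opportunity xmlns="http://www.do-it.org.uk/volunteering-opportunity">
  <title>Garden helper</title>
  <description>Help maintain a community garden.</description>
</opportunity>
"""

# Prefix-to-URI mapping; the prefix "doit" is arbitrary, the URI must match exactly.
NS = {"doit": "http://www.do-it.org.uk/volunteering-opportunity"}

root = ET.fromstring(SAMPLE.strip())

# An unprefixed name means "title in no namespace", so nothing matches.
unprefixed = root.findall("./title")

# A prefixed name is resolved through NS and matches the namespaced element.
prefixed = [el.text for el in root.findall("./doit:title", NS)]

# Alternative: drop the "{uri}" part of every tag, then query without prefixes.
for el in root.iter():
    el.tag = el.tag.split("}", 1)[-1]
stripped = [el.text for el in root.findall("./title")]

print(len(unprefixed))  # 0
print(prefixed)         # ['Garden helper']
print(stripped)         # ['Garden helper']
```

Registering a prefix keeps the document untouched and stays unambiguous if a feed mixes several namespaces. Stripping namespaces gives shorter queries when, as assumed here, only one namespace is in play.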

For more details on XML namespaces, see this page in the FAQ and this page in the docs.
