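I'm having some trouble using XPath with Scrapy.

I'm looking at the links in a table. In the browser, Inspect Element shows the full link, but the Scrapy shell cuts off the end of the link.

Example link from the table: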

    http://www.ashp.org/DrugShortages/Current/Bulletin.aspx?id=463

When inspecting the element:

    <a href="/DrugShortages/Current/Bulletin.aspx?id=463">

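Extracting it in the Scrapy shell drops the 463.

Any ideas?

Here is the spider's code. It isn't actually set up to crawl the links yet; I figured I would get everything working with the correct XPath syntax first.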

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from ashp.items import AshpItem

    class MySpider(BaseSpider):
        name = "ashp"
        allowed_domains = ["ashp.org"]
        start_urls = ["http://ashp.org/menu/DrugShortages/CurrentShortages"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # Select every <span class="pl"> on the page.
            titles = hxs.select("//span[@class='pl']")
            for span in titles:
                # Link text and href of the <a> directly inside the span.
                title = span.select("a/text()").extract()
                link = span.select("a/@href").extract()
                print title, link

1 Answer

I think your XPath is incorrect. Here is a spider that prints all the links on the Bulletin page:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector


    class MySpider(BaseSpider):
        name = "ashp"
        allowed_domains = ["ashp.org"]
        start_urls = ["http://ashp.org/menu/DrugShortages/CurrentShortages"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # Anchors inside the table cells of the Mid_3Col container.
            links = hxs.select("//div[@id='Mid_3Col']/div/table/tr/td/a")
            for link in links:
                # extract() returns a list; take the first (only) match.
                title = link.select("text()").extract()[0]
                href = link.select("@href").extract()[0]
                print title, href

Output:

    Acetazolamide Injection /DrugShortages/Current/Bulletin.aspx?id=463
    Acetylcysteine Inhalation Solution /DrugShortages/Current/Bulletin.aspx?id=932
    Acyclovir Injection /DrugShortages/Current/Bulletin.aspx?id=467
    Adenosine Injection /DrugShortages/Current/Bulletin.aspx?id=976
    Alcohol Dehydrated Injection (Ethanol) /DrugShortages/Current/Bulletin.aspx?id=778
    Allopurinol Injection /DrugShortages/Current/Bulletin.aspx?id=998
    ...
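
Note that the extracted hrefs are site-relative (they start with `/`), while the example link in the question is absolute. If you want the full URL, one option is to join each href with a base URL. The following is only a minimal sketch, written for Python 3 (in Python 2 the import would be `from urlparse import urljoin`). `BASE_URL` and `absolute_link` are names made up for this illustration, and the host is assumed from the example link in the question:

```python
from urllib.parse import urljoin

# Assumption: the host is taken from the example link in the question.
BASE_URL = "http://www.ashp.org/"


def absolute_link(href, base=BASE_URL):
    """Resolve a (possibly relative) href against a base URL."""
    return urljoin(base, href)


# The relative href from the question, query string included.
full = absolute_link("/DrugShortages/Current/Bulletin.aspx?id=463")
print(full)
```

Inside a spider you would probably pass `response.url` as the base rather than a hard-coded constant, so the result follows whatever page was actually fetched.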
answered 2013-09-08T17:26:21.707