
As you have probably realized from the title, I am using Scrapy and XPath to extract data. I feed the XPath expressions to the spider from a file (to keep the spider generic, so I don't have to edit it often), and I am able to extract the data in the required format.

Now I want to verify that the XPath expressions (relative to the webpage specified in the spider) are still valid, in case the page's HTML structure has changed and invalidated them. I want to perform this check before the spider starts.

How do I test my XPaths' correctness? Is there any way to do this kind of truth testing? Please help.

import json

import scrapy

# gx is the module/function that supplies the XPath expressions read from file
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["file:///<filepath>.html"]

    def __init__(self):
        self.mt = ""

    def parse(self, response):
        respDta = dict()
        it_lst = []
        dtData = response.selector.xpath(gx.spcPth[0])
        for ra in dtData:
            commodityObj = ra.xpath(gx.spcPth[1])
            extracted = commodityObj.extract()
            cmdNme = extracted[0].replace(u'\xa0', u' ')
            cmdNme = cmdNme.replace("Header text: ", '')
            self.populate_item(response, respDta, cmdNme, it_lst, extracted[0])
        respDta["mt"] = self.mt
        jsonString = json.dumps(respDta, default=lambda o: o.__dict__)
        return jsonString

gx.spcPth comes from another function that provides the XPath expressions, and it is used in many places throughout the rest of the code. I need to check every XPath expression before the spider starts, wherever one is used.
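A pre-start syntax check of this kind could be sketched as follows (assuming gx.spcPth is a plain list of XPath strings; lxml, which Scrapy's selectors wrap, raises XPathSyntaxError when an expression fails to compile):

```python
# Sketch of a pre-start XPath syntax check for expressions loaded from file.
# Compiling with lxml.etree.XPath validates the syntax without needing a document.
from lxml import etree

def find_invalid_xpaths(expressions):
    """Return (expression, error message) pairs for expressions that fail to compile."""
    invalid = []
    for expr in expressions:
        try:
            etree.XPath(expr)  # compiling the expression checks its syntax
        except etree.XPathSyntaxError as exc:
            invalid.append((expr, str(exc)))
    return invalid

# the second expression is missing a closing bracket, so it is reported
print(find_invalid_xpaths(['//div/span/text()', '//div[@class="x"']))
```

Note that this only catches syntax errors; an expression that is syntactically valid but no longer matches anything on the page can only be detected by running it against the actual response.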


6 Answers


The best option for testing how Scrapy will use the XPaths you give your spider is to simply use the Scrapy Shell:

$ scrapy shell <url>

This gives you a sel object on which you can run XPaths:

>>> sel.xpath('//title/text()')

If you want some really quick testing, install the "XPath Helper" Chrome extension. It is my favorite extension for quickly testing and working out XPaths:

XPath Helper

You just visit a site in Chrome, press Ctrl+Shift+X, and type in an XPath. You will see the results on the right.

answered 2014-12-09T13:34:34.817

The Scrapy shell is an interactive shell where you can try out and debug your scraping code very quickly.

Reference: http://doc.scrapy.org/en/latest/topics/shell.html

The shell is used to test XPath or CSS expressions and see how they work and what data they extract from the web pages you are trying to scrape.

answered 2014-12-09T13:32:04.243

The shell is the way to go. If needed, you can even invoke it from within your spider, as described in the documentation; I have found this useful at times.

answered 2014-12-10T09:28:41.380

I understand what you are doing; I just don't understand why. Running the spider is itself your "test" process, and it is as simple as this: if an XPath result is empty when it should have returned something, then something is wrong. Why not check the XPath results and flag empty ones through the Scrapy log as a warning, error, or critical message, whatever you want? It's this simple:

from scrapy import log

somedata = response.xpath(my_supper_dupper_xpath)
# we know that this should have captured
# something, so we check
if not somedata:
    log.msg("This should never happen, XPath's are all wrong, OMG!", level=log.CRITICAL)
else:
    # do your actual parsing of the captured data,
    # which we now know exists
    pass

After that, just run your spider and relax. When you see a critical message in the output log, you will know something is broken. Otherwise, everything is fine.

answered 2014-12-09T13:35:43.940

You should not only make sure you got a 200 response code, but also check what the actual response looks like:

view(response)

Then, as JoneLinux said, you need to use

scrapy shell 'URL'

But instead of sel.xpath(),

you should use:

response.xpath('//YourXpath...')
answered 2021-03-15T09:54:37.527

Here is a simple way to validate an XPath using Selectors:

from scrapy.selector import Selector

try:
    my_xpath = '//div/some/xpath'
    Selector(text="").xpath(my_xpath)
    print("valid xpath")
except ValueError as e:
    print(e)
answered 2020-10-08T00:15:58.870