xpath - Scrapy + Splash: scraping elements inside an inner HTML document

Question

I am using Scrapy + Splash to crawl web pages and am trying to extract data from Google ad banners and other ads, but I am having trouble reaching them with XPath.

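I am using the Scrapy-Splash API to render the pages, so that their scripts and images load, and to take screenshots. However, it seems the Google ad banners are created by JS scripts, which then insert their content into a new HTML document inside an iframe in the page.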

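Splash makes sure the code is rendered, so I don't run into the usual problem of Scrapy reading the script source instead of the generated HTML. Still, I can't find a way to write the XPath needed to reach the element node I need (the ad's href link).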

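If I inspect the element in Chrome and copy its XPath, it just gives me //*[@id="aw0"]. I feel this would work if the iframe's HTML were the whole document, but it returns empty however I write it. I suspect this is because XPath doesn't gracefully handle HTML documents nested inside HTML documents.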

The XPath of the iframe that contains the Google ad code is //*[@id="google_ads_iframe_/87824813/hola/blogs/home_0"] (the numbers are constant).

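Is there a way to chain these XPaths so that Scrapy follows them into the container I need? Or should I parse the Splash response object directly in some other way, rather than relying on response.xpath / response.css?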

score 4 · Accepted Answer

The problem is that iframe content is not returned as part of the HTML. You can either fetch the iframe content directly (via its src), or use the render.json endpoint with the iframes=1 option:

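import parsel
from scrapy_splash import SplashRequest

# ... inside a spider callback:
    yield SplashRequest(url, self.parse_result, endpoint='render.json',
                        args={'html': 1, 'iframes': 1})

def parse_result(self, response):
    # With iframes=1, render.json lists each child frame under 'childFrames'
    iframe_html = response.data['childFrames'][0]['html']
    sel = parsel.Selector(text=iframe_html)
    item = {
        'my_field': sel.xpath(...),
        # ...
    }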

As of Splash 2.3.3, the /execute endpoint does not support fetching iframe content.
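The snippet above takes the first child frame, but an ad iframe is not necessarily first, and it may sit inside another frame. The following is a minimal sketch, assuming the render.json result nests each frame's own children under a `childFrames` key, as the Splash documentation describes. The helper names `iter_frames` and `find_frame_html` are made up for illustration and are not part of Splash or scrapy-splash.

```python
def iter_frames(frame):
    """Yield every frame dict nested below `frame`, depth-first."""
    for child in frame.get("childFrames", []):
        yield child
        yield from iter_frames(child)


def find_frame_html(data, marker):
    """Return the HTML of the first nested frame whose HTML contains
    `marker`, or None if no frame matches.

    `data` is assumed to be the decoded render.json result
    (response.data in scrapy-splash) requested with html=1 and iframes=1.
    """
    for frame in iter_frames(data):
        html = frame.get("html") or ""
        if marker in html:
            return html
    return None
```

In a callback this could be used as `find_frame_html(response.data, 'id="aw0"')`, with the returned string passed to `parsel.Selector(text=...)` so that `//*[@id="aw0"]/@href` is evaluated against the iframe's own document rather than the outer page.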

score 0 · Accepted Answer

Another way to handle iframes (working from the main page's response) is:

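    urls = response.css('iframe::attr(src)').extract()
    for url in urls:
        # parse_iframe is a hypothetical callback that parses the iframe page
        yield response.follow(url, callback=self.parse_iframe)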

This way the iframe is parsed like a normal page, but at the moment I can't send the cookies from the main page to the HTML inside the iframe, which is a problem.
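One caveat with following src attributes: the values may be relative, protocol-relative, empty, or non-HTTP placeholders such as about:blank, which is common for iframes that scripts fill in after load. The following is a rough sketch of cleaning the list before issuing requests, using only the standard library. The function name `resolve_iframe_urls` is an assumption for illustration.

```python
from urllib.parse import urljoin, urlparse


def resolve_iframe_urls(page_url, srcs):
    """Turn raw iframe src values into absolute http(s) URLs.

    Empty values and non-HTTP schemes (about:, javascript:, data:, ...)
    are dropped, because there is nothing to download for them.
    """
    resolved = []
    for src in srcs:
        src = (src or "").strip()
        if not src:
            continue
        absolute = urljoin(page_url, src)
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved
```

If an ad iframe turns out to have no usable src, this approach has nothing to request, and the render.json route from the accepted answer is likely the better fit.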
