python - Scraping text without JavaScript code using Scrapy

Question

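I am currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the target sites.

The problem: sometimes my target node contains a <script> tag, so the scraped text contains JavaScript code.

Here is a link to a real example of what I'm working with. In this case, my target node is //td[@id='contenuStory']. The problem is that there is a <script> tag in the first child div.

I've spent a lot of time searching the web and SO for a solution, but I couldn't find anything. I hope I'm not missing something obvious!

Example

HTML response (target node only):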

<div id="content">
    <div id="part1">Some text</div>
    <script>var s = 'javascript I don't want';</script>
    <div id="part2">Some other text</div>
</div>

What I want in the item:

Some text
Some other text

What I get:

Some text
var s = 'javascript I don't want';
Some other text

My code

Given an XPath selector, I use the following function to extract the text:

def getText(hxs):
    if len(hxs) > 0:
        l = hxs.select('string(.)')
        if len(l) > 0:
            s = l[0].extract().encode('utf-8')
        else:
            s = hxs[0].extract().encode('utf-8')
        return s
    else:
        return 0

I have tried using XPath axes (things like child::script), but to no avail.

score 5 · Accepted Answer

Try the utility functions in w3lib.html:

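from w3lib.html import remove_tags, remove_tags_with_content

# extract() returns a list of strings; the w3lib helpers expect a single string
html = hxs.select('//div[@id="content"]').extract()[0]
# drop <script> elements together with their content, then strip the remaining tags
output = remove_tags(remove_tags_with_content(html, ('script', )))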

score 2 · Accepted Answer

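You can append [not(ancestor-or-self::script)] to your XPath expression.

This keeps script content out of the result. You can extend the same predicate to exclude other elements, for example [not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)], so that no script, noscript, or CSS content that is not part of the text gets captured.

Example: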

//article//p//text()[not (ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)]
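To see the predicate in action outside a spider, here is a small sketch that runs it against the sample HTML from the question. It uses lxml directly (an assumption made for the sake of a standalone example; inside Scrapy you would pass the same XPath to your selector), and the variable names are illustrative.

```python
import lxml.html

# Sample markup from the question, wrapped in a full document.
html = """<html><body>
<div id="content">
    <div id="part1">Some text</div>
    <script>var s = 'javascript I don't want';</script>
    <div id="part2">Some other text</div>
</div>
</body></html>"""

doc = lxml.html.document_fromstring(html)

# Select every text node under the target, except those inside
# <script>, <noscript>, or <style> elements.
xpath = (
    '//div[@id="content"]//text()'
    "[not(ancestor-or-self::script or ancestor-or-self::noscript "
    "or ancestor-or-self::style)]"
)
text_nodes = doc.xpath(xpath)

# The result also contains whitespace-only nodes from the indentation,
# so drop those and trim the rest.
clean_texts = [t.strip() for t in text_nodes if t.strip()]
print(clean_texts)
```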

score 1 · Accepted Answer

You can try this XPath expression:

hxs.select('//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()

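That is, all child text nodes of //td[@id='contenuStory'] and of its descendants, as long as the parent element is not a script node.

To add a space between the text nodes, you can use something like this: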

u' '.join(
    hxs.select(
        '//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
)
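Joining the raw nodes with a space can leave runs of whitespace, because the indentation between tags shows up as whitespace-only text nodes. A small helper can tidy that up. This is a sketch only: join_text_nodes is a hypothetical name, and the input list mimics what extract() might return for the sample HTML in the question.

```python
def join_text_nodes(nodes):
    """Join extracted text nodes with single spaces.

    Whitespace-only nodes are dropped and internal runs of
    whitespace are collapsed.
    """
    parts = (" ".join(node.split()) for node in nodes)
    return " ".join(part for part in parts if part)


# Hypothetical result of .extract() for the sample HTML.
nodes = ["\n    ", "Some text", "\n    ", "\n    ", "Some other text", "\n"]
result = join_text_nodes(nodes)
print(result)
```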
