python - How to return results as HTML using HtmlXPathSelector (Scrapy)

Question

How do I retrieve all the HTML contained in a tag?

hxs = HtmlXPathSelector(response)
element = hxs.select('//span[@class="title"]/')

Perhaps something like this:

hxs.select('//span[@class="title"]/html()')

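Edit: Looking at the documentation, I only see methods that return a new XPathSelectorList, or just the raw text inside the tag. What I want to retrieve is neither a new list nor the text, but the HTML source inside the tag. For example: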

<html>
    <head>
        <title></title>
    </head>
    <body>
        <div id="leexample">
            justtext
            <p class="ihatelookingforfeatures">
                sometext
            </p>
            <p class="yahc">
                sometext
            </p>
        </div>
        <div id="lenot">
            blabla
        </div>
    an awfuly long example for this.
    </body>
</html>

I would like something like hxs.select('//div[@id="leexample"]/html()') that returns the HTML inside the tag, like this:

justtext
<p class="ihatelookingforfeatures">
    sometext
</p>
<p class="yahc">
    sometext
</p>

I hope this clears up the ambiguity around my question.

How do I get the HTML out of an HtmlXPathSelector in Scrapy? (Or perhaps there is a solution outside the scope of Scrapy?)

score 6 · Accepted Answer

Call .extract() on your XPathSelectorList. It should return a list of unicode strings containing the HTML content you want.

hxs.select('//div[@id="leexample"]/*').extract()

Update

# This is wrong
hxs.select('//div[@id="leexample"]/html()').extract()

/html() is not a valid Scrapy selector. To extract all children, use '//div[@id="leexample"]/*' or '//div[@id="leexample"]/node()'. Note that node() also returns text nodes, so the result looks something like:

[u'\n',
 u'<a href="image1.html">Name: My image 1 
'
]
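
To illustrate the difference between /* and /node() without Scrapy, here is a minimal sketch using only the standard library's xml.dom.minidom. It assumes well-formed markup (minidom is an XML parser, not an HTML one). SAMPLE and child_nodes_markup are names made up for this example, and SAMPLE is a whitespace-trimmed copy of the div from the question.

```python
from xml.dom import minidom

# Hypothetical sample: the question's div with the whitespace trimmed.
SAMPLE = (
    '<div id="leexample">justtext'
    '<p class="ihatelookingforfeatures">sometext</p>'
    '<p class="yahc">sometext</p>'
    '</div>'
)


def child_nodes_markup(element, elements_only=False):
    """Serialize each direct child of `element` to a string.

    elements_only=False keeps text nodes too (roughly what /node() selects);
    elements_only=True keeps child elements only (roughly what /* selects).
    """
    return [
        child.toxml()
        for child in element.childNodes
        if not elements_only or child.nodeType == child.ELEMENT_NODE
    ]


div = minidom.parseString(SAMPLE).documentElement
all_nodes = child_nodes_markup(div)                          # text + elements
only_elements = child_nodes_markup(div, elements_only=True)  # elements only
inner_html = ''.join(all_nodes)                              # one string
```

Joining all_nodes gives the inner markup as a single string, while only_elements drops the leading "justtext" because it is a text node rather than an element.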

score 3 · Answer

Use:

//span[@class="title"]/node()

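This selects all nodes (elements, text nodes, processing instructions, and comments) that are children of any span element in the XML document whose class attribute has the value "title".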

If you only want the children of the first such span in the document, use:

(//span[@class="title"])[1]/node()

score 1 · Answer

Late to the party, but I am leaving this here for the record.

What I do is:

html = ''.join(hxs.select('//span[@class="title"]/node()').extract())

Or, if the query matches several elements and we want the inner HTML of each:

elements = hxs.select('//span[@class="title"]')
html = [''.join(e.select('./node()').extract()) for e in elements]

score 0 · Answer

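It is actually not as hard as it seems. Just remove the trailing / from the XPath query and use the extract() method. I ran an example in the scrapy shell; here is a shortened version: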

sjaak:~ sjaakt$ scrapy shell
2012-07-19 11:06:21+0200 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
>>> fetch('http://www.nu.nl')
2012-07-19 11:06:34+0200 [default] INFO: Spider opened
2012-07-19 11:06:34+0200 [default] DEBUG: Crawled (200) <GET http://www.nu.nl> (referer: None)
>>> hxs.select("//h1").extract()
[u'<h1>    <script type="text/javascript">document.write(NU.today())</script>.\n    Het laatste nieuws het eerst op NU.nl    </h1>\n    ']
>>>

To get only the inner content of the tag, add /* to the XPath query. Example:

>>> hxs.select("//h1/*").extract()
[u'<script type="text/javascript">document.write(NU.today())</script>.\n    Het laatste nieuws het eerst op NU.nl    ']

score 0 · Answer

As @xiaowl pointed out, using hxs.select('//div[@id="leexample"]').extract() retrieves all the HTML content of the tag matched by the XPath query //div[@id="leexample"].

So, for the record, I ended up with:

post = postItem()  # postItem declares body = Field() in item.py
post['body'] = hxs.select('//span[@id="edit' + self.postid + '"]').extract()
# Use a context manager so the log file is closed after writing.
with open('logs/test.log', 'wb') as f:
    f.write(str(post['body']))
# logs/test.log now contains all the HTML of the tag selected by the query.

score 0 · Answer

A bit of a hack (it reaches into the private _root attribute of Selector; works in 1.0.5):

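from lxml import html

def extract_inner_html(sel):
    # Element's leading text plus each direct child (serialized with its tail text).
    # iterchildren() is used because iterdescendants() would emit nested elements twice.
    root = sel._root  # private attribute of Selector
    return (root.text or '') + ''.join(html.tostring(child, encoding='unicode') for child in root.iterchildren())

def extract_inner_text(sel):
    # Join all text nodes under the selection and trim surrounding whitespace.
    return ''.join(sel.css('::text').extract()).strip()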

Use it like this:

reason = extract_inner_html(statement.css(".politic-rating .rate-reason")[0])
text = extract_inner_text(statement.css('.politic-statement')[0])
all_text = extract_inner_text(statement.css('.politic-statement'))

I found the lxml part of the code in this question.
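
For a solution outside Scrapy that does not touch a private attribute, the same two helpers can be sketched with the standard library's xml.etree.ElementTree. This assumes well-formed (XML-like) markup; inner_html, inner_text and the sample string are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def inner_html(element):
    """Return the markup between an element's opening and closing tags.

    ET.tostring() includes each child's tail text, so the element's own
    leading text plus its serialized direct children covers everything.
    """
    return (element.text or '') + ''.join(
        ET.tostring(child, encoding='unicode') for child in element
    )


def inner_text(element):
    """Return all text under the element, with outer whitespace stripped."""
    return ''.join(element.itertext()).strip()


# Hypothetical sample with a nested element and some tail text.
root = ET.fromstring(
    '<div id="leexample">justtext<p class="yahc">some<b>bold</b></p> tail</div>'
)
html_result = inner_html(root)
text_result = inner_text(root)
```

Only direct children are serialized, so the nested b element appears once, inside its parent p, rather than a second time on its own.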
