python - Extracting multiple values from a Python ElementTree with lxml and xpath

Question

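I'm almost certain I'm going about this entirely the wrong way and that my problem stems from my own ignorance, but reading the Python docs and examples hasn't helped.

I'm web scraping. The pages I'm scraping have the following salient elements: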

<div class='parent'>
   <span class='title'>
      <a>THIS IS THE TITLE</a>
   </span>
   <div class='copy'>
      <p>THIS IS THE COPY</p>
   </div>
</div>

My goal is to extract the text nodes from 'title' and 'copy', grouped by their parent div. In the example above, I'd like to retrieve the tuple ('THIS IS THE TITLE', 'THIS IS THE COPY').

Here is my code:

## 'tree' is the ElementTree of the document I've just pulled 
xpath = "//div[@class='parent']"
filtered_html = tree.xpath(xpath)

arr = []

for i in filtered_html:

   title_filter = "//span[@class='title']/a/text()"  # xpath for title text
   copy_filter = "//div[@class='copy']/p/text()"      # xpath for copy text

   title = i.getroottree().xpath(title_filter)
   copy = i.getroottree().xpath(copy_filter)
   arr.append((title, copy))

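I expected filtered_html to be a list of n elements (which it is). I then try to iterate over that list and, for each element, convert it to an ElementTree and retrieve the title and copy text with another xpath expression. So at each iteration I expect title to be a list of length 1 containing the title text for element i, and copy to be the corresponding list for the copy text.

What I actually get: at every iteration, title is a list of length n containing every element in the document that matches the title_filter xpath expression, and copy is the corresponding list of length n for the copy text.

I'm sure that by now anyone who knows what they're doing with xpath and etree can see that I'm doing something horrible, wrong, and stupid. If so, could they tell me how I should be doing this instead?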

score 2 · Accepted Answer

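Your core problem is that the getroottree call you make on each element resets you to running the xpath over the whole tree. getroottree does exactly what it sounds like: it returns the root element tree of the element you call it on. If you leave that call out, it looks to me like you'll get what you want. One caveat: an expression that starts with // is evaluated from the document root even when you call xpath on an element, so both filters also need a leading dot (.//span[...] and .//div[...]) to stay inside the current div.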
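To make the difference concrete, here is a minimal sketch that runs the same title filter once against the root tree and once relative to each parent div. The two-item PAGE string and the variable names are made up for illustration, modelled on the markup in the question; it assumes lxml is installed.

```python
from lxml import html

# Hypothetical two-item page, modelled on the markup in the question.
PAGE = """
<html><body>
  <div class='parent'>
    <span class='title'><a>TITLE 1</a></span>
    <div class='copy'><p>COPY 1</p></div>
  </div>
  <div class='parent'>
    <span class='title'><a>TITLE 2</a></span>
    <div class='copy'><p>COPY 2</p></div>
  </div>
</body></html>
"""

root = html.fromstring(PAGE)
parents = root.xpath("//div[@class='parent']")

# Absolute path on the root tree: every title in the document, every time.
absolute = [
    p.getroottree().xpath("//span[@class='title']/a/text()") for p in parents
]

# Relative path (leading dot) on the element: only the title inside that div.
relative = [p.xpath(".//span[@class='title']/a/text()") for p in parents]

# The grouped result the question asks for: one (title, copy) pair per parent.
arr = [
    (
        p.xpath(".//span[@class='title']/a/text()"),
        p.xpath(".//div[@class='copy']/p/text()"),
    )
    for p in parents
]
```

Each entry of absolute holds both titles, which is the behaviour described in the question, while each entry of relative holds only the title that belongs to that div.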

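Personally, I would use the iterfind method on the element tree for my main loop, and would probably use the findtext method on the resulting elements to make sure I get exactly one title and one copy.

My (untested!) code would look like this: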

# iterfind/findtext take ElementPath expressions, which must be relative (leading dot)
parent_div_xpath = ".//div[@class='parent']"
title_filter = ".//span[@class='title']/a"
copy_filter = ".//div[@class='copy']/p"
arr = [(i.findtext(title_filter), i.findtext(copy_filter)) for i in tree.iterfind(parent_div_xpath)]
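The iterfind/findtext approach can be tried without a live page. The sketch below swaps in the standard library's xml.etree.ElementTree (an assumption made only to keep the example dependency-free; it accepts the same path syntax for these expressions), and the SAMPLE document is invented for illustration. The second parent deliberately has no copy, to show what findtext returns when nothing matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the second parent has no 'copy' div.
SAMPLE = """
<body>
  <div class='parent'>
    <span class='title'><a>FIRST TITLE</a></span>
    <div class='copy'><p>FIRST COPY</p></div>
  </div>
  <div class='parent'>
    <span class='title'><a>SECOND TITLE</a></span>
  </div>
</body>
"""

tree = ET.ElementTree(ET.fromstring(SAMPLE))

# findtext returns the text of the first match, or None when nothing matches,
# so every parent yields exactly one (title, copy) tuple.
pairs = [
    (
        div.findtext(".//span[@class='title']/a"),
        div.findtext(".//div[@class='copy']/p"),
    )
    for div in tree.iterfind(".//div[@class='parent']")
]
```

Because findtext yields a single string or None rather than a list, a missing element shows up as None in its own tuple instead of shifting later results.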

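Alternatively, you can skip the explicit iteration entirely:

title_filter = "//div[@class='parent']/span[@class='title']/a/text()"
copy_filter = "//div[@class='parent']/div[@class='copy']/p/text()"
arr = zip(tree.xpath(title_filter), tree.xpath(copy_filter))  # on Python 2, itertools.izip

Note that text() works with xpath() but not with findall, which only understands the more limited ElementPath syntax. If you would rather use findall, remove text() from the two filters, give each a leading dot, and read .text in a generator expression instead, something like:

arr = zip((title.text for title in tree.findall(title_filter)), (copy.text for copy in tree.findall(copy_filter)))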

If a parent div can contain more than one title/copy pair, you may need to adjust that xpath.
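One possible adjustment for that case, sketched here with the standard library's xml.etree.ElementTree and an invented SAMPLE document, is to collect all titles and copies per parent and zip them inside each div. This assumes titles and copies alternate in matching order within a parent, which the original page may or may not guarantee.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the first parent holds two title/copy pairs.
SAMPLE = """<body>
  <div class='parent'>
    <span class='title'><a>T1</a></span><div class='copy'><p>C1</p></div>
    <span class='title'><a>T2</a></span><div class='copy'><p>C2</p></div>
  </div>
  <div class='parent'>
    <span class='title'><a>T3</a></span><div class='copy'><p>C3</p></div>
  </div>
</body>"""

root = ET.fromstring(SAMPLE)

grouped = []
for div in root.iterfind(".//div[@class='parent']"):
    # Paths without a leading './/' match direct children of this div only.
    titles = [a.text for a in div.findall("span[@class='title']/a")]
    copies = [p.text for p in div.findall("div[@class='copy']/p")]
    # Pair titles with copies by position, but only within this parent.
    grouped.append(list(zip(titles, copies)))
```

Zipping inside the loop keeps a missing element in one parent from misaligning the pairs of every parent after it, which can happen when two document-wide lists are zipped together.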
