python - XPath: select everything except self::strong and self::strong/following-sibling::text()

Question

I have the following sample HTML to parse.

<div>
    <strong>Title:</strong>
    Sub Editor at NEWS ABC

    <strong>Name:</strong>
    John

    <strong>Where:</strong>
    Everywhere

    <strong>When:</strong>
    Anytime

    <strong>Everything can go down there..</strong>

    Lorem Ipsum blah blah blah....
</div>

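I want to extract the whole div, except that I don't want the Title, Where and When labels or the values that follow them.

So far I have tested the following XPaths.

a) Without following-sibling (1: does not work. 2: works)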

1. //div/node()[not(strong[contains(text(), "Title")])]

2. //div/node()[not(self::strong and contains(text(), "Title"))]

b) With following-sibling (1: does not work. 2: does not work)

1. //div/node()[not(strong[contains(text(), "Title")]) and not(strong[contains(text(), "Title")]/following-sibling::text())]

2. //div/node()[not(self::strong and contains(text(), "Title") and following-sibling::text())]

How can I achieve what I'm after?

score 3 · Accepted Answer

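I think the following does what you want: it excludes the strong element that contains "Title" as well as the text node right after it. You can extend it to cover the other strong elements you want to exclude: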

//div/node()[not(self::strong and contains(text(), "Title") or self::text() and preceding-sibling::strong[1][contains(text(), "Title")])]
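A minimal way to check this from Python, assuming lxml is the parser in use (the question does not say which library it relies on). The sketch runs the expression with an explicit `self::text()` guard on the sibling test, so that only text nodes are compared against their nearest preceding strong; without that guard the "Name:" label, whose nearest preceding strong is the Title label, would be dropped as well. `node_text` is a hypothetical helper, not part of lxml.

```python
from lxml import html  # assumption: lxml is installed

SAMPLE = """<div>
    <strong>Title:</strong>
    Sub Editor at NEWS ABC

    <strong>Name:</strong>
    John

    <strong>Where:</strong>
    Everywhere

    <strong>When:</strong>
    Anytime

    <strong>Everything can go down there..</strong>

    Lorem Ipsum blah blah blah....
</div>"""

# "and" binds tighter than "or", so this reads as:
#   not( (strong containing "Title") or (text node whose nearest
#        preceding strong contains "Title") )
XPATH = (
    '//div/node()[not('
    'self::strong and contains(text(), "Title")'
    ' or self::text() and preceding-sibling::strong[1][contains(text(), "Title")]'
    ')]'
)


def node_text(node):
    """Return stripped text for a result item (an element or a text string)."""
    if isinstance(node, str):  # lxml returns text nodes as str subclasses
        return node.strip()
    return node.text_content().strip()


tree = html.fromstring(SAMPLE)
# Drop whitespace-only text nodes so only meaningful items remain.
kept = [t for t in (node_text(n) for n in tree.xpath(XPATH)) if t]
print(kept)
```

With the sample above, the "Title:" label and "Sub Editor at NEWS ABC" are absent from `kept`, while "Name:", "John" and everything after them remain.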

The strong node is skipped by:

not(self::strong and contains(text(), "Title"))

The text that follows it is skipped by:

preceding-sibling::strong[1][contains(text(), "Title")]

Note that a text node has to check its nearest preceding sibling (not its following sibling).
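To extend the same idea to Title, Where and When, the predicate can be generated from a list of labels. This is only a sketch under a few assumptions: lxml is the parser, `build_xpath` is a hypothetical helper, and the labels contain no double quotes (they are pasted into the expression unescaped). The `self::text()` guard keeps the sibling test limited to text nodes, so a strong element that directly follows an excluded label is not dropped by accident.

```python
from lxml import html  # assumption: lxml is installed

SAMPLE = """<div>
    <strong>Title:</strong>
    Sub Editor at NEWS ABC

    <strong>Name:</strong>
    John

    <strong>Where:</strong>
    Everywhere

    <strong>When:</strong>
    Anytime

    <strong>Everything can go down there..</strong>

    Lorem Ipsum blah blah blah....
</div>"""


def build_xpath(labels):
    """Build a //div/node() filter that drops each labelled <strong>
    and the text node whose nearest preceding <strong> is such a label."""
    # One contains() test per label, OR-ed together.
    match = " or ".join('contains(text(), "{}")'.format(label) for label in labels)
    return (
        "//div/node()[not("
        "self::strong and ({m})"
        " or self::text() and preceding-sibling::strong[1][{m}]"
        ")]"
    ).format(m=match)


tree = html.fromstring(SAMPLE)
nodes = tree.xpath(build_xpath(["Title", "Where", "When"]))

remaining = []
for node in nodes:
    # Text nodes come back as str subclasses, elements as HtmlElement.
    text = node.strip() if isinstance(node, str) else node.text_content().strip()
    if text:  # skip whitespace-only text nodes
        remaining.append(text)

print(remaining)
```

Note that `contains()` is case-sensitive and matches substrings, so a label such as "When" would also match a strong element whose text merely contains that word.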
