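My goal is to get the text: 27. The method according to claim 23 wherein...
How can I retrieve text that sits next to <? markers? From a Google search, I believe they are called PHP short tags.

I'm using lxml and XPath, but they don't seem to register these as tags or nodes. I tried itertext(), but that didn't work either.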

 <claim id="CLM-00027" num="00027">
            <claim-text>                <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.                <?insert-end id="REI-00005" ?></claim-text>
        </claim>

2 Answers


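Here is a piece of code that does this. It uses XPath to reach the deepest "real" element (claim-text), then reads the tail of that element's first child. The first child is the insert-start processing instruction, and the text that follows it is stored in its tail.

from lxml import etree

xml = """ <claim id="CLM-00027" num="00027">
            <claim-text>                <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.                <?insert-end id="REI-00005" ?></claim-text>
        </claim>"""

root = etree.fromstring(xml)
# XPath returns a list; take the single <claim-text> element
claim_text = root.xpath("/claim/claim-text")[0]
# The first child is the <?insert-start ...?> processing instruction;
# the text that follows it is stored in its .tail
res = claim_text[0].tail
print(res)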

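Output:

27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.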
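
As an alternative to reading .tail, XPath can also select the text nodes directly, because a processing instruction is a sibling of the text rather than a wrapper around it. A minimal sketch, assuming lxml is installed; the sample XML is shortened for illustration and the variable names are my own:

```python
from lxml import etree

# Shortened copy of the sample; the claim text is abbreviated for illustration
xml = (
    '<claim id="CLM-00027" num="00027"><claim-text>    '
    '<?insert-start id="REI-00005" date="20191203" ?>'
    '27. The method according to claim 23 wherein...    '
    '<?insert-end id="REI-00005" ?></claim-text></claim>'
)

root = etree.fromstring(xml)

# text() selects every text node directly under <claim-text>,
# including the whitespace before the first processing instruction
parts = root.xpath("/claim/claim-text/text()")
joined = "".join(parts).strip()

# normalize-space() concatenates and trims inside the XPath expression
normalized = root.xpath("normalize-space(/claim/claim-text)")

# The processing instruction itself can be selected by its target name
pi = root.xpath("/claim/claim-text/processing-instruction('insert-start')")[0]

print(joined)
print(normalized)
print(pi.target, pi.tail.strip())
```

Note that normalize-space() also collapses runs of internal whitespace, so the text() variant is probably the safer choice when the original spacing matters.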

Answered 2020-07-01T06:17:15.953

Access the specific child node by index.

from xml.etree import ElementTree as ET
tree = ET.parse('path_to_your.xml')

root = tree.getroot()

print(root[0].text)

Output:

        27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.                
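
The reason root[0].text works here appears to be that xml.etree.ElementTree drops processing instructions by default, so the text around them ends up in the parent's text. A small sketch to show the difference, assuming Python 3.8 or later for the insert_pis option; the XML is a shortened copy of the sample, passed as a string instead of a file, and the variable names are illustrative:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample; the claim text is abbreviated for illustration
xml = (
    '<claim id="CLM-00027" num="00027"><claim-text>    '
    '<?insert-start id="REI-00005" date="20191203" ?>'
    '27. The method according to claim 23 wherein...    '
    '<?insert-end id="REI-00005" ?></claim-text></claim>'
)

# Default parser: processing instructions are discarded,
# so the claim text is simply the text of <claim-text>
root = ET.fromstring(xml)
default_text = root[0].text.strip()

# With insert_pis=True the processing instructions are kept as child nodes,
# and the claim text becomes the tail of the first one (as in lxml)
parser = ET.XMLParser(target=ET.TreeBuilder(insert_pis=True))
root_with_pis = ET.fromstring(xml, parser=parser)
claim_text = root_with_pis[0]
pi_tail = claim_text[0].tail.strip()

print(default_text)
print(len(claim_text), pi_tail)
```

Calling strip() removes the leading and trailing whitespace that shows up in the output above.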
Answered 2020-07-01T06:21:02.050