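I'm trying to do this with lxml, but ultimately it is a question about the right XPath. I want to select everything from a <pgBreak> element up to the end of its parent element, which in this case is <p>.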

XML input:

  <root>
    <pgBreak pgId="1"/>
    <p>
      some text to fill out a para
      <pgBreak pgId="2"/>
      some more text
      <quote> A quoted block </quote>
      remainder of para
    </p>
  </root>

Desired XML output:

  <root>
    <pgBreak pgId="1"/>
    <p>
      some text to fill out a para
    </p>
    <pgBreak pgId="2"/>
    <p>
      some more text
      <quote> A quoted block </quote>
      remainder of para
    </p>
  </root>

1 Answer

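What you're trying to do is not trivial: you don't just want to match the 'pgBreak' element and all of its following siblings, you also want to move them out of the parent's scope and wrap the siblings in a new 'p' element. Fun stuff.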
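For the selection half of the problem, the XPath `following-sibling` axis covers "everything after the break up to the end of its parent". Below is a minimal sketch; the helper name `nodes_after_nested_breaks` and the compact `SAMPLE` string are made up for illustration and are not part of the question.

```python
import lxml.etree

# Compact version of the question's input (an assumption for illustration).
SAMPLE = (
    '<root><pgBreak pgId="1"/>'
    '<p>some text<pgBreak pgId="2"/>some more text'
    '<quote>A quoted block</quote>remainder of para</p></root>'
)


def nodes_after_nested_breaks(xml_text):
    """Map the pgId of each pgBreak nested in a <p> to the nodes after it."""
    root = lxml.etree.fromstring(xml_text)
    result = {}
    # '//p/pgBreak' skips breaks that are not direct children of a <p>.
    for pgbreak in root.xpath('//p/pgBreak'):
        # following-sibling::node() yields elements and text nodes (text
        # comes back as strings) up to the end of the parent element.
        result[pgbreak.get('pgId')] = pgbreak.xpath('following-sibling::node()')
    return result


if __name__ == '__main__':
    for pg_id, nodes in nodes_after_nested_breaks(SAMPLE).items():
        print(pg_id, [n if isinstance(n, str) else n.tag for n in nodes])
```

Selecting is only half of it, though: XPath cannot restructure the tree, so the nodes still have to be moved with the element API.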

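The following code should give you an idea of how to achieve this (disclaimer: it is only an example, it needs cleaning up, and edge cases are probably not handled). The code is deliberately left uncommented, so you'll have to figure it out :)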

I modified the input XML slightly to better illustrate the functionality.

import lxml.etree

text = """
<root>
  <pgBreak pgId="1"/>
  <p>
    some text to fill out a para
    <pgBreak pgId="2"/>
    some more text 
    <quote> A quoted block </quote>
    remainder of para
    <pgBreak pgId="3"/>
    <p>
       blurb
    </p>
  </p>
</root>
"""

root = lxml.etree.fromstring(text)
for pgbreak in root.xpath('//pgBreak'):
    inner = pgbreak.getparent()
    if inner is root:
        continue
    outer = inner.getparent()
    pgbreak_index = inner.index(pgbreak)
    inner_index = outer.index(inner) + 1
    siblings = inner[pgbreak_index+1:]
    inner.remove(pgbreak)
    outer.insert(inner_index,pgbreak)
    if not siblings or siblings[0].tag != 'p':
        p = lxml.etree.Element('p')
        p.text = pgbreak.tail
        pgbreak.tail = None
        for node in siblings:
            p.append(node)
        outer.insert(inner_index+1,p)
    else:
        for node in siblings:
            inner_index += 1
            outer.insert(inner_index,node)

print(lxml.etree.tostring(root, encoding='unicode'))

The output (with the indentation tidied up for readability) is:

<root>
  <pgBreak pgId="1"/>
  <p>
    some text to fill out a para
  </p>
  <pgBreak pgId="2"/>
  <p>
    some more text 
    <quote> A quoted block </quote>
    remainder of para
  </p>
  <pgBreak pgId="3"/>
  <p>
    blurb
  </p>
</root>
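If you want to reuse the idea, the loop can be wrapped in a function. The following is only a sketch: the name `split_at_page_breaks`, its parameters, and the way it treats a break with nothing after it are my assumptions, not a definitive implementation. It relies on lxml moving an element's tail text together with the element when the element is inserted elsewhere.

```python
import lxml.etree


def split_at_page_breaks(root, break_tag='pgBreak', para_tag='p'):
    """Hoist nested page breaks one level up, splitting the paragraph.

    Hypothetical helper: the name and the edge-case handling are
    assumptions made for this sketch.
    """
    for pgbreak in root.xpath('//' + break_tag):
        inner = pgbreak.getparent()
        if inner is None or inner is root:
            continue  # already at the top level, nothing to hoist
        outer = inner.getparent()
        # Collect the element siblings after the break before moving it.
        siblings = list(pgbreak.itersiblings())
        position = outer.index(inner) + 1
        # insert() moves the element (and its tail text) out of `inner`.
        outer.insert(position, pgbreak)
        tail = pgbreak.tail
        has_text = bool(tail and tail.strip())
        if siblings and siblings[0].tag == para_tag and not has_text:
            # What follows is already a paragraph: move the siblings as-is.
            for offset, node in enumerate(siblings, start=1):
                outer.insert(position + offset, node)
        else:
            # Wrap the trailing text and siblings in a new paragraph.
            new_para = lxml.etree.Element(para_tag)
            new_para.text = tail
            pgbreak.tail = None
            new_para.extend(siblings)
            if len(new_para) or has_text:
                outer.insert(position + 1, new_para)
    return root


if __name__ == '__main__':
    doc = lxml.etree.fromstring(
        '<root><p>a<pgBreak pgId="1"/>b<quote>q</quote>c</p></root>'
    )
    print(lxml.etree.tostring(split_at_page_breaks(doc), encoding='unicode'))
```

A break that is the last thing in its paragraph is simply moved up, and no empty 'p' is created for it.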
Answered 2013-03-20T23:30:10.017