I have been trying to merge the values of similar tags and get the output as a single tag, as shown below.

XML input:

<root>
    <data>
        <slide name="file.xml">
             <subtitle>Text1</subtitle> 
             <MainTitle>Text2</MainTitle> 
             <MainTitle>text3</MainTitle> 
         </slide>
        <slide name="file.xml">
             <Title>String1</Title> 
             <Title>String2</Title> 
             <Title>String3</Title> 
             <Title>String4</Title> 
             <Title>String5</Title> 
             <Title>String6</Title> 
             <Title>String7</Title> 
             <Title>String8</Title> 
         </slide>
     </data>
 </root>

Expected output:

<root>
    <data>
        <slide name="file.xml">
             <subtitle>Text1</subtitle> 
             <MainTitle>Text2</MainTitle> 
             <MainTitle>text3</MainTitle> 
         </slide>
        <slide name="file.xml">
             <Title>String</Title>
        </slide>
     </data>
 </root>

Any help would be greatly appreciated. Thank you!

1 Answer

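You need to group runs of identical tags recursively. Here is an implementation that accepts a function which decides how the texts are combined: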

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import itertools
import operator
import os.path

from lxml import etree


text = """
<root>
    <data>
        <slide name="file.xml">
             <subtitle>Text1</subtitle> 
             <MainTitle>Text2</MainTitle> 
             <MainTitle>text3</MainTitle> 
         </slide>
        <slide name="file.xml">
             <Title>String1</Title> 
             <Title>String2</Title> 
             <Title>String3</Title> 
             <Title>String4</Title> 
             <Title>String5</Title> 
             <Title>String6</Title> 
             <Title>String7</Title> 
             <Title>String8</Title> 
        </slide>
    </data>
</root>
"""


def combine_elements(elements, combine_text=', '.join):
    result = []
    for key, group in itertools.groupby(elements, operator.attrgetter('tag')):
        items = list(group)
        first_item = items[0]
        # combine only if the items have no children
        if len(items) > 1 and not len(first_item):
            combined = combine_text([el.text for el in items])
            # and only if combine_text returned something, e.g. the strings
            # share a common prefix
            if combined:
                first_item.text = combined
                result.append(first_item)
                continue
        result.extend(items)
    elements[:] = result
    # recursively combine others
    for element in elements:
        combine_elements(element, combine_text)


doc = etree.fromstring(text)
combine_elements(doc, os.path.commonprefix)
# tostring() returns bytes; decode so this prints cleanly on Python 2 and 3
print(etree.tostring(doc).decode('utf-8'))

With os.path.commonprefix() as the text combiner, you get the following result:

<root>
    <data>
        <slide name="file.xml">
             <subtitle>Text1</subtitle> 
             <MainTitle>Text2</MainTitle> 
             <MainTitle>text3</MainTitle> 
         </slide>
        <slide name="file.xml">
             <Title>String</Title> 
             </slide>
    </data>
</root>
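
The two MainTitle elements are left alone while the Title elements collapse because os.path.commonprefix() compares the strings character by character and is case-sensitive. A quick check with the values from the example, using only the standard library (the variable names here are just for illustration):

```python
import os.path

# The eight Title texts from the example: String1 .. String8
titles = ['String%d' % i for i in range(1, 9)]
# The two MainTitle texts; note the different case of the first letter
main_titles = ['Text2', 'text3']

title_prefix = os.path.commonprefix(titles)
main_prefix = os.path.commonprefix(main_titles)

print(repr(title_prefix))  # 'String'
print(repr(main_prefix))   # '' (falsy, so combine_elements() keeps both elements)
```

If you want case-insensitive matching, you would have to pass your own combiner instead of os.path.commonprefix.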

If you would rather join all the texts with a slash (/), for example, you can use the following:

doc = etree.fromstring(text)
combine_elements(doc, ' / '.join)

Result:

<root>
    <data>
        <slide name="file.xml">
             <subtitle>Text1</subtitle> 
             <MainTitle>Text2 / text3</MainTitle> 
             </slide>
        <slide name="file.xml">
             <Title>String1 / String2 / String3 / String4 / String5 / String6 / String7 / String8</Title> 
             </slide>
    </data>
</root>
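
One thing to keep in mind: itertools.groupby() only groups consecutive items with the same key, so combine_elements() merges runs of adjacent siblings, not every sibling that shares a tag. A small sketch with a hypothetical tag order (not taken from the XML above) shows the effect:

```python
import itertools

# Hypothetical order of child tags: a Note sits between the Title elements
tags = ['Title', 'Title', 'Note', 'Title']

# groupby() starts a new group every time the key changes
runs = [(tag, len(list(group))) for tag, group in itertools.groupby(tags)]

print(runs)  # [('Title', 2), ('Note', 1), ('Title', 1)]
```

In the example document all identical tags are adjacent, so this does not matter there. If your data interleaves tags and you still want a single merged element, you could sort the children by tag before grouping, at the cost of changing the document order.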
answered 2013-05-01T12:03:05.317