So, I realize this is an older answer with a high vote count and an accepted mark, but if you are reading LARGE FILES and find yourself in the same pickle I was in, I hope this helps you.
The real problem with this approach is the iteration. No matter how fast the parser is, doing anything, say... a few hundred thousand times is going to eat your execution time. That said, it came down to really thinking about the problem and understanding how namespaces work (or are "intended to work", because honestly they weren't needed here). Now, if your xml genuinely uses namespaces, meaning you see tags that look like this: <xs:table>, then you will need to adjust the approach here for your use case; a hedged sketch of that adjustment follows. I'll also include the full way of handling things.
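For that prefixed-tag case, here is a minimal sketch of the adjustment (untested against exotic documents, and it assumes prefixes only appear on tag names, not on attributes you need to keep):

import re

def strip_tag_prefixes(raw_xml):
    # '<xs:table>' -> '<table>', '</xs:table>' -> '</table>'
    # '<' is illegal inside text and attribute values, so this only hits tags
    return re.sub(r'(</?)\w+:', r'\1', raw_xml)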
DISCLAIMER: In good conscience, I cannot tell you to use regular expressions when parsing html/xml; go look at SergiyKolesnikov's answer, because it works. But I had an edge case, so, that being said... let's dive into some regex!
The problem: namespace stripping takes a long time... and most of the time the namespaces only live inside the very first tag, our "root". So, thinking about how python reads the information in, and given that our only problem child is the root node, why not use that to our advantage?
PLEASE NOTE: the file I'm using as my example is a raw, horrid, wildly nonsensical structure that holds the promise of data somewhere inside, kept around for the lulz.
my_file is the path to the file used in our example, which I cannot share with you for professional reasons; it has been shrunk down in size for the sake of getting through this answer.
import os, subprocess, re, io, json
from lxml import etree

# Placeholder path -- the real file can't be shared, so point this at your own
# if playing along at home
_biggest_file = "/path/to/your/big_file"
my_file = _biggest_file

meta_stuff = dict(
    exists = os.path.exists(_biggest_file),
    sizeof = os.path.getsize(_biggest_file),
    extension_is_a_real_thing = any(re.findall(r"\.(html|xml)$", my_file, re.I)),
    system_thinks_its_a = subprocess.check_output(
        ["file", "-i", _biggest_file]   # requires the unix `file` utility
    ).decode().split(":")[-1].strip()
)
print(json.dumps(meta_stuff, indent = 2))
So, for starters: the size is respectable, the system thinks it is, at best, html, and the file extension is neither xml nor html...
{
  "exists": true,
  "sizeof": 24442371,
  "extension_is_a_real_thing": false,
  "system_thinks_its_a": "text/html; charset=us-ascii"
}
The approach:
- In order to parse an xml file... it should at the very least be xml, so we need to check for a declaration tag and add one if it's missing
- If I have namespaces... that's bad, because I can't use xpaths, and that is exactly what I want to do
- If my file is huge, I should only be operating on the smallest imaginable portions that need cleaning before I'm ready to parse it (a toy illustration of the idea follows this list)
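To make the root-node idea concrete before the full function, here is a toy illustration (the tag name and uri are made up):

import re

toy = '<actual_name xmlns:xsi="http://example.com/nothing">...</actual_name>'
first_node = re.search(r'^(<.*?>)', toy).group()              # the whole opening tag
tag_name = re.search(r'(?<=^<)(.*?)\S+', first_node).group()  # 'actual_name'
print(toy.replace(first_node, '<{}>'.format(tag_name), 1))
# -> <actual_name>...</actual_name>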
The function
def speed_read(file_path):
    # We're gonna be low-brow and add our own declaration using this string. It's fine
    _xml_dec = '<?xml version="1.0" encoding="utf-8"?>'
    # Even worse.. rgx for xml, here we go
    #
    # We'll need to extract the very first node that we find in our document,
    # because for our purposes that's the one we know has the namespace uri's,
    # ie: "attributes"
    # FiRsT node : <actual_name xmlns:xsi="idontactuallydoanything.com">
    # We're going to pluck out that first node and get the tag's actual name,
    # which means from:
    #   <actual_name xmlns:xsi="idontactuallydoanything.com">...</actual_name>
    # we pluck:
    #   actual_name
    # Then we're gonna replace the entire tag with one we make from that name
    # by simple string substitution
    #
    # -> 'starting from the beginning, capture everything between the < and the >'
    _first_node = re.compile(r'^(<.*?>)', re.I | re.M | re.U)
    # -> 'starting from the beginning, but don't you get me the <, find anything
    #    that happens before the first white-space, which I don't want either, man'
    _first_tagname = re.compile(r'(?<=^<)(.*?)\S+', re.I | re.M | re.U)
    # open the file context
    with open(file_path, "r", encoding = "utf-8") as f:
        # go ahead and strip leading and trailing whitespace, cause why not...
        # plus it adds safety for our regexes
        _raw = f.read().strip()
    # Now, if the file somehow happens to magically have an xml declaration already,
    # we want to remove it, since we plan to add our own. startswith only looks at
    # the first few characters, so the check is cheap
    if _raw.startswith('<?xml'):
        _raw = re.sub(r'<\?xml.*?\?>\n?', '', _raw).strip()
    # Here we grab that first node that has those meaningless namespaces
    root_element = _first_node.search(_raw).group()
    # here we get its name
    first_tag = _first_tagname.search(root_element).group()
    # Here we substitute the entire element with a new one that only contains
    # the element's name. Plain str.replace (count=1) keeps regex metacharacters
    # inside the original tag from biting us
    _raw = _raw.replace(root_element, '<{}>'.format(first_tag), 1)
    # Now we add our declaration tag in the worst way you have ever
    # seen, but I miss sprintf, so this is how I'm rolling. Python is terrible btw
    _raw = "{}{}".format(_xml_dec, _raw)
    # Feeding lxml bytes keeps it from arguing about encodings; this has worked
    # for me consistently, so it stays
    return etree.parse(io.BytesIO(_raw.encode("utf-8")))
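And the payoff: with the namespaces gone, plain xpaths work the way I wanted them to (the element names below are hypothetical, since I can't share the file):

tree = speed_read(my_file)
print(tree.getroot().tag)        # just the name, no '{uri}' clutter
rows = tree.xpath('//table/row') # hypothetical path; adjust to your document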
# a good answer from above:
def safe_read(file_path):
    root = etree.parse(file_path)
    # iter() replaces the deprecated getiterator()
    for elem in root.iter():
        # comments and processing instructions have non-string tags; skip them
        if isinstance(elem.tag, str):
            elem.tag = etree.QName(elem).localname
    # Remove the now-unused namespace declarations
    etree.cleanup_namespaces(root)
    return root
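If you want to convince yourself the shortcut didn't eat anything, here is a quick sanity check comparing the two (just a smoke test, not proof of full equivalence):

fast_tree = speed_read(my_file)
safe_tree = safe_read(my_file)
# same root name and same total node count is a decent smoke test
assert fast_tree.getroot().tag == safe_tree.getroot().tag
assert sum(1 for _ in fast_tree.iter()) == sum(1 for _ in safe_tree.iter())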
Benchmarking (yes, I know there are better ways to do this):
import time
import pandas as pd

safe_times = []
for i in range(5):
    s = time.time()
    safe_read(_biggest_file)
    safe_times.append(time.time() - s)

fast_times = []
for i in range(5):
    s = time.time()
    speed_read(_biggest_file)
    fast_times.append(time.time() - s)

pd.DataFrame({"safe": safe_times, "fast": fast_times})
Results

   safe  fast
0  2.36  0.61
1  2.15  0.58
2  2.47  0.49
3  2.94  0.60
4  2.83  0.53