python - lxml, xi:include, and original files

Question

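I am using lxml to parse a file that contains xi:include elements, and I resolve the includes with xinclude().

Given an element, is there a way to identify the file and source line where that element originally appeared?

For example: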

from lxml import etree
doc = etree.parse('file.xml')
doc.xinclude()
xpath_expression = ...
elt = doc.xpath(xpath_expression)
# Print file name and source line of `elt` location

score 0 · Accepted Answer

XInclude expansion adds an xml:base attribute to the top-level expanded element, and elt.base and elt.sourceline are updated for its child nodes as well, so:

print(elt.base, elt.sourceline)

will give you what you want.

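If elt is not part of an XInclude expansion, elt.base points to the base document ('file.xml') and elt.sourceline is the line number within that file. (Note that when an element's tag spans several lines, sourceline generally seems to point to the line where the tag ends rather than the line where it starts, much as validation error messages usually point to the end of the tag in which the error occurred.)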
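To check the sourceline behaviour described above on your own lxml build, a minimal sketch along these lines may help. The sample XML is hypothetical and is parsed from a string, so base is expected to be empty here; only the line number is of interest.

```python
from lxml import etree

# Hypothetical sample: the <item> start tag spans lines 2-4 of the input.
xml = b"""<root>
  <item
      a="1"
      b="2">text</item>
</root>"""

root = etree.fromstring(xml)
item = root[0]

# Compare the printed line number with the lines the start tag occupies.
# There is no file behind this document, so base carries no file name.
print(item.tag, item.base, item.sourceline)
```

Whether the reported line is the first or the last line of the start tag may depend on the underlying libxml2 version, so it is worth running this rather than relying on a fixed rule.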

You can find the top-level xincluded elements and inspect them with:

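# Select every element that carries an explicit xml:base attribute.
xels = doc.xpath('//*[@xml:base]')
for x in xels:
    print(x.tag, x.base, x.sourceline)
    # Iterate over the direct children (getchildren() is deprecated).
    for c in x:
        print(c.tag, c.base, c.sourceline)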

score 0 · Accepted Answer

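Unfortunately, current versions of lxml no longer behave this way. However, I developed a workaround using a simple custom loader. Below is a test script that demonstrates the problem with the approach above, along with the workaround. Note that this method only sets the xml:base attribute on the root tag of the included document.

Output of the program (using Python 3.9.1, lxml 4.6.3):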

Included file was source.xml; xinclude reports it as document.xml
Included file was source.xml; workaround reports it as source.xml

Here is the sample program.

# Includes
# ========
from pathlib import Path
from textwrap import dedent
from lxml import etree as ElementTree
from lxml import ElementInclude


# Setup
# =====
# Create a sample document, taken from the `Python stdlib 
# <https://docs.python.org/3/library/xml.etree.elementtree.html#id3>`_...
Path("document.xml").write_text(
    dedent(
        """\
        <?xml version="1.0"?>
        <document xmlns:xi="http://www.w3.org/2001/XInclude">
            <xi:include href="source.xml" parse="xml" />
        </document>
        """
    )
)

# ...and the associated include file.
Path("source.xml").write_text("<para>This is a paragraph.</para>")


# Failing xinclude case
# =====================
# Load and xinclude this.
tree = ElementTree.parse("document.xml")
tree.xinclude()

# Show that the ``base`` attribute refers to the top-level 
# ``document.xml``, instead of the xincluded ``source.xml``.
root = tree.getroot()
print(f"Included file was source.xml; xinclude reports it as {root[0].base}")


# Workaround
# ==========
# As a workaround, define a loader which sets the ``xml:base`` of an
# xincluded element. While lxml evidently used to do this, a change
# eliminated this ability per some `discussion 
# <https://mail.gnome.org/archives/xml/2014-April/msg00015.html>`_, 
# which included a rejected patch fixing this problem. `Current source 
# <https://github.com/GNOME/libxml2/blob/master/xinclude.c#L1689>`_ 
# lacks this patch.
def my_loader(href, parse, encoding=None, parser=None):
    ret = ElementInclude._lxml_default_loader(href, parse, encoding, parser)
    ret.attrib["{http://www.w3.org/XML/1998/namespace}base"] = href
    return ret


new_tree = ElementTree.parse("document.xml")
ElementInclude.include(new_tree, loader=my_loader)

new_root = new_tree.getroot()
print(f"Included file was source.xml; workaround reports it as {new_root[0].base}")
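Because the workaround only tags the root of each included document, descendants of that root need to look upward to find their file. The following is a rough sketch of one way to do that, not part of the original answer: origin_of is a hypothetical helper that walks from an element to the nearest ancestor carrying an explicit xml:base attribute, and falls back to the URL of the containing document. It assumes that sourceline survives the copy made during inclusion and therefore refers to a line in the included file.

```python
from lxml import etree

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def origin_of(elt):
    """Return (file, line) for elt (hypothetical helper, not an lxml API).

    The file is taken from the nearest explicit xml:base attribute on elt
    or one of its ancestors; if none exists, the document URL is used.
    """
    node = elt
    while node is not None:
        base = node.get(XML_BASE)
        if base is not None:
            return base, elt.sourceline
        node = node.getparent()
    # No xml:base found: the element belongs to the top-level document.
    return elt.getroottree().docinfo.URL, elt.sourceline


# In-memory stand-in for a tree whose <para> was tagged by a custom loader.
demo = etree.fromstring(
    '<document><para xml:base="source.xml"><b>x</b></para><other/></document>'
)
included_file, included_line = origin_of(demo[0][0])
toplevel_file, toplevel_line = origin_of(demo[1])
print(included_file, included_line)
print(toplevel_file, toplevel_line)
```

With the sample program above, calling origin_of(new_root[0]) after ElementInclude.include(...) would be the natural place to use it. Reading the attribute directly, rather than elt.base, avoids any URI resolution against the parent document.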
