How do I resolve external unparsed entities while parsing with lxml?

Here is my code sample:

import io

from lxml import etree

content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""

parser = etree.XMLParser(dtd_validation=True, resolve_entities=True)
doc = etree.parse(io.BytesIO(content), parser=parser)
print(etree.tostring(doc))

Note: I am using lxml >= 3.4.

Currently I get the following result:

<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg" >
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>

Here, the ref1 entity is not resolved to "python-logo-small.jpg". I expected <sample src="python-logo-small.jpg"/>. What is wrong?

I also tried:

parser = etree.XMLParser(dtd_validation=True, resolve_entities=True, load_dtd=True)

But I get the same result.

Alternatively, I would like to resolve the entities myself. To do that, I tried listing the entities this way:

for entity in doc.docinfo.internalDTD.iterentities():
    msg_fmt = "{entity.name!r}, {entity.content!r}, {entity.orig!r}"
    print(msg_fmt.format(entity=entity))

But I only get the names of the entity and of its notation, not the entity's definition:

'ref1', 'jpeg', None

How can I access the entity's definition?

2 Answers

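The XML document with the unparsed entity looks fine. However, unparsed entities are not resolved the way you expect: the parser never substitutes their system identifier into an attribute value. If you want to see <sample src="python-logo-small.jpg"/> in the parsed output, use an internal (parsed) entity instead.

Example: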

import io
from lxml import etree

content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!ENTITY ref1 "python-logo-small.jpg">
<!ELEMENT sample EMPTY>
<!ATTLIST sample src CDATA #REQUIRED>
]>
<sample src="&ref1;"/>
"""

parser = etree.XMLParser(dtd_validation=True, resolve_entities=True)
doc = etree.parse(io.BytesIO(content), parser=parser)
print(etree.tostring(doc))

Output:

<!DOCTYPE sample [
<!ENTITY ref1 "python-logo-small.jpg">
<!ELEMENT sample EMPTY>
<!ATTLIST sample src CDATA #REQUIRED>
]>
<sample src="python-logo-small.jpg"/>

Notes:

  • The ref1 entity is declared as an internal entity
  • The entity is referenced with &ref1;
  • The src attribute is declared with type CDATA
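
As a quick check that this expansion is ordinary XML parser behavior, here is a minimal sketch that parses the same document with the standard library's xml.etree.ElementTree. Using ElementTree here is an assumption made only to keep the sketch free of third-party dependencies; the example above uses lxml. The attribute value is already expanded by the time the tree is built:

```python
import xml.etree.ElementTree as ET

# Same document as in the example above: ref1 is an internal (parsed) entity.
content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!ENTITY ref1 "python-logo-small.jpg">
<!ELEMENT sample EMPTY>
<!ATTLIST sample src CDATA #REQUIRED>
]>
<sample src="&ref1;"/>
"""

root = ET.fromstring(content)

# The parser has replaced &ref1; with the entity's replacement text.
src = root.get("src")
print(src)
```

Note that ElementTree does not validate against the DTD, so this only illustrates the entity expansion, not the dtd_validation step.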

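You can use the XSLT unparsed-entity-uri function to get the value (URI) of an unparsed entity. To see it in action, add the following lines to the code sample in the question: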

xsl = etree.XML('''\
<xsl:stylesheet version="1.0" 
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output encoding="utf-8" omit-xml-declaration="yes"/>
 <xsl:template match="sample">
   <xsl:value-of select="unparsed-entity-uri(@src)"/>
 </xsl:template>
</xsl:stylesheet>
''')

transform = etree.XSLT(xsl)
result = transform(doc)
print(result)

Output:

python-logo-small.jpg
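
If the goal is the document the question expected, with the entity name in src replaced by its URI, the same function can be combined with an identity transform. The following is only a sketch under that assumption: the stylesheet is illustrative, and the match pattern sample/@src is hardcoded for this particular DTD rather than derived from it.

```python
import io

from lxml import etree

content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""

parser = etree.XMLParser(dtd_validation=True, resolve_entities=True)
doc = etree.parse(io.BytesIO(content), parser=parser)

stylesheet = etree.XML(b"""\
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output encoding="utf-8" omit-xml-declaration="yes"/>
 <!-- Identity template: copy every node and attribute unchanged. -->
 <xsl:template match="@*|node()">
   <xsl:copy>
     <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
 </xsl:template>
 <!-- Override for the ENTITY-typed attribute: write the entity's URI. -->
 <xsl:template match="sample/@src">
   <xsl:attribute name="src">
     <xsl:value-of select="unparsed-entity-uri(.)"/>
   </xsl:attribute>
 </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)
result = transform(doc)
print(str(result))
```

The result tree produced by XSLT does not carry the original DOCTYPE, so the output contains only the rewritten sample element.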
Answered 2015-09-24T12:00:23.193

OK, so it is not possible to "resolve" external unparsed entities, but we can list them:

import io

import xml.sax

content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""


class MyDTDHandler(xml.sax.handler.DTDHandler):
    # Called once for each <!ENTITY ... NDATA ...> declaration in the DTD.
    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        print(dict(name=name, publicId=publicId, systemId=systemId, ndata=ndata))


parser = xml.sax.make_parser()
parser.setDTDHandler(MyDTDHandler())
parser.parse(io.BytesIO(content))

The result is:

{'systemId': u'python-logo-small.jpg', 'ndata': u'jpeg', 'publicId': None, 'name': u'ref1'}

That gets the job done.
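
Building on the listing above, the declarations can be stored instead of printed and then used to look up attribute values. This is a minimal sketch, not a general solution: the class name UnparsedEntityCollector and the ENTITY_ATTRS set are hypothetical, and the set hardcodes which attributes have type ENTITY because the SAX content handler does not report that information here.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, DTDHandler

content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""

# Assumption: (element, attribute) pairs known from the DTD to be of type ENTITY.
ENTITY_ATTRS = {("sample", "src")}


class UnparsedEntityCollector(ContentHandler, DTDHandler):
    """Hypothetical helper: records unparsed entities and resolves references."""

    def __init__(self):
        super().__init__()
        self.entities = {}  # entity name -> system identifier
        self.resolved = []  # (element, attribute, system identifier)

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        # The DTD is read before the root element, so this runs first.
        self.entities[name] = systemId

    def startElement(self, name, attrs):
        for attr_name in attrs.getNames():
            if (name, attr_name) in ENTITY_ATTRS:
                entity_name = attrs.getValue(attr_name)
                # None is stored if the entity was never declared.
                system_id = self.entities.get(entity_name)
                self.resolved.append((name, attr_name, system_id))


handler = UnparsedEntityCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.setDTDHandler(handler)
parser.parse(io.BytesIO(content))

print(handler.entities)
print(handler.resolved)
```

Using one object for both handler roles keeps the entity table and the lookup in the same place; two separate handlers sharing a dictionary would work just as well.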

Answered 2015-09-24T12:59:24.180