
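I'm trying to parse a huge XML file with lxml in a memory-efficient manner (i.e. streaming lazily from disk instead of loading the whole file into memory). Unfortunately, the file contains some bad ASCII characters that break the default parser. The parser works if I set recover=True, but the iterparse method doesn't take the recover parameter or a custom parser object. Does anyone know how to use iterparse to parse broken XML?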

#this works, but loads the whole file into memory
parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.
tree = lxml.etree.parse(filename, parser)

#how do I do the equivalent with iterparse?  (using iterparse so the file can be streamed lazily from disk)
context = lxml.etree.iterparse(filename, tag='RECORD')
#record contains 6 elements that I need to extract the text from

Thanks for your help!

Edit: here is an example of the type of encoding error I'm running into:

In [17]: data
Out[17]: '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

In [18]: lxml.etree.fromstring(data)
---------------------------------------------------------------------------
XMLSyntaxError                            Traceback (most recent call last)

/mnt/articles/<ipython console> in <module>()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)()

XMLSyntaxError: PCDATA invalid Char value 30, line 1, column 1190

In [19]: chardet.detect(data)
Out[19]: {'confidence': 1.0, 'encoding': 'ascii'}

As you can see, chardet thinks that it is an ASCII file, but there is a "\x1e" right in the middle of this example, which makes lxml raise an exception.

3 Answers

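Edit:

This is an older answer and I would do it differently today. And I'm not just referring to the dumb snark. BeautifulSoup4 has become available since then, and it's really quite nice. I recommend it to anyone who stumbles over here.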


The currently accepted answer is, well, not what one should do. The question itself also makes a bad assumption:

parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.

Actually, recover=True is for recovering from malformed XML. There is, however, an "encoding" option which would have fixed your issue.

parser = lxml.etree.XMLParser(encoding='utf-8', #Your encoding issue.
                              recover=True, #I assume you probably still want to recover from bad xml, it's quite nice. If not, remove.
                              )

That's it, that's the solution.


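BTW: this is for anyone struggling with parsing XML in Python, especially XML from third-party sources. I know, I know, the documentation is bad and there are a lot of red herrings and a lot of bad advice.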

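  • lxml.etree.fromstring()? - That's for perfectly formed XML, silly
  • BeautifulStoneSoup? - Slow, and has a really stupid policy for self-closing tags
  • lxml.etree.HTMLParser()? - (because the XML is broken) Here's a secret - HTMLParser() is... a parser with recover=True
  • lxml.html.soupparser? - The encoding detection is supposed to be better, but it has the same failings as BeautifulSoup for self-closing tags. Perhaps you can combine XMLParser with BeautifulSoup's UnicodeDammit
  • UnicodeDammit and other cockamamie stuff to fix encodings? - Well, UnicodeDammit is kind of cute, I like the name and it's useful for stuff beyond XML, but things are usually fixed if you do the right thing with XMLParser()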

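You could try all kinds of things from what's available online. The lxml documentation could be better. The code above is what you need for 90% of your XML parsing cases. Here it is again: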

from lxml import etree

magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(your_xml_string), magical_parser) #or pass in an open file object

You're welcome. My headaches == your sanity. Plus it has other features you might need for, you know, XML.

answered 2012-01-29T02:36:19.560

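I solved the problem by creating a class with a file-like object interface. The class's read() method reads a line from the file and replaces any "bad characters" before returning the line to iterparse.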

# pseudo code

class myFile(object):
    def __init__(self, filename):
        # binary mode, so the parser receives bytes
        self.f = open(filename, 'rb')

    def read(self, size=None):
        # one cleaned line per call; b'' at end of file tells the parser to stop
        return next(self.f, b'').replace(b'\x1e', b'').replace(b'some other bad character...', b'')


# iterparse
context = lxml.etree.iterparse(myFile('bigfile.xml'), tag='RECORD')

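I had to edit the myFile class a few times, adding more replace() calls for a few other characters that were making lxml choke. I think lxml's SAX parsing would have worked as well (it seems to support the recover option), but this solution worked like a charm!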
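A variation that avoids adding replace() calls one at a time: the sketch below deletes every C0 control byte that XML 1.0 forbids in a single bytes.translate() call. CleanedFile and the sample data are made-up names for illustration, and the standard library's xml.etree.ElementTree.iterparse stands in for lxml here so the snippet runs without extra packages. It assumes an ASCII or UTF-8 file; in UTF-16 these byte values can be part of ordinary characters.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree in this sketch

# Bytes that XML 1.0 forbids: C0 controls other than tab, LF and CR.
_BAD_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


class CleanedFile(object):
    """Hypothetical file-like wrapper that drops forbidden control bytes."""

    def __init__(self, fileobj):
        self._f = fileobj  # must be opened in binary mode

    def read(self, size=-1):
        while True:
            raw = self._f.read(size)
            # translate(None, delete) removes every byte listed in `delete`.
            cleaned = raw.translate(None, _BAD_BYTES)
            # Return at real EOF, or as soon as something survives cleaning,
            # so a chunk made only of bad bytes is not mistaken for EOF.
            if cleaned or not raw:
                return cleaned


sample = b"<root><RECORD>t2\x1et3</RECORD><RECORD>ok</RECORD></root>"
records = []
for event, elem in ET.iterparse(CleanedFile(io.BytesIO(sample)), events=("end",)):
    if elem.tag == "RECORD":
        records.append(elem.text)
        elem.clear()  # release the element once its text has been read
print(records)
```

For a real file, wrap open('bigfile.xml', 'rb') instead of the BytesIO object; the same wrapper can be handed to lxml.etree.iterparse, which also accepts file-like objects.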

answered 2010-03-05T20:33:35.107

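Edit your question, stating what happens (exact error message and traceback (copy/paste, don't type from memory)) to make you think that "bad unicode" is the problem.

Get chardet and feed it your MySQL dump. Tell us what it says.

Show us the first 200 to 300 bytes of your dump, using e.g. print repr(dump[:300])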

Update You wrote """As you can see, chardet thinks that it is an ascii file, but there is a "\x1e" right in the middle of this example which is making lxml raise an exception."""

I see no "bad unicode" here.

chardet is correct. What makes you think that "\x1e" is not ASCII? It is an ASCII character, a C0 control character named "RECORD SEPARATOR".

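The error message says that you have an invalid character. That is also correct. The only control characters that are valid in XML are "\t", "\r" and "\n". MySQL should be grumbling about that and/or offering you a way of escaping it, e.g. _x001e_ (yuk!)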

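Given the context, it looks like that character could be deleted with no loss. You may wish to fix your database, or you may wish to remove such characters from your dump (after checking that they are all vanishable), or you may wish to choose a less picky and less voluminous output format than XML.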
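To check what would be removed before deleting anything, something along these lines can list every character that falls outside the Char production of the XML 1.0 specification, together with a little surrounding text. The function name, the context width and the sample string are illustrative choices, not part of any library.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text, context=10):
    """Return (offset, repr of the character, surrounding text) per offender."""
    hits = []
    for match in INVALID_XML_CHAR.finditer(text):
        start = max(match.start() - context, 0)
        end = match.end() + context
        hits.append((match.start(), repr(match.group()), text[start:end]))
    return hits


sample = "<p>t2\x1et3</p>\t\r\n"
hits = find_invalid_xml_chars(sample)
print(hits)  # one hit, for the \x1e; tab, CR and LF are left alone

# Once the hits look harmless, the same pattern can delete them.
cleaned = INVALID_XML_CHAR.sub("", sample)
```

Run over a large dump, this would be applied line by line or chunk by chunk rather than to the whole file at once.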

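Update 2 You presumably want to use iterparse() not because it is your ultimate goal but because you want to save memory. If you used a format like CSV, you wouldn't have a memory problem.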
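As a rough illustration of the CSV route, the standard csv module hands back one row at a time, so memory use stays flat no matter how large the dump is. The two-column layout (id, articletext) and the function name below are assumptions made up for this sketch; the real dump would have its own columns.

```python
import csv
import io


def iter_article_texts(fileobj):
    """Yield the articletext column one row at a time (lazy, constant memory)."""
    for row in csv.DictReader(fileobj):
        yield row["articletext"]


# Hypothetical dump: a header row plus two records. newline="" leaves line
# endings alone so the csv module can handle newlines inside quoted fields.
dump = io.StringIO(
    'id,articletext\r\n'
    '1,"t2\x1et3"\r\n'
    '2,"line one\nline two"\r\n',
    newline="",
)

texts = list(iter_article_texts(dump))  # list() only to show the result
print(texts)
```

Note that the "\x1e" passes through untouched here; CSV does not care about it. For a real file, open it with open(path, newline="") and loop over iter_article_texts() directly instead of building a list.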

Update 3 In response to a comment by @Purrell:

Try it yourself, pal. paste.org/3280965

Here are the contents of that paste; it deserves preservation:

from lxml.etree import etree

data = '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(data), magical_parser)

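To get it to run, one needs to fix one import and supply another. The data is monstrous. There is no output to show the result. Here is a replacement with the data cut down to the bare essentials. The 5 pieces of ASCII text (excluding &lt; and &gt;) that consist entirely of valid XML characters are replaced by t1, ..., t5. The offending \x1e is flanked by t2 and t3.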

[output wraps at column 80]
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> from cStringIO import StringIO
>>> data = '<article>&lt;p&gt;t1&lt;/p&gt;&lt;p&gt;t2\x1et3&lt;/p&gt;&lt;p&gt;t4
&lt;/p&gt;&lt;p&gt;t5&lt;/p&gt;</article>'
>>> magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
>>> tree = etree.parse(StringIO(data), magical_parser)
>>> print(repr(tree.getroot().text))
'<p>t1</p><p>t2t3/ppt4/ppt5/p'

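Not what I'd call "recovery"; after the bad character, the < and > characters disappear.

The paste was in response to my question "What gives you the idea that encoding='utf-8' will solve his problem?". This was triggered by the statement 'There is however an "encoding" option which would have fixed your issue.' But encoding=ascii produces the same output. So does omitting the encoding argument. It is NOT a problem of encoding. Case closed.

answered 2010-02-28T22:50:13.367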