I'm currently having trouble importing a large XML file, and I can't work out why. We receive an XML export of about 443MB from a partner. The errors I get are:

PHP Warning:  SimpleXMLElement::__construct(): Entity: line 1: parser error : internal error in /home/imports/catalog.php on line 54

Warning: SimpleXMLElement::__construct(): Entity: line 1: parser error : internal error in /home/imports/catalog.php on line 54
PHP Warning:  SimpleXMLElement::__construct(): ch to marriage, parenting, entrepreneurship, etc will be significantly upgraded. in /home/imports/catalog.php on line 54

Warning: SimpleXMLElement::__construct(): ch to marriage, parenting, entrepreneurship, etc will be significantly upgraded. in /home/imports/catalog.php on line 54
PHP Warning:  SimpleXMLElement::__construct():
 ^ in /home/imports/catalog.php on line 54

Warning: SimpleXMLElement::__construct():
 ^ in /home/imports/catalog.php on line 54
PHP Fatal error:  Uncaught exception 'Exception' with message 'String could not be parsed as XML' in /home/imports/catalog.php:54
Stack trace:
#0 /home/imports/catalog.php(54): SimpleXMLElement->__construct('<?xml version="...')
#1 {main}
  thrown in /home/imports/catalog.php on line 54

Fatal error: Uncaught exception 'Exception' with message 'String could not be parsed as XML' in /home/imports/catalog.php:54
Stack trace:
#0 /home/imports/catalog.php(54): SimpleXMLElement->__construct('<?xml version="...')
#1 {main}
  thrown in /home/imports/catalog.php on line 54

Line 54 of the code is simply:

$xml = new SimpleXMLElement(file_get_contents($_CFG_XML_URL));

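As far as I can tell, the error occurs in the item containing "ch to marriage, parenting, entrepreneurship, etc will be significantly upgraded." Unfortunately, that is a long way into the file, and the file's size makes it hard to read the contents. My large-file reader reads one line at a time, and this XML is all on a single line, so it can't cope gracefully, even on a workstation with 32GB of RAM and a 64-bit editor.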

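I have re-downloaded the file several times, but the problem is always the same. I have doubled the memory available to the script, but it still fails at the same place.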

So I contacted the partner and asked for the XML of this specific item, and they provided the following:

<EBook EAN="9792219192201">
    <Title>Success-a-Phobia</Title>
    <SubTitle>Discovering And Conquering Mankinds Most Persuasive, but Unknown, Phobia</SubTitle>
    <Publisher>The Benjamin Consulting Group, LLC</Publisher>
    <PublicationDate>29/09/2012</PublicationDate>
    <Contributors>
        <Contributor Code="A01" Text="By (author)">Benjamin, Marcus D.</Contributor>
    </Contributors>
    <Formats>
        <Format Type="6"/>
    </Formats>
    <ShortDescription>People today still desire to be successful in matters of family, finance or business even though we are in the midst of major social, political and economic challenges. Have you every been at that moment where you wanted to do something significant, yet you were paralyzed from making the necessary choices to realize your dream? Have you experienced failure and are now sitting in the stands, paralyzed from getting back in the &amp;quote;game of life?&amp;quote;  Are you at the verge of a major decision that could affect your life for many years? If you are in this category, this is your book of the year!    With humor, real-life antidotes, real-life examples and solid narration, Marcus Benjamin will guide you toward discovering the most pervasive, yet unknown, phobia in the history of mankind.  Once this phobia is discovered, the second half of the book shows you how to rid yourself of this phobia for good. Not only will this book impact your life, but your approach to marriage, parenting, entrepreneurship, etc will be significantly upgraded.</ShortDescription>
</EBook>

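Nothing in this XML rings alarm bells for me, yet PHP clearly trips up partway through it. The element content appears to be 978 characters long, but that number doesn't ring any particular bell either.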

The PHP script is run from the command line on an Amazon EC2 instance. The operating system is Amazon Linux (RHEL).

So, basically, I'm stuck. Does anyone know what might be causing this?

2 Answers

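978 may not ring any bells, but 1000 might! The 4 spaces at the start of the line plus the 18 characters of '<ShortDescription>' supply the missing 22 characters (978 + 4 + 18 = 1000). A round number like 1000 makes some kind of buffer length limit more likely.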

answered 2012-12-21T22:36:43.240

Try validating the XML with xmllint. It is available as a command-line tool on Linux.
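If xmllint is not at hand, a rough counterpart can be sketched in PHP itself. This swaps in XMLReader (a streaming reader that ships with PHP) for xmllint and only checks well-formedness, not validity against a schema. The helper name collectXmlErrors() and the use of a local file path are assumptions for illustration, and the sketch assumes PHP 7 or later.

```php
<?php
// Stream through an XML file and return any libxml errors encountered.
// collectXmlErrors() is a hypothetical helper name, not a PHP built-in.
function collectXmlErrors(string $path): array
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $messages = [];
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        $messages[] = "could not open $path";
    } else {
        // Walk every node; read() returns false at the end or on an error.
        while ($reader->read()) {
            // Nothing to do: we only want the parser to visit each node.
        }
        foreach (libxml_get_errors() as $error) {
            $messages[] = sprintf(
                'line %d, column %d: %s',
                $error->line,
                $error->column,
                trim($error->message)
            );
        }
        $reader->close();
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}
```

Because the whole file is on one line, the column number in each message is the useful part for locating the problem.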

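If the file is valid, double-check your memory_limit ini setting. Keep in mind that DOM-style processing (which includes SimpleXML) needs to hold the whole file in memory. In your case, memory_limit should be set to at least 500MB.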
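A quick way to see what limit the script actually runs with is to read it back at runtime. The following is a minimal sketch, assuming PHP 7 or later; iniSizeToBytes() is a hypothetical helper, and the 500MB threshold is simply the figure suggested above.

```php
<?php
// Convert a php.ini shorthand size such as "512M" or "1G" to bytes.
// iniSizeToBytes() is a hypothetical helper, not part of PHP itself.
function iniSizeToBytes(string $value): int
{
    $value = trim($value);
    if ($value === '-1') {
        return -1; // -1 means "no limit"
    }
    $number = (int) $value; // the cast keeps the leading digits only
    switch (strtoupper(substr($value, -1))) {
        case 'G':
            return $number * 1024 * 1024 * 1024;
        case 'M':
            return $number * 1024 * 1024;
        case 'K':
            return $number * 1024;
        default:
            return $number; // plain byte count
    }
}

$limit  = iniSizeToBytes((string) ini_get('memory_limit'));
$needed = 500 * 1024 * 1024; // threshold suggested in this answer

if ($limit !== -1 && $limit < $needed) {
    echo "memory_limit is below 500MB\n";
}
```

The CLI often loads a different php.ini from the web server, so it is worth running the check from the same command line as the import.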

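If you can't raise the memory limit, you will have to consider a less memory-hungry way to parse the XML. SAX may suit this case, although it takes more programming care.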

In PHP, SAX is available through the xml extension, which is enabled by default. The documentation is in the PHP manual.
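As a rough illustration of the SAX approach, the sketch below feeds the file to the xml extension's parser in 8KB chunks, so only a small buffer is held in memory at any time. It merely counts <EBook> elements (the element name comes from the sample in the question). countEbooks() is a hypothetical helper, a real import would do its work inside the handlers, and the sketch assumes PHP 7 or later.

```php
<?php
// Stream-parse XML from an open file handle and count <EBook> elements.
// countEbooks() is a hypothetical helper name used for illustration.
function countEbooks($handle): int
{
    $count  = 0;
    $parser = xml_parser_create();

    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Called for every opening tag.
        function ($parser, $name, $attrs) use (&$count) {
            if ($name === 'EBook') {
                $count++;
            }
        },
        // Called for every closing tag; nothing to do in this sketch.
        function ($parser, $name) {
        }
    );

    while (!feof($handle)) {
        $chunk = fread($handle, 8192);
        if ($chunk === false) {
            break;
        }
        // The third argument tells the parser whether this is the last chunk.
        if (!xml_parse($parser, $chunk, feof($handle))) {
            throw new RuntimeException(sprintf(
                'XML error "%s" at byte %d',
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_byte_index($parser)
            ));
        }
    }

    return $count;
}
```

Usage would be along the lines of countEbooks(fopen($path, 'rb')), where $path is a local copy of the export. If the parser fails, the reported byte offset gives a starting point for inspecting a file that has no line breaks.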

answered 2012-12-20T20:12:31.860