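I am trying to parse the DMOZ content/structure XML files into MySQL, but all the existing scripts for doing this are very old and do not work well. How can I open a large (1 GB+) XML file in PHP for parsing?
11 Answers
Only two PHP APIs are really suited to processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read a continuous stream rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
As an example, you might want to look at this partial parser for the DMOZ catalog: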
<?php

class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;

    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;

        $this->_parser = xml_parser_create("UTF-8");
        // Callable arrays bind the handlers to this object without needing xml_set_object().
        xml_set_element_handler($this->_parser, array($this, "startTag"), array($this, "endTag"));
    }

    public function startTag($parser, $name, $attribs)
    {
        array_push($this->_stack, $this->_current);

        // expat upper-cases element and attribute names by default (case folding).
        if ($name == "TOPIC" && isset($attribs["R:ID"])) {
            $this->_currentId = $attribs["R:ID"];
        }

        if ($name == "LINK" && isset($attribs["R:RESOURCE"]) && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }

        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Cannot open {$this->_file}\n");
        }

        while (!feof($fh)) {
            $data = fread($fh, 4096);
            // Stop on the first well-formedness error instead of silently continuing.
            if (!xml_parse($this->_parser, $data, feof($fh))) {
                die(xml_error_string(xml_get_error_code($this->_parser)) . "\n");
            }
        }
        fclose($fh);
    }
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
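For comparison, here is roughly what the same filter could look like with XMLReader, the second streaming API mentioned in this answer. This is a minimal sketch, not part of the original parser: the function name `dmozLinksUnder` is made up, and it assumes the dump uses `Topic` elements with an `r:id` attribute and `link` elements with an `r:resource` attribute (XMLReader keeps names as written, unlike expat's case folding).

```php
<?php
// Hypothetical helper: yields the r:resource of every <link> that belongs to
// a <Topic> whose r:id starts with $prefix. Element and attribute names are
// assumptions about the DMOZ dump layout.
function dmozLinksUnder(string $file, string $prefix): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Cannot open $file");
    }

    $currentId = '';
    while ($reader->read()) {
        // Only opening (or self-closing) element nodes are of interest.
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        if ($reader->localName === 'Topic') {
            // Remember which topic the following links belong to.
            $currentId = (string) $reader->getAttribute('r:id');
        } elseif ($reader->localName === 'link' && strpos($currentId, $prefix) === 0) {
            yield (string) $reader->getAttribute('r:resource');
        }
    }
    $reader->close();
}
```

Because it is a generator, you can loop over `dmozLinksUnder("content.rdf.u8", "Top/Home/")` with `foreach` and only one node is held in memory at a time.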
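This question is very similar to "Best way to process large XML in PHP", but it has a very good, specific answer to the particular problem of parsing the DMOZ catalog. However, since this is a good Google hit for large XML in general, I will repost my answer from the other question here as well:
My take on it:
https://github.com/prewk/XmlStreamer
A simple class that extracts all children of the XML root element while streaming the file. Tested on a 108 MB XML file from pubmed.com.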
class SimpleXmlStreamer extends XmlStreamer {
public function processNode($xmlString, $elementName, $nodeIndex) {
$xml = simplexml_load_string($xmlString);
// Do something with your SimpleXML object
return true;
}
}
$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();
I recently had to parse some fairly large XML documents and needed a way to read one element at a time.
If you have the following file, complex-test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
<Object>
<Title>Title 1</Title>
<Name>It's name goes here</Name>
<ObjectData>
<Info1></Info1>
<Info2></Info2>
<Info3></Info3>
<Info4></Info4>
</ObjectData>
<Date></Date>
</Object>
<Object></Object>
<Object>
<AnotherObject></AnotherObject>
<Data></Data>
</Object>
<Object></Object>
<Object></Object>
</Complex>
and you want to return the <Object/>s, the PHP is:
require_once('class.chunk.php');
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
while ($xml = $file->read()) {
$obj = simplexml_load_string($xml);
// do some parsing, insert to DB whatever
}
###########
Class File
###########
<?php
/**
* Chunk
*
* Reads a large file in as chunks for easier parsing.
*
* The chunks returned are whole <$this->options['element']/>s found within file.
*
* Each call to read() returns the whole element including start and end tags.
*
* Tested with a 1.8MB file, extracted 500 elements in 0.11s
* (with no work done, just extracting the elements)
*
* Usage:
* <code>
* // initialize the object
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
*
* // loop through the file until all lines are read
* while ($xml = $file->read()) {
* // do whatever you want with the string
* $o = simplexml_load_string($xml);
* }
* </code>
*
* @package default
* @author Dom Hastings
*/
class Chunk {
/**
* options
*
* @var array Contains all major options
* @access public
*/
public $options = array(
'path' => './', // string The path to check for $file in
'element' => '', // string The XML element to return
'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk
);
/**
* file
*
* @var string The filename being read
* @access public
*/
public $file = '';
/**
* pointer
*
* @var integer The current position the file is being read from
* @access public
*/
public $pointer = 0;
/**
* handle
*
* @var resource The fopen() resource
* @access private
*/
private $handle = null;
/**
* reading
*
* @var boolean Whether the script is currently reading the file
* @access private
*/
private $reading = false;
/**
* readBuffer
*
* @var string Used to make sure start tags aren't missed
* @access private
*/
private $readBuffer = '';
/**
* __construct
*
* Builds the Chunk object
*
* @param string $file The filename to work with
* @param array $options The options with which to parse the file
* @author Dom Hastings
* @access public
*/
public function __construct($file, $options = array()) {
// merge the options together
$this->options = array_merge($this->options, (is_array($options) ? $options : array()));
// check that the path ends with a /
if (substr($this->options['path'], -1) != '/') {
$this->options['path'] .= '/';
}
// normalize the filename
$file = basename($file);
// make sure chunkSize is an int
$this->options['chunkSize'] = intval($this->options['chunkSize']);
// check it's valid
if ($this->options['chunkSize'] < 64) {
$this->options['chunkSize'] = 512;
}
// set the filename
$this->file = realpath($this->options['path'].$file);
// check the file exists
if (!file_exists($this->file)) {
throw new Exception('Cannot load file: '.$this->file);
}
// open the file
$this->handle = fopen($this->file, 'r');
// check the file opened successfully
if (!$this->handle) {
throw new Exception('Error opening file for reading');
}
}
/**
* __destruct
*
* Cleans up
*
* @return void
* @author Dom Hastings
* @access public
*/
public function __destruct() {
// close the file resource
fclose($this->handle);
}
/**
* read
*
* Reads the first available occurrence of the XML element $this->options['element']
*
* @return string The XML string from $this->file
* @author Dom Hastings
* @access public
*/
public function read() {
// check we have an element specified
if (!empty($this->options['element'])) {
// trim it
$element = trim($this->options['element']);
} else {
$element = '';
}
// initialize the buffer
$buffer = false;
// if the element is empty
if (empty($element)) {
// let the script know we're reading
$this->reading = true;
// read in the whole doc, cos we don't know what's wanted
while ($this->reading) {
$buffer .= fread($this->handle, $this->options['chunkSize']);
$this->reading = (!feof($this->handle));
}
// return it all
return $buffer;
// we must be looking for a specific element
} else {
// set up the strings to find
$open = '<'.$element.'>';
$close = '</'.$element.'>';
// let the script know we're reading
$this->reading = true;
// reset the global buffer
$this->readBuffer = '';
// this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
$store = false;
// seek to the position we need in the file
fseek($this->handle, $this->pointer);
// start reading
while ($this->reading && !feof($this->handle)) {
// store the chunk in a temporary variable
$tmp = fread($this->handle, $this->options['chunkSize']);
// update the global buffer
$this->readBuffer .= $tmp;
// check for the open string
$checkOpen = strpos($tmp, $open);
// if it wasn't in the new buffer
if (!$checkOpen && !($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkOpen = strpos($this->readBuffer, $open);
// if it was in there
if ($checkOpen) {
// set it to the remainder
$checkOpen = $checkOpen % $this->options['chunkSize'];
}
}
// check for the close string
$checkClose = strpos($tmp, $close);
// if it wasn't in the new buffer
if (!$checkClose && ($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkClose = strpos($this->readBuffer, $close);
// if it was in there
if ($checkClose) {
// set it to the remainder plus the length of the close string itself
$checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
}
// if it was
} elseif ($checkClose) {
// add the length of the close string itself
$checkClose += strlen($close);
}
// if we've found the opening string and we're not already reading another element
if ($checkOpen !== false && !($store)) {
// if we've found the end element too
if ($checkClose !== false) {
// append the string only between the start and end element
$buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
} else {
// append the data we know to be part of this element
$buffer .= substr($tmp, $checkOpen);
// update the pointer
$this->pointer += $this->options['chunkSize'];
// let the script know we're gonna be storing all the data until we find the close element
$store = true;
}
// if we've found the closing element
} elseif ($checkClose !== false) {
// update the buffer with the data up to and including the close tag
$buffer .= substr($tmp, 0, $checkClose);
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
// if we've found the closing element, but half in the previous chunk
} elseif ($store) {
// update the buffer
$buffer .= $tmp;
// and the pointer
$this->pointer += $this->options['chunkSize'];
}
}
}
// return the element (or the whole file if we're not looking for elements)
return $buffer;
}
}
I would suggest using a SAX-based parser rather than a DOM-based one.
Information on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm
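This isn't a great solution, but just to throw another option out there:
You can break many large XML files into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with is).
For example, if your document looks like: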
<dmoz>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
...
</dmoz>
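you can read it in a megabyte or two at a time, artificially wrap the few complete <listing> tags you loaded in a root-level tag, and then load them via simplexml/domxml (I used domxml when taking this approach).
Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that you're stuck with either the chunking strategy above or the old SAX/expat library. I don't know about the rest of you, but I hate writing/maintaining SAX/expat parsers.
Note, however, that this approach is not really practical when your document doesn't consist of many identical bottom-level elements (for example, it works well for any sort of list of files, URLs, etc., but wouldn't make sense for parsing a large HTML document).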
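To make the chunking idea concrete, here is a rough sketch using SimpleXML. The function name `readListingBatches`, the `<batch>` wrapper tag and the default chunk size are all made up for illustration. It assumes the elements are never nested inside each other, that the tag text does not appear inside comments or CDATA, and that no other element name starts with the same letters.

```php
<?php
// Hypothetical helper: reads $file in $chunkSize-byte pieces and yields one
// SimpleXMLElement per piece, containing every complete <$tag> seen so far.
function readListingBatches(string $file, string $tag = 'listing', int $chunkSize = 1048576): Generator
{
    $fh = fopen($file, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $file");
    }

    $open   = "<$tag";
    $close  = "</$tag>";
    $buffer = '';

    while (!feof($fh)) {
        $buffer .= (string) fread($fh, $chunkSize);

        // Find the end of the last complete element in the buffer.
        $end = strrpos($buffer, $close);
        if ($end === false) {
            continue; // no complete element yet, keep reading
        }
        $end += strlen($close);

        // Skip anything before the first element (e.g. the real root tag).
        $start = strpos($buffer, $open);
        if ($start === false) {
            continue;
        }

        $complete = substr($buffer, $start, $end - $start);
        $buffer   = substr($buffer, $end); // keep the unfinished tail

        // Wrap the complete elements in an artificial root so they parse.
        yield simplexml_load_string("<batch>$complete</batch>");
    }
    fclose($fh);
}
```

Each yielded batch can then be walked with something like `foreach ($batch->listing as $listing)`. A small chunk size means more, smaller batches, while a larger one trades memory for fewer parser invocations.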
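This is an old post, but it comes first in the Google search results, so I thought I'd post another solution based on this post:
http://drib.tech/programming/parse-large-xml-files-php
This solution uses both XMLReader and SimpleXMLElement: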
$xmlFile = 'the_LARGE_xml_file_to_load.xml';
$primEL  = 'the_name_of_your_element';

$xml = new XMLReader();
$xml->open($xmlFile);

// finding first primary element to work with
while ($xml->read() && $xml->name != $primEL) {;}

// looping through elements
while ($xml->name == $primEL) {
    // loading element data into simpleXML object
    $element = new SimpleXMLElement($xml->readOuterXML());

    // DO STUFF

    // moving pointer
    $xml->next($primEL);
    // clearing current element
    unset($element);
} // end while

$xml->close();
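For this you can use XMLReader combined with DOM. In PHP, both APIs (and SimpleXML) are based on the same library, libxml2. Large XML files are typically lists of records. So you use XMLReader to iterate over the records, load a single record into DOM, and use DOM methods and XPath to extract the values. The key is the method XMLReader::expand(). It loads the current node of the XMLReader instance and its descendants as DOM nodes.
Example XML: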
<books>
<book>
<title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title>
</book>
<book>
<title isbn="978-0596100506">XML Pocket Reference</title>
</book>
<!-- ... -->
</books>
Example code:
// open the XML file
$reader = new XMLReader();
$reader->open('books.xml');
// prepare a DOM document
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// find the first `book` element node at any depth
while ($reader->read() && $reader->localName !== 'book') {
continue;
}
// as long as there is a node with the name "book"
while ($reader->localName === 'book') {
// expand the node into the prepared DOM
$book = $reader->expand($document);
// use Xpath expressions to fetch values
var_dump(
$xpath->evaluate('string(title/@isbn)', $book),
$xpath->evaluate('string(title)', $book)
);
// move to the next book sibling node
$reader->next('book');
}
$reader->close();
Note that the expanded node is never appended to the DOM document. This allows the GC to clean it up.
This approach works with XML namespaces as well.
$namespaceURI = 'urn:example-books';
$reader = new XMLReader();
$reader->open('books.xml');
$document = new DOMDocument();
$xpath = new DOMXpath($document);
// register a prefix for the Xpath expressions
$xpath->registerNamespace('b', $namespaceURI);
// compare local node name and namespace URI
while (
$reader->read() &&
(
$reader->localName !== 'book' ||
$reader->namespaceURI !== $namespaceURI
)
) {
continue;
}
// iterate the book elements
while ($reader->localName === 'book') {
// validate that they are in the namespace
if ($reader->namespaceURI === $namespaceURI) {
$book = $reader->expand($document);
var_dump(
$xpath->evaluate('string(b:title/@isbn)', $book),
$xpath->evaluate('string(b:title)', $book)
);
}
$reader->next('book');
}
$reader->close();
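I've written a wrapper for XMLReader which (IMHO) makes it easier to get just the bits you're after. The wrapper lets you associate a set of paths to data elements with callbacks that run when a path is found. The paths allow regular expressions and capture groups, which can also be passed to the callback.
The library is at https://github.com/NigelRel3/XMLReaderReg and can also be installed using composer require nigelrel3/xml-reader-reg.
An example of how to use it...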
$inputFile = __DIR__ ."/../tests/data/simpleTest1.xml";
$reader = new XMLReaderReg\XMLReaderReg();
$reader->open($inputFile);
$reader->process([
'(.*/person(?:\[\d*\])?)' => function (SimpleXMLElement $data, $path): void {
echo "1) Value for ".$path[1]." is ".PHP_EOL.
$data->asXML().PHP_EOL;
},
'(.*/person3(\[\d*\])?)' => function (DOMElement $data, $path): void {
echo "2) Value for ".$path[1]." is ".PHP_EOL.
$data->ownerDocument->saveXML($data).PHP_EOL;
},
'/root/person2/firstname' => function (string $data): void {
echo "3) Value for /root/person2/firstname is ". $data.PHP_EOL;
}
]);
$reader->close();
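As can be seen from the example, you can have the data passed as a SimpleXMLElement, a DOMElement, or, as in the last callback, a string. It will represent only the data that matches the path.
The paths also show how capture groups can be used: (.*/person(?:\[\d*\])?) looks for any person element (including arrays of elements), and $path[1] in the callback shows the path where this particular instance was found.
There is an expanded example in the library, as well as unit tests.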
I tested the following code with a 2 GB XML file:
<?php
set_time_limit(0);
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
die("Failed to open 'data.xml'");
}
while($reader->read())
{
$node = $reader->expand();
// process $node...
}
$reader->close();
?>
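One thing to keep in mind with this loop is that expand() builds a DOM subtree for whatever node the reader is on, so calling it on a container element pulls in everything below it. A possible variation, sketched here under the assumption that the records you care about are the direct children of the root element (depth 1), expands only those and then skips over their subtree. The function name `forEachRecord` and the `$recordDepth` parameter are made up for this example.

```php
<?php
// Hypothetical helper: calls $handle with a DOMNode for every element found
// at $recordDepth (the root element is depth 0) and returns how many it saw.
function forEachRecord(string $file, callable $handle, int $recordDepth = 1): int
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Failed to open '$file'");
    }

    $count = 0;
    $more  = $reader->read();
    while ($more) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->depth === $recordDepth) {
            // Build a DOM subtree for this one record only.
            $handle($reader->expand());
            $count++;
            // next() skips the record's subtree instead of visiting each child.
            $more = $reader->next();
        } else {
            $more = $reader->read();
        }
    }

    $reader->close();
    return $count;
}
```

The callback receives a DOMNode that is only valid until the reader moves on, so anything you need should be copied out inside the callback.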
My solution:
$reader = new XMLReader();
$reader->open($fileTMP);
while ($reader->read()) {
if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'xmltag' && $reader->isEmptyElement === false) {
$item = simplexml_load_string($reader->readOuterXML(), null, LIBXML_NOCDATA);
//operations on file
}
}
$reader->close();
A very efficient way is
preg_split('/(<|>)/m', $xmlString);
After that, only a single loop over the resulting pieces is needed.
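As a rough illustration of what that single loop might look like: splitting on `<` and `>` makes the pieces alternate between text (even indexes) and tag contents (odd indexes). The sketch below is an assumption about how the idea could be used, with a made-up function name `textOfElements`. It only matches tags without attributes, breaks on comments, CDATA or a literal `>` inside text, and needs the whole XML string in memory, so it is not a streaming approach.

```php
<?php
// Hypothetical helper: returns the text found directly inside every
// <$element>...</$element> pair, using only preg_split and one loop.
function textOfElements(string $xmlString, string $element): array
{
    // Even indexes hold text between tags, odd indexes hold tag contents.
    $tokens = preg_split('/(<|>)/m', $xmlString);

    $values = [];
    $inside = false;
    foreach ($tokens as $i => $token) {
        if ($i % 2 === 1) {
            // Tag content: track whether we are inside the wanted element.
            if ($token === $element) {
                $inside = true;
            } elseif ($token === '/' . $element) {
                $inside = false;
            }
        } elseif ($inside) {
            $values[] = $token;
        }
    }
    return $values;
}
```

For anything beyond very regular, machine-generated XML, a real parser such as XMLReader is the safer choice.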