xml - Practical way of reading XML with huge text nodes in Perl

Question

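After running into XML data files that contain huge text nodes, I looked for ways to read and evaluate them in my data-processing scripts.

The XML files are 3D coordinate files for a molecular modeling application and have the following structure (example):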

<?xml version="1.0" encoding="UTF-8"?>
<hoomd_xml version="1.4">
   <configuration>
      <position>
        -0.101000   0.011000  -40.000000
        -0.077000   0.008000  -40.469000
        -0.008000   0.001000  -40.934000
        -0.301000   0.033000  -41.157000
         0.213000  -0.023000  -41.348000
         ...
         ... 300,000 to 500,000 lines may follow  >>
         ...
        -0.140000   0.015000  -42.556000
      </position>

      <next_huge_section_of_the_same_pattern>
        ...
        ...
        ...
      </next_huge_section_of_the_same_pattern>

   </configuration>
</hoomd_xml>

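Each XML file contains several huge text nodes, between 60 MB and 100 MB in size depending on the content.

I first tried the naive approach with XML::Simple, but the loader takes a very long time on the initial parse of the file: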

...
my $data = $xml->XMLin('structure_80mb.xml');
...

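and then stops with "internal error: Huge input lookup", so this approach isn't very practical.

The next attempt was to read the file with XML::LibXML, but here the initial loader bails out immediately with the error message "parser error : xmlSAX2Characters: huge text node".

Before writing about this on stackoverflow, I wrote myself a q&d parser and ran the file through it (after slurping the xx MB XML file into the scalar $xml):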

...
# read the <position> data from in-memory xml file
my @Coord = xml_parser_hack('position', $xml);
...

It returns the data of each line as an array and finishes within a few seconds. It looks like this:

sub xml_parser_hack {
 my ($tagname, $xml) = @_;
 return () unless $xml =~ /^</;

 my @Data = ();
 my ($p0, $p1) = (undef,undef);
 # /g in scalar context keeps pos(), so the end-tag search resumes after the start tag
 $p0 = $+[0] if $xml =~ /^[ \t]*<\Q$tagname\E[^>]*>[^\r\n]*[\r\n]+/msg; # start tag (may be indented)
 $p1 = $-[0] if $xml =~ /^[ \t]*<\/\Q$tagname\E[^>]*>/msg;             # end tag
 return () unless defined $p0 && defined $p1;
 my @Lines = split /[\r\n]+/, substr $xml, $p0, $p1-$p0;
 for my $line (@Lines) {
    push @Data, [ split ' ', $line ];   # split ' ' skips leading whitespace
 }
 return @Data;
}

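This works fine so far, but of course it can't be considered "production-ready".

Q: How can I read the file using a Perl module? Which module should I choose?

Thanks in advance

rbo

Addendum: after reading choroba's comment, I looked deeper into XML::LibXML. Opening the file with my $reader = XML::LibXML::Reader->new(location =>'structure_80mb.xml'); works, contrary to what I thought before. The error occurs when I try to access the text node below the tag: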

...
while ($reader->read) {
   # bails out in the loop iteration after accessing the <position> tag,
   # if the position's text node is accessed
   #   --  xmlSAX2Characters: huge text node ---
...

score 2 · Accepted Answer

Try XML::LibXML with the huge parser option:

my $doc = XML::LibXML->load_xml(
    location => 'structure_80mb.xml',
    huge     => 1,
);

Or, if you want to use XML::LibXML::Reader:

my $reader = XML::LibXML::Reader->new(
    location => 'structure_80mb.xml',
    huge     => 1,
);
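
As a rough sketch of how the reader variant could be used to get at the coordinates: the loop below stops at the first position element, copies it with its text, and splits the text into rows. The inline sample string and the @coords name are assumptions for illustration only; with the real file you would pass location => 'structure_80mb.xml' instead of string => $xml.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;   # also exports the XML_READER_TYPE_* constants

# Tiny stand-in for the real 80 MB file (illustrative data only).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<hoomd_xml version="1.4">
   <configuration>
      <position>
        -0.101000   0.011000  -40.000000
        -0.077000   0.008000  -40.469000
      </position>
   </configuration>
</hoomd_xml>
XML

my $reader = XML::LibXML::Reader->new(
    string => $xml,
    huge   => 1,    # relax libxml2's hardcoded size limits
);

my @coords;
while ($reader->read) {
    # Skip everything except the opening <position> element.
    if (   $reader->nodeType == XML_READER_TYPE_ELEMENT
        && $reader->name eq 'position')
    {
        # Deep-copy the current element and take its text content.
        my $text = $reader->copyCurrentNode(1)->textContent;

        for my $line (split /\n/, $text) {
            my @fields = split ' ', $line;       # ignores leading whitespace
            push @coords, \@fields if @fields;   # skip blank lines
        }
        last;
    }
}

printf "%d rows read\n", scalar @coords;
```

Each entry of @coords is an array reference holding the three fields of one line, which is the same shape the q&d parser in the question returns.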

score 1 · Accepted Answer

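I was able to simulate an answer using XML::LibXML. Try this and let me know if it doesn't work. I created an XML document with more than 500k lines in the position element, and I was able to parse it and print its contents: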

use strict;
use warnings;
use XML::LibXML;

my $xml = XML::LibXML->load_xml(location => '/perl/test.xml');
my $nodes = $xml->findnodes('/hoomd_xml/configuration/position');
print $nodes->[0]->textContent . "\n";
print scalar(@{$nodes}) . "\n";

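I use findnodes with an XPath expression to pull out all the nodes I want. In scalar context findnodes returns an XML::LibXML::NodeList, which is essentially an array reference, so you can loop over $nodes for however many matching nodes your document actually contains.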
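
To illustrate the looping part, here is a minimal sketch that walks every section under configuration and splits its text into rows. The velocity element, the %rows hash, and the inline sample are made up for this example, and huge => 1 is added on the assumption that the real file is large enough to need it.

```perl
use strict;
use warnings;
use XML::LibXML;

# Small stand-in document; <velocity> is a hypothetical second section.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<hoomd_xml version="1.4">
   <configuration>
      <position>
        -0.101000   0.011000  -40.000000
        -0.077000   0.008000  -40.469000
      </position>
      <velocity>
         0.100000   0.200000    0.300000
      </velocity>
   </configuration>
</hoomd_xml>
XML

my $doc = XML::LibXML->load_xml(string => $xml, huge => 1);

# Map each section name to a list of rows (array refs of fields).
my %rows;
for my $node ($doc->findnodes('/hoomd_xml/configuration/*')) {
    my @section;
    for my $line (split /\n/, $node->textContent) {
        my @fields = split ' ', $line;        # ignores leading whitespace
        push @section, \@fields if @fields;   # skip blank lines
    }
    $rows{ $node->nodeName } = \@section;
}

printf "%s: %d rows\n", $_, scalar @{ $rows{$_} } for sort keys %rows;
```

In list context findnodes returns the nodes directly, so the for loop needs no dereferencing.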
