xml - XML::LibXML for Perl: parsing a huge XML file via XPath results in a segmentation fault (core dumped)

Question

I have a huge XML file in the following format:

<XML>
<Application id="1" attr1="some value" attr2="some val"..and many more attr also with nested tags inside application which might contain more attributes
</Application>

<Application id="2" attr1="some value" attr2="some val"..and many more attr also with nested tags inside application which might contain more attributes
</Application>

<Application id="3" attr1="some value" attr2="some val"..and many more attr also with nested tags inside application which might contain more attributes
</Application>

 .... probably 10000 more Application entries
</XML>

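Each Application tag has only attributes and no text content, but it also contains nested tags that can have attributes of their own, and I need to parse the file and extract some of those attributes. I am using the script below. It works fine on a small subset of the Application tags, but it gets very slow as the number of records grows, and unfortunately, when I run it on the full file, or even on half of it, it ends with a segmentation fault (core dumped).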

Here is my script. Any suggestions on how to do this better would be much appreciated.

score 2 · Accepted Answer

I am sure you could get XML::LibXML::Reader to do this, but I am not familiar with it. Here is how you would do it with XML::Twig.

I have only given examples of how to get at the data inside the Application element.

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my $filename1 = "exam.xml";

# the handler is called each time an Application element has been completely parsed
my $parser = XML::Twig->new( twig_handlers => { Application => \&process_application })
                        ->parsefile($filename1);

sub process_application
  { my( $t, $sample)= @_;
    my $hncid    = $sample->att('ID');                    # get an attribute (names are case-sensitive)
    my @persons  = $sample->children( 'Person');
    my @aplnamt  = map { $_->att( 'APLN') } @persons;     # that's how you get all attribute values
    my @students = $sample->findnodes( './Person/Student');
    my @nsschl   = map { $_->att('NS') } @students;
    my @d81      = $sample->descendants('*[@D8CHRG]');    # native navigation
    my @d81_alt  = $sample->findnodes('.//*[@D8CHRG]');   # alternative: you can use a subset of XPath

    $t->purge;                                            # this is where you free the memory
  }

Come to think of it, you could actually use XML::Twig::XPath to get the full power of XPath; I am just more used to XML::Twig's native navigation methods.
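For illustration, here is a minimal sketch of the same handler approach with XML::Twig::XPath. It assumes that XML::XPathEngine (or XML::XPath) is installed, since XML::Twig::XPath relies on one of them. The sample data, the `%apln_of_students` hash and the XPath expression are made up for this example; only the element and attribute names (`Application`, `Person`, `Student`, `ID`, `APLN`, `NS`) are taken from the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig::XPath;    # assumption: XML::XPathEngine or XML::XPath is installed

# Hypothetical sample data; for a real file, call ->parsefile($filename) instead.
my $sample_xml = <<'XML';
<XML>
  <Application ID="1"><Person APLN="100"><Student NS="x"/></Person><Person APLN="200"/></Application>
  <Application ID="2"><Person APLN="300"/></Application>
</XML>
XML

# Application ID => APLN values of the Person children that have a Student child
my %apln_of_students;

sub process_application {
    my ( $t, $app ) = @_;

    # an XPath predicate that calls a function, evaluated relative to this Application
    my @persons = $app->findnodes('./Person[count(Student) > 0]');
    $apln_of_students{ $app->att('ID') } = [ map { $_->att('APLN') } @persons ];

    $t->purge;           # free the memory used so far, as in the XML::Twig version
}

XML::Twig::XPath->new( twig_handlers => { Application => \&process_application } )
                ->parse($sample_xml);

for my $id ( sort keys %apln_of_students ) {
    print "Application $id: @{ $apln_of_students{$id} }\n";
}
```

The handler and the `purge` call are unchanged from the plain XML::Twig version; the only difference is that `findnodes` is now handled by a full XPath engine rather than by XML::Twig's built-in subset.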

score 1 · Accepted Answer

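I think your problem is that libXML is a tree-based parser, so your entire document is read into memory. You could look into a stream-based parser and build only the structure you need yourself.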
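As a rough sketch of what a stream-based approach could look like, the following uses XML::LibXML::Reader, the pull parser that ships with XML::LibXML. The sample data, the `extract_applications` helper and the `Person`/`APLN` names are assumptions made for illustration; the asker's real attribute names are not shown in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML::Reader;

# Hypothetical sample data; for the real file, create the reader with
# XML::LibXML::Reader->new( location => $filename ) instead.
my $sample_xml = <<'XML';
<XML>
  <Application id="1" attr1="a"><Person APLN="100"/><Person APLN="200"/></Application>
  <Application id="2" attr1="b"><Person APLN="300"/></Application>
</XML>
XML

# Walk the document one Application at a time and return a list of
# { id => ..., apln => [ ... ] } records.
sub extract_applications {
    my ($xml) = @_;
    my $reader = XML::LibXML::Reader->new( string => $xml )
        or die "cannot create reader\n";

    my @records;

    # nextElement() skips forward to the next element with the given name
    while ( $reader->nextElement('Application') == 1 ) {
        my $id = $reader->getAttribute('id');    # attribute of the current element

        # Copy just this Application subtree into a small DOM node, so that
        # ordinary XPath can be used on its nested tags.
        my $app  = $reader->copyCurrentNode(1);
        my @apln = map { $_->getAttribute('APLN') } $app->findnodes('./Person');

        push @records, { id => $id, apln => \@apln };
    }
    return @records;
}

for my $rec ( extract_applications($sample_xml) ) {
    print "Application $rec->{id}: @{ $rec->{apln} }\n";
}
```

Because only one Application subtree is copied into a DOM at a time, memory use should depend on the size of a single Application rather than on the size of the whole file.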

score 0 · Accepted Answer

Here is a test. Input XML file: test2.xml

<?xml version="1.0" encoding="UTF-8"?>
<metabolite>
  <version>3.6</version>
  <creation_date>2005-11-16 15:48:42 UTC</creation_date>
  <update_date>2014-06-11 23:17:42 UTC</update_date>
  <accession>HMDB00001</accession>
  <secondary_accessions>
    <accession>HMDB04935</accession>
    <accession>HMDB06703</accession>
    <accession>HMDB06704</accession>
  </secondary_accessions>
  <name>1-Methylhistidine</name>
</metabolite>

Here is my Perl script: parse_hmdb_metabolites_xml.pl

#!/usr/bin/perl -w 

use strict;
use Getopt::Long;
use XML::Simple;

my $usage= "\n$0 
--xml     \t<str>\thmdb xml file
--outf    \t<str>\toutput file
\n";

my($xml,$outf);

GetOptions(
                "xml:s"=>\$xml,
                "outf:s"=>\$outf
);

die $usage if !defined $xml;

print "$xml\n";
my $cust_xml = XMLin($xml);

Here is the test output:

perl parse_hmdb_metabolites_xml.pl  --xml test2.xml
test2.xml
Segmentation fault (core dumped)

I will test XML::LibXML next.
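A minimal sketch of what such an XML::LibXML test might look like for this small file follows. The `parse_metabolite` helper and the choice of fields are assumptions made for illustration, and the XML is a shortened inline copy of test2.xml so that the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;

# Shortened inline copy of test2.xml; for a file on disk, use
# XML::LibXML->load_xml( location => $file ) instead.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<metabolite>
  <accession>HMDB00001</accession>
  <secondary_accessions>
    <accession>HMDB04935</accession>
    <accession>HMDB06703</accession>
    <accession>HMDB06704</accession>
  </secondary_accessions>
  <name>1-Methylhistidine</name>
</metabolite>
XML

# Parse one metabolite document and pull out a few fields with XPath.
sub parse_metabolite {
    my ($string) = @_;
    my $doc = XML::LibXML->load_xml( string => $string );
    return {
        accession => $doc->findvalue('/metabolite/accession'),
        name      => $doc->findvalue('/metabolite/name'),
        secondary => [ map { $_->textContent }
                       $doc->findnodes('/metabolite/secondary_accessions/accession') ],
    };
}

my $m = parse_metabolite($xml);
print "$m->{accession}\t$m->{name}\t@{ $m->{secondary} }\n";
```

This loads the whole document into memory, which is fine for a file of this size but is the same tree-based behaviour described in the answer above for very large inputs.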
