xml - How to speed up XML::Twig

Question

I'm using XML::Twig to parse a very large XML document. I want to split it into chunks based on the <change></change> tags.

Right now I have:

my $xml = XML::Twig->new(twig_handlers => { 'change' => \&parseChange, });
$xml->parsefile($LOGFILE);

sub parseChange {

  my ($xml, $change) = @_;

  my $message = $change->first_child('message');
  my @lines   = $message->children_text('line');

  foreach (@lines) {
    if ($_ =~ /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/) {
      print outputData "$_\n";
    }
  }

  outputData->flush();
  $change->purge;
}

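Right now, parseChange runs as each chunk is pulled out of the XML, and it is going extremely slowly. I tested it against reading the XML from the file with $/ set to </change> and writing a function to return the contents of an XML tag, and that ran much faster.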
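
For readers unfamiliar with the record-separator trick, here is a minimal sketch of what such a test might look like. It is not the code actually used for the comparison: the helper name tag_contents and the sample data are hypothetical, and the regex-based extraction only works because the <change> records are flat and never nested.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between every <tag>...</tag> pair in a chunk.
sub tag_contents {
    my ($chunk, $tag) = @_;
    return $chunk =~ m{<\Q$tag\E>(.*?)</\Q$tag\E>}gs;
}

# Sample data standing in for the real log file (an assumption for this sketch).
my $sample = <<'XML';
<change>
<message>
    <line>Fix bug 1234 in parser</line>
    <line>Change-Id: Iae22</line>
</message>
</change>
<change>
<message>
    <line>chmod the output scripts</line>
</message>
</change>
XML

my @hits;
open my $in, '<', \$sample or die "Cannot open sample: $!";
{
    local $/ = '</change>';    # each readline now returns one <change> record
    while (my $chunk = <$in>) {
        # Skip chunks with no <message>, e.g. the trailing newline after the last record.
        my ($message) = tag_contents($chunk, 'message') or next;
        push @hits, grep { /[^a-z0-9]bug[^a-z0-9]/i } tag_contents($message, 'line');
    }
}
close $in;

print "$_\n" for @hits;
```

This is fast because it does no real XML parsing: entities, CDATA sections, attributes and nested elements are all ignored, so it is only a fair comparison for input as regular as the sample.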

Am I missing something, or am I using XML::Twig incorrectly? I'm new to Perl.

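Edit: Here is a sample change from the changes file. The file consists of many of these one after another, and there shouldn't be anything between them: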

<change>
<project>device_common</project>
<commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
<tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>      
<parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>      
<author_name>Jean-Baptiste Queru</author_name>      
<author_e-mail>jbq@google.com</author_e-mail>      
<author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>      
<commiter_name>Jean-Baptiste Queru</commiter_name>      
<commiter_email>jbq@google.com</commiter_email>      
<committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>      
<subject>chmod the output scripts</subject>      
<message>         
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>      
</message>      
<target>         
    <line>generate-blob-scripts.sh</line>      
</target>   
</change>

score 3 · Accepted Answer

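As it stands, your program is processing the entire XML document, including the data outside the change elements that you aren't interested in.

If you change the twig_handlers parameter in the constructor to twig_roots, the tree structures will be built only for the elements of interest, and the rest will be ignored.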

my $xml = XML::Twig->new(twig_roots => { change => \&parseChange });
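
Putting that together, a self-contained sketch might look like the following. It assumes XML::Twig is installed and that the <change> elements sit inside a single root element, here called changes; the sample data, the parse_change name and the @bug_lines array are illustrative, not taken from the question.

```perl
use strict;
use warnings;
use XML::Twig;

my @bug_lines;    # collected here so the result is easy to inspect

my $twig = XML::Twig->new(
    # Only <change> subtrees are built; everything outside them is skipped.
    twig_roots => { change => \&parse_change },
);

sub parse_change {
    my ($t, $change) = @_;
    my $message = $change->first_child('message');
    if ($message) {
        push @bug_lines,
            grep { /[^a-z0-9]bug[^a-z0-9]/i } $message->children_text('line');
    }
    $t->purge;    # release the memory held by everything parsed so far
}

# Inline sample document (an assumption); use parsefile() for a real file.
$twig->parse(<<'XML');
<changes>
  <change>
    <project>device_common</project>
    <message>
      <line>Fix bug 1234 in parser</line>
      <line>Change-Id: Iae22</line>
    </message>
  </change>
  <change>
    <project>device_common</project>
    <message>
      <line>chmod the output scripts</line>
    </message>
  </change>
</changes>
XML

print "$_\n" for @bug_lines;
```

If nearly everything in the file is inside a <change> element, the saving from twig_roots will be small, since those subtrees still have to be built.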

score 1 · Accepted Answer

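XML::Twig includes a mechanism by which you can handle tags as they appear, then discard what you no longer need in order to free memory.

Here is an example taken from the documentation (which also has a lot more helpful information):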

my $t= XML::Twig->new( twig_handlers => 
                          { section => \&section,
                            para   => sub { $_->set_tag( 'p'); }
                          },
                       );
  $t->parsefile( 'doc.xml');

  # the handler is called once a section is completely parsed, ie when 
  # the end tag for section is found, it receives the twig itself and
  # the element (including all its sub-elements) as arguments
  sub section 
    { my( $t, $section)= @_;      # arguments for all twig_handlers
      $section->set_tag( 'div');  # change the tag name.4, my favourite method...
      # let's use the attribute nb as a prefix to the title
      my $title= $section->first_child( 'title'); # find the title
      my $nb= $title->att( 'nb'); # get the attribute
      $title->prefix( "$nb - ");  # easy isn't it?
      $section->flush;            # outputs the section and frees memory
    }

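This can be essential when working with multi-gigabyte files, because (again, according to the documentation) storing the entire thing in memory can take as much as 10 times the size of the file.

Edit: A couple of comments based on your edited question. Without knowing more about your file structure it isn't clear exactly what is slowing you down, but here are a few things to try: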

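Flushing the output filehandle will slow you down if you're writing a lot of lines. Perl buffers file writes precisely for performance reasons, and you're bypassing that.
Rather than using the (?i) mechanism, a fairly advanced feature that may carry a performance penalty, why not make the whole match case-insensitive? /[^a-z0-9]bug[^a-z0-9]/i is equivalent. You could also simplify it to /\bbug\b/i, which is nearly equivalent; the differences are that \b treats the underscore as a word character, and that \b also matches at the very start or end of the line, where the character-class version requires an actual character.
A few other simplifications can also be made to remove intermediate steps.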
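
To make the regex point concrete, here is a small sketch that runs the three forms against a few made-up sample lines (the samples and the %pattern and %matches names are illustrative only).

```perl
use strict;
use warnings;

my %pattern = (
    inline => qr/[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/,    # form used in the question
    flag   => qr/[^a-z0-9]bug[^a-z0-9]/i,                  # whole match case-insensitive
    word   => qr/\bbug\b/i,                                # word-boundary form
);

# Hypothetical sample lines chosen to show where the forms agree and differ.
my @samples = (
    'Fix BUG 42',
    'debugging only',
    'fix_bug_here',
    'bug at line start',
);

my %matches;
for my $name (sort keys %pattern) {
    $matches{$name} = [ grep { $_ =~ $pattern{$name} } @samples ];
    printf "%-6s matches: %s\n", $name, join(' | ', @{ $matches{$name} });
}
```

The inline and flag forms pick the same lines. The word form skips fix_bug_here, because the underscore counts as a word character, and accepts the line that begins with bug, because \b needs no preceding character.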

How does this handler code compare to yours for speed?

sub parseChange
{
    my ($xml, $change) = @_;

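    # children_text('line') yields one string per <line>, as in the question's handler;
    # first_child_text('message') would return the whole message as a single string.
    foreach (grep { /[^a-z0-9]bug[^a-z0-9]/i } $change->first_child('message')->children_text('line'))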
    {
        print outputData "$_\n";
    }

    $change->purge;
}

score 0 · Accepted Answer

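Not an XML::Twig answer, but...

If you're extracting content from an XML file, you may want to consider XSLT. Using xsltproc and the following XSL stylesheet, I got the bug-containing change lines out of 1Gb of <change>s in about a minute. I'm sure plenty of improvements are possible.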

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >

  <xsl:output method="text"/>
  <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
  <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />

  <xsl:template match="/">
    <xsl:apply-templates select="changes/change/message/line"/>
  </xsl:template>

  <xsl:template match="line">
    <xsl:variable name="lower" select="translate(.,$uppercase,$lowercase)" />
    <xsl:if test="contains($lower,'bug')">
      <xsl:value-of select="."/>
      <xsl:text>
</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

If your XML processing can be done as

extract to plain text
wrangle the flattened text
profit

then XSLT may be the tool for the first step in that process.

score 0 · Accepted Answer

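If your XML is really large, use XML::SAX. It doesn't have to load the whole data set into memory; instead, it reads the file sequentially and generates callback events for each tag. I have successfully used XML::SAX to parse XML files larger than 1GB. Here is an example XML::SAX handler for your data: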

#!/usr/bin/env perl
package Change::Extractor;
use 5.010;
use strict;
use warnings qw(all);

use base qw(XML::SAX::Base);

sub new {
    bless { data => '', path => [] }, shift;
}

sub start_element {
    my ($self, $el) = @_;
    $self->{data} = '';
    push @{$self->{path}} => $el->{Name};
}

sub end_element {
    my ($self, $el) = @_;
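    # Compare the element path as a string; smartmatch (~~) is experimental
    # since Perl 5.18 and deprecated in 5.38.
    if (join('/', @{ $self->{path} }) eq 'change/message/line') {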
        say $self->{data};
    }
    pop @{$self->{path}};
}

sub characters {
    my ($self, $data) = @_;
    $self->{data} .= $data->{Data};
}

1;

package main;
use strict;
use warnings qw(all);

use XML::SAX::PurePerl;

my $handler = Change::Extractor->new;
my $parser = XML::SAX::PurePerl->new(Handler => $handler);

$parser->parse_file(\*DATA);

__DATA__
<?xml version="1.0"?>
<change>
  <project>device_common</project>
  <commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
  <tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>
  <parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>
  <author_name>Jean-Baptiste Queru</author_name>
  <author_e-mail>jbq@google.com</author_e-mail>
  <author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>
  <commiter_name>Jean-Baptiste Queru</commiter_name>
  <commiter_email>jbq@google.com</commiter_email>
  <committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>
  <subject>chmod the output scripts</subject>
  <message>
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>
  </message>
  <target>
    <line>generate-blob-scripts.sh</line>
  </target>
</change>

Output

Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f

score 0 · Accepted Answer

Mine takes a really long time.

my $twig = XML::Twig->new(
    twig_handlers => {
        SchoolInfo => \&schoolinfo,
    },
    pretty_print => 'indented',
);

$twig->parsefile( 'data/SchoolInfos.2018-04-17.xml');

sub schoolinfo {
  my( $twig, $l)= @_;
  my $rec = {
                 name   => $l->field('SchoolName'),
                 refid  => $l->{'att'}->{RefId},
                 phone  => $l->field('SchoolPhoneNumber'),
                };

  for my $node ( $l->findnodes( '//Street' ) )    { $rec->{street} = $node->text; }
  for my $node ( $l->findnodes( '//Town' ) )      { $rec->{city} = $node->text; }
  for my $node ( $l->findnodes( '//PostCode' ) )  { $rec->{postcode} = $node->text; }
  for my $node ( $l->findnodes( '//Latitude' ) )  { $rec->{lat} = $node->text; }
  for my $node ( $l->findnodes( '//Longitude' ) ) { $rec->{lng} = $node->text; }     
}

Could pretty_print be the cause? Otherwise it's very simple.
