xml - Performance issue processing a huge file (>10 GB) with XML::Twig

Question

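I have to process a huge XML file (>10 GB) to convert it to CSV. I am using XML::Twig.

The file contains data for around 2.6 million customers, each of which has around 100 to 150 fields (depending on the customer's profile).

I store all the values for one subscriber in a hash, %customer, and when processing is done I output the hash's values to a text file in CSV format.

The issue is performance. It takes around 6 to 8 hours to process. How can this be reduced?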

my $t = XML::Twig->new(
  twig_handlers => {
    'objects/simple'   => \&simpleProcess ,
    'objects/detailed' => \&detailedProcess ,
  },
  twig_roots => { objects => 1}
);

sub simpleProcess {
  my ($t, $simple) = @_;

  %customer= (); #reset the hash
  $customer{id}  = $simple->first_child_text('id');
  $customer{Key} = $simple->first_child_text('Key');
}

The detailed tag contains multiple fields, including nested ones, so I call a function each time to collect the different types of fields.

sub detailedProcess {
  my ($t, $detailed1) = @_;

  $detailed = $detailed1;
  if ($detailed->has_children('profile11')){ &profile11();}
  if ($detailed->has_children('profile12')){ &profile12();}
  if ($detailed->has_children('profile13')){ &profile13();}
}
sub profile11 {
  foreach $comcb ($detailed->children('profile11')) {
    $customer{COMCBcontrol} = $comcb->first_child_text('ValueID');
  }
}

The other functions (value2, value3) work the same way. I have left out the remaining functions to keep things simple.

<objecProfile>
    <simple>
        <id>12345</id>
        <Key>N894FE</Key>
    </simple>
    <detailed>
        <ntype>single</ntype>
        <SubscriberType>genericSubscriber</SubscriberType>
        <odbssm>0</odbssm>
        <osb1>true</osb1>
        <natcrw>true</natcrw>
        <sr>2</sr>
        <Profile11>
            <ValueID>098765</ValueID>
        </Profile11>
        <Profile21>
            <ValueID>098765</ValueID>
        </Profile21>
        <Profile22>
            <ValueID>098765</ValueID>
        </Profile22>
        <Profile61>
            <ValueID>098765</ValueID>
        </Profile61>
    </detailed>
</objectProfile>

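Now the question is: I use foreach for every child, even though almost every time the child occurs only once in a customer's profile. Could that be causing the delay, or are there any other suggestions for improving performance? Threading, etc.? (I googled and found that threading doesn't help much.)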

score 2 · Accepted Answer

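I suggest using XML::LibXML::Reader. It is very efficient because it doesn't build an XML tree in memory unless you ask it to, and it is based on the excellent LibXML library.

You will have to get used to a different API from XML::Twig, but IMO it is still fairly simple.

The code below does exactly what your own code does, and my timings indicate that 10 million records like the one you show will be processed in 30 minutes.

It works by repeatedly scanning to the next <object> element (I'm not sure whether this should be <objecProfile>, as your question is inconsistent), copying the node and its descendants into an XML::LibXML::Element object $copy so that the subtree can be accessed, and then extracting the required information into %customer.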

use strict;
use warnings;

use XML::LibXML::Reader;

my $filename = 'objects.xml';

my $reader = XML::LibXML::Reader->new(location => $filename)
        or die qq(cannot read "$filename": $!);

while ($reader->nextElement('object')) {

    my %customer;

    my $copy = $reader->copyCurrentNode(1);

    my ($simple) = $copy->findnodes('simple');
    $customer{id}  = $simple->findvalue('id');
    $customer{Key} = $simple->findvalue('Key');

    my ($detailed) = $copy->findnodes('detailed');
    $customer{COMCBcontrol} = $detailed->findvalue('(Profile11 | Profile12 | Profile13)/ValueID');

    # Do something with %customer
}
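
The "Do something with %customer" step is where each record would become a CSV line. The following is only a minimal sketch of that step using core Perl: the column list @columns and the helper names csv_field and csv_row are hypothetical and not part of the question or the answer. For real data, a dedicated CSV module from CPAN such as Text::CSV is probably the safer choice.

```perl
use strict;
use warnings;

# Hypothetical column order; replace with the real field names.
my @columns = qw(id Key COMCBcontrol);

# Quote one field following the usual CSV convention: wrap it in double
# quotes when it contains a comma, a quote, or a line break, and double
# any embedded quotes. Undefined values become empty fields.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ($value =~ /[",\r\n]/) {
        $value =~ s/"/""/g;
        return qq("$value");
    }
    return $value;
}

# Build one CSV line from a customer hash reference, in @columns order.
sub csv_row {
    my ($customer) = @_;
    return join(',', map { csv_field($customer->{$_}) } @columns) . "\n";
}

# Example record taken from the sample XML in the question.
my %customer = (id => '12345', Key => 'N894FE', COMCBcontrol => '098765');
print csv_row(\%customer);
```

Inside the while loop above, the print would go to the output filehandle once per record, so no more than one customer is ever held in memory.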

score 1 · Accepted Answer

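First, use Devel::DProf or Devel::NYTProf to find out what is slowing your code down. However, I think most of the work will be inside the XML parser, so I don't think this can improve the speed much.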

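Alternatively, I suggest you split this XML into several parts without parsing it (keeping each part well-formed) and run ncpu forks to process each part independently, producing intermediate files with aggregated values, and then process those.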
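
A rough sketch of the forking part, assuming the file has already been split into chunk files, could look like this. The names process_chunk and run_forked and the ".csv" output naming are hypothetical, and the worker here only writes a marker line where the real per-chunk parsing would go.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical worker: in real use this would parse one chunk file and
# write its CSV part. Here it only records which chunk it was given.
sub process_chunk {
    my ($chunk_file, $out_file) = @_;
    open my $out, '>', $out_file or die qq(cannot write "$out_file": $!);
    print {$out} "processed $chunk_file\n";
    close $out or die qq(cannot close "$out_file": $!);
}

# Fork one child per chunk, then wait for all of them.
# Returns the number of children that exited with a failure status.
sub run_forked {
    my (@chunks) = @_;
    my %pid_to_chunk;
    for my $chunk (@chunks) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child process: handle one chunk, then leave immediately
            # without running the parent's END blocks.
            my $ok = eval { process_chunk($chunk, "$chunk.csv"); 1 };
            POSIX::_exit($ok ? 0 : 1);
        }
        $pid_to_chunk{$pid} = $chunk;
    }
    my $failed = 0;
    while (%pid_to_chunk) {
        my $pid = wait();
        last if $pid == -1;          # no children left
        $failed++ if $? != 0;        # non-zero exit status or signal
        delete $pid_to_chunk{$pid};
    }
    return $failed;
}

# Usage (one chunk per CPU, file names are placeholders):
# my $failed = run_forked(map { "chunk_$_.xml" } 1 .. 4);
```

The parent can then concatenate the per-chunk output files in order. Whether this helps depends on whether the job is limited by CPU or by disk throughput.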

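Or you can convert this XML into something that can be parsed without an XML parser. For example, it seems you need the id, Key, and ValueID fields, so you could remove the "\n" characters from the input file and produce another file with one objectProfile per line. Then feed each line to a parser. This lets you process one file with multiple threads, so you use all the CPUs. The string </objectProfile> could probably serve as the record separator. You need to study your XML format to decide.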
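
As an illustration of the record-separator idea, here is a minimal sketch that sets Perl's input record separator $/ to the closing tag and extracts fields with regular expressions. It assumes the tags carry no attributes and the values contain no CDATA sections, entities, or nested markup; the function name read_customers and the callback interface are hypothetical.

```perl
use strict;
use warnings;

# Read one record at a time, using the closing tag as the input record
# separator, and call $callback with a hash reference for each record.
sub read_customers {
    my ($fh, $callback) = @_;
    local $/ = '</objectProfile>';
    while (my $record = <$fh>) {
        # Skip anything that is not a full record, such as the
        # whitespace left after the last closing tag.
        next unless $record =~ /<simple>/;
        my %customer;
        ($customer{id})  = $record =~ m{<id>([^<]*)</id>};
        ($customer{Key}) = $record =~ m{<Key>([^<]*)</Key>};
        # First ValueID found under Profile11, Profile12 or Profile13;
        # stays undefined when none of them is present.
        ($customer{COMCBcontrol}) =
            $record =~ m{<Profile1[123]>\s*<ValueID>([^<]*)</ValueID>};
        $callback->(\%customer);
    }
}

# Usage (file name is a placeholder):
# open my $fh, '<', 'objects.xml' or die qq(cannot read: $!);
# read_customers($fh, sub { my ($c) = @_; print "$c->{id},$c->{Key}\n" });
```

Because only the closing tag is used as the separator, the inconsistent spelling of the opening tag in the sample does not matter here.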

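P.S. Someone will want to downvote me with "parsing XML yourself is bad" or a link to that effect. But sometimes, when you have a heavy load or very large input data, you have a choice: do it the "legitimate" way, or do it in the given time with the given precision. Users/customers don't care how you do it; they want the result.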
