
I have an XML file like this.

Each line of the file begins and ends with a process_info tag. The file can contain many such lines, and there may be many similar files.

<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><cal
lIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction 
id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeO
fCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>

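I want to count all the distinct values of the result element, so my output would look like this:

"D14 - Calls *144" count 2
"Duplicate CDR" count 1
"Error on CDR level; File processing continued." count 1

How can I do this? I'd like to use XML::Twig or XML::Parser, but because the file contains many start/end tags I couldn't work out a solution.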


5 Answers


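This is easy to do with any Perl XML module, but since you mention XML::Twig, that is what I have used in this solution.

You say there may be many similar XML files but don't say how to identify them, so all I can do is give you a solution for a single file and hope you can extrapolate from there.

The program works by reading the file line by line, parsing each line as a separate XML document, and extracting the text of the first child of the root element that has the tag result. This text is used as a hash key to keep track of the number of occurrences of each distinct result.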

use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig->new;

my %results;

open my $fh, '<', 'my.xml' or die $!;

while (<$fh>) {
  $twig->parse($_);
  my $result = $twig->root->first_child('result');
  if ($result) {
    $result = $result->trimmed_text;
    $results{$result}++;
  }
}

for (sort keys %results) {
  my $n = $results{$_};
  printf qq("%s" count %d\n), $_, $n;
}
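
To extend this to several files, one option is to pass the file names on the command line and accumulate everything into the same hash. A minimal sketch, assuming each record sits on a single line as the question describes; the sub name `count_results` is my own, and `safe_parse` is used so that a line that is not well-formed is skipped rather than aborting the run (which also means such lines are silently left out of the counts):

```perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical helper: tally the <result> text across any number of files.
sub count_results {
  my @files = @_;
  my %results;

  for my $file (@files) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
      next unless $line =~ /\S/;          # skip blank lines
      my $twig = XML::Twig->new;          # fresh twig for every record
      $twig->safe_parse($line) or next;   # skip lines that fail to parse
      my $result = $twig->root->first_child('result') or next;
      $results{ $result->trimmed_text }++;
    }
    close $fh;
  }
  return \%results;
}

# Usage: perl count.pl file1.xml file2.xml ...
my $results = count_results(@ARGV);
printf qq("%s" count %d\n), $_, $results->{$_} for sort keys %$results;
```

The report uses the same format and key order as the single-file program.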

Output

"D14 - Calls *144" count 2
"Duplicate CDR" count 1
"Error on CDR level; File processing continued." count 1
answered 2012-11-24T18:12:57.830

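You can use Mojo::DOM, the excellent DOM parser from the Mojolicious suite, to count these. It is simple: use a hash (%count) to track how often you have found each result. This is a typical Perl idiom for this kind of problem.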

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# read all input lines at once
my $dom = Mojo::DOM->new(do {local $/; <DATA>});

# prepare count hash
my %count = ();

# iterate result elements
$dom->find('result')->each(sub {
    my $element = shift;
    $count{$element->text}++;
});

# output
say "$_: $count{$_}" for keys %count;

__DATA__
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><cal
lIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction 
id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeO
fCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>

Output:

Duplicate CDR: 1
Error on CDR level; File processing continued.: 1
D14 - Calls *144: 2
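
Because Perl hashes are unordered, the lines above can come out in a different order from run to run. A minimal sketch of the same counting idiom with a deterministic report, most frequent value first; the sub name `format_counts` and the hard-coded sample list are illustrative and not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a count hash into report lines,
# highest count first, ties broken alphabetically.
sub format_counts {
    my ($count) = @_;
    my @keys = sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
    return map { "$_: $count->{$_}" } @keys;
}

# The usual idiom: one hash slot per distinct value, incremented as it is seen.
my %count;
$count{$_}++ for (
    'D14 - Calls *144',
    'Error on CDR level; File processing continued.',
    'Duplicate CDR',
    'D14 - Calls *144',
);

print "$_\n" for format_counts(\%count);
```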
answered 2012-11-24T17:17:28.830

You can use XML::SAX::PurePerl, which is very fault-tolerant and, in my experience, copes well with messy XML:

#!/usr/bin/env perl
package Result::Extractor;
use strict;
use warnings qw(all);

use base qw(XML::SAX::Base);

sub new {
    return bless {
        count   => {},
        data    => '',
    };
}

sub start_element {
    my ($self, $el) = @_;
    $self->{data} = '';
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'result') {
        ++$self->{count}{$self->{data}};
    }
}

sub characters {
    my ($self, $data) = @_;
    $self->{data} .= $data->{Data};
}

1;

package main;
use strict;
use warnings qw(all);

use Data::Printer;
use XML::SAX::PurePerl;

my $handler = Result::Extractor->new;
my $parser = XML::SAX::PurePerl->new(Handler => $handler);

$parser->parse_string(do { local $/; '<wrapper>' . <DATA> . '</wrapper>' });

p $handler->{count};

__DATA__
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><cal
lIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction 
id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeO
fCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>

Result:

\ {
    'Duplicate CDR'                                    1,
    'D14 - Calls *144'                                 2,
    'Error on CDR level; File processing continued.'   1
}

You can also look at XML::SAX::Expat, XML::SAX::ExpatXS, and XML::LibXML::SAX; they are faster, but more prone to failing on bad input.

answered 2012-11-26T02:21:39.380

If you assume that every instance of <result>...</result> is one you are interested in, then you might be able to use a regular expression:

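use strict;
use warnings;
use File::Slurp qw(read_file);   # read_file() is assumed to come from File::Slurp

my $doc = read_file("file.xml"); # slurp in the doc
my %count;

# Capture the text between each <result ...> start tag and its end tag.
while ($doc =~ m,<result.*?>(.*?)</result>,g) {
  $count{$1}++;
}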
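
If the files are large, the same kind of pattern can be applied one line at a time instead of slurping the whole document. A minimal sketch, assuming each <result> element starts and ends on the same line as in the sample; `count_results_fh` is a hypothetical helper name, and the pattern uses `\b[^>]*` so that a differently named tag such as `<results>` is not matched:

```perl
use strict;
use warnings;

# Hypothetical helper: add the <result> texts read from a filehandle to %$count.
sub count_results_fh {
    my ($fh, $count) = @_;
    while (my $line = <$fh>) {
        # A line may hold more than one <result>, hence the inner /g loop.
        while ($line =~ m{<result\b[^>]*>(.*?)</result>}g) {
            $count->{$1}++;
        }
    }
    return $count;
}

# Usage: perl count.pl file1.xml file2.xml ...
my %count;
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    count_results_fh($fh, \%count);
    close $fh;
}
print "$_: $count{$_}\n" for sort keys %count;
```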

But I would use a real XML processing library for this, such as XML::XPath. It is very easy to adapt the XML::XPath example program to your XML file:

use XML::XPath;
use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => 'file.xml');

my $nodeset = $xp->find('/zzz/process_info/result'); # find all results

my %count;
foreach my $node ($nodeset->get_nodelist) {
  $count{ $node->string_value } ++;
}

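Note the XPath I used, /zzz/...: the top level of an XML document must be a single element, so I wrapped your sample in <zzz>...</zzz>.

This is a more robust solution, because it only locates result elements that are children of process_info elements.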
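
The wrapping step does not have to be done by hand. A minimal sketch, assuming the file has no XML declaration or DOCTYPE of its own (the sample shows none); `wrap_in_root` is a hypothetical helper, and `zzz` is just the placeholder root name used above:

```perl
use strict;
use warnings;

# Hypothetical helper: read a file of sibling elements and return it as a
# single document by adding an artificial root element around the content.
sub wrap_in_root {
    my ($file, $root) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $body = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    $body = '' unless defined $body;
    return "<$root>\n$body</$root>\n";
}

# Usage: perl wrap.pl file.xml > wrapped.xml
print wrap_in_root($ARGV[0], 'zzz') if @ARGV;
```

The returned string can be handed to the parser in place of a file name, for example `XML::XPath->new(xml => wrap_in_root('file.xml', 'zzz'))`.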

answered 2012-11-24T16:45:03.723
perl -MXML::Twig -E'XML::Twig->new( twig_handlers => { result => sub { $count{$_->text}++ } })->parsefile( $ARGV[0]); say "$_: $count{$_}" foreach sort keys %count; ' count.xml

This would work if your data were XML.

It isn't.

answered 2012-11-24T17:48:40.993