0

I am using SimpleXml in Perl to extract the data in this tag:

<description>&lt;strong&gt;CUSIP:&lt;/strong&gt; 912828UC2&lt;br /&gt;&lt;strong&gt;Term and Type:&lt;/strong&gt; 3-Year Note&lt;br /&gt;&lt;strong&gt;Offering Amount:&lt;/strong&gt; $32,000,000,000&lt;br /&gt;&lt;strong&gt;Auction Date:&lt;/strong&gt; 12/11/2012&lt;br /&gt;&lt;strong&gt;Issue Date:&lt;/strong&gt; 12/17/2012&lt;br /&gt;&lt;strong&gt;Maturity Date:&lt;/strong&gt; 12/15/2015&lt;br /&gt;&lt;a href="http://www.treasurydirect.gov/instit/annceresult/press/preanre/2012/A_20121206_6.pdf"&gt;PDF version of the announcement&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.treasurydirect.gov/xml/A_20121206_6.xml"&gt;XML version of the announcement&lt;/a&gt;&lt;br /&gt;</description>

I am now having trouble extracting the individual tokens. For example, for the auction date I am using:

if ($desc=~m/Auction\sDate:<\/strong>\s+(\d\d\/\d\d\/\d\d\d\d)<br/) {

}

But I don't think this is robust enough. What is the standard way to extract the fields?

3 Answers

2

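As dan1111 points out in his answer, if you are already using an XML parser (Simple::XML?), you should stick with it to parse the data inside the description tag. Trying to pull data out of an XML/HTML feed with hand-written pattern matching is not a good idea. Use a parser built for that purpose.

Given the format of the data in your post, I am assuming you don't have valid HTML that a parser could help you with. In that case there is no "standard" way to extract the fields, but here is how I would approach the problem: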

print "$desc\n";

my @parts = split(/;br /, $desc);
my %dates;

foreach my $part (@parts) {
  if ($part =~ m/(\w+\s+Date).+(\d{2}\/\d{2}\/\d{4})/) {
    $dates{$1} = $2;
  }
}

foreach my $label (keys %dates) {
  printf "%-16s%12s\n", "${label}:", $dates{$label};
}

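Looking at the raw string, I can see that there are 3 dates and several other records, so the first thing to do is to split them up. I noticed that every record in the string is separated by the characters ';br ', so I used that for the split: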

my @parts = split(/;br /, $desc);

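After that, I have an array containing each distinct piece of data from the string. Now I only need to parse each piece. Since your question is about the Auction Date value, I wrote a regular expression that captures a date. Anticipating that the other dates might be useful too, I adjusted the regular expression so that it also captures the label (Auction, Issue, Maturity), and I store each label/date pair in a hash (%dates):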

foreach my $part (@parts) {
  if ($part =~ m/(\w+\s+Date).+(\d{2}\/\d{2}\/\d{4})/) {
    $dates{$1} = $2;
  }
}

Finally, I just print out my hash:

foreach my $label (keys %dates) {
  printf "%-16s%12s\n", "${label}:", $dates{$label};
}
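If the non-date fields are needed as well, the same split-then-match idea can be stretched a little. The following is only a sketch under a few assumptions: the sample string is a shortened copy of the question's data, the hash name %fields is made up for illustration, and it relies on every label being wrapped in an escaped <strong> tag with no "&" inside the value.

```perl
use strict;
use warnings;

# Shortened sample in the same entity-encoded form as the question's $desc.
my $desc = '&lt;strong&gt;CUSIP:&lt;/strong&gt; 912828UC2&lt;br /&gt;'
         . '&lt;strong&gt;Offering Amount:&lt;/strong&gt; $32,000,000,000&lt;br /&gt;'
         . '&lt;strong&gt;Auction Date:&lt;/strong&gt; 12/11/2012&lt;br /&gt;';

my %fields;    # hypothetical name: label => value
foreach my $part (split /;br /, $desc) {
    # The label sits between "strong&gt;" and ":&lt;/strong&gt;";
    # the value is everything after that up to the next "&".
    if ($part =~ m{strong&gt;([^:]+):&lt;/strong&gt;\s*([^&]+)}) {
        my ($label, $value) = ($1, $2);
        $value =~ s/\s+\z//;    # drop trailing whitespace, if any
        $fields{$label} = $value;
    }
}

printf "%-18s%18s\n", "$_:", $fields{$_} for sort keys %fields;
```

This is still pattern matching on markup, so it shares the fragility mentioned above; it only removes the restriction to labels that end in "Date".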

Make sense?

answered 2012-12-12T14:34:51.717
0

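What is more robust depends on your expected input and on what you are looking for. However, here is something you may find helpful.

I used XML::Twig for this. XML::Simple (which I think is what you are using now) is not recommended for new development because of its various quirks.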

use Modern::Perl;
use XML::Twig;

my $twig = XML::Twig->new();
$twig->parse(<DATA>);

my %params;
my $key;
for my $child (map {$_->text} $twig->root->children)
{
    if ($child =~ /(.*):/)
    {
        $key = $1;  
    }
    else
    {
        $params{$key} = $child if (defined $key);
        undef $key;         
    }
}

say "$_ is $params{$_}" foreach (keys %params); 

__DATA__
<description><strong>CUSIP:</strong> 912828UC2<br /><strong>Term and Type:</strong> 3-Year Note<br /><strong>Offering Amount:</strong> $32,000,000,000<br /><strong>Auction Date:</strong> 12/11/2012<br /><strong>Issue Date:</strong> 12/17/2012<br /><strong>Maturity Date:</strong> 12/15/2015<br /><a href="http://www.treasurydirect.gov/instit/annceresult/press/preanre/2012/A_20121206_6.pdf">PDF version of the announcement</a><br /><a href="http://www.treasurydirect.gov/xml/A_20121206_6.xml">XML version of the announcement</a><br /></description>

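This treats any child node whose text contains a colon as a key (in this data, the labels that end with one) and then assumes that the next node in the tree is the value. Obviously this makes some assumptions about the kind of input you will get, but as long as all the "key" elements are enclosed in tags it is quite robust.

Another approach is to strip all the tags first and then search for key-value pairs in the text alone. You can do this with XML::Twig as well; simply calling $twig->root->text gets the text of the whole element. With this approach, however, it is harder to tell where one value ends and the next key begins.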
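One way to make the text-only approach workable is to know the labels in advance and use them as delimiters. The sketch below assumes exactly that; the flattened string is a hand-written stand-in for what $twig->root->text would return (with the link text left out for brevity), and the names @labels and $label_re are illustrative.

```perl
use strict;
use warnings;

# Hand-written stand-in for the flattened text of the element.
my $text = 'CUSIP: 912828UC2Term and Type: 3-Year Note'
         . 'Offering Amount: $32,000,000,000Auction Date: 12/11/2012';

# Assumption: the full set of labels is known ahead of time.
my @labels   = ('CUSIP', 'Term and Type', 'Offering Amount', 'Auction Date');
my $label_re = join '|', map { quotemeta } @labels;

my %params;
# Each value runs lazily from its label up to the next known label,
# or to the end of the string for the last one.
while ($text =~ /($label_re):\s*(.*?)(?=(?:$label_re):|\z)/gs) {
    $params{$1} = $2;
}

print "$_ is $params{$_}\n" for sort keys %params;
```

If the feed ever adds a label that is not in the list, its text is silently absorbed into the preceding value, which is the weakness of this approach compared with walking the element tree.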

answered 2012-12-12T14:39:54.893
0

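The <description> elements in the RSS feed you show contain valid XHTML fragments as PCDATA. This solution extracts those elements and decodes them, then parses each one in turn to access the text of the <strong> elements and their corresponding values.

Note that each XHTML fragment contains multiple top-level elements, and since a well-formed document may have only one root element, I have wrapped it in a <root> element with $twig->parse("<root>$desc</root>").

Hopefully you can extrapolate from this to access the data you need.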

use strict;
use warnings;

use LWP::Simple;
use XML::Twig;

my $xml = get 'http://www.treasurydirect.gov/RI/TreasuryOfferingAnnouncements.rss';

my $twig = XML::Twig->new;
$twig->parse($xml);

for my $desc ($twig->get_xpath('/rss/channel/item/description')) {
  $desc = $desc->text;
  my $twig = XML::Twig->new;
  $twig->parse("<root>$desc</root>");
  for my $strong ($twig->get_xpath('/root/strong')) {
    my ($key, $val) = ($strong->trimmed_text, $strong->next_sibling->trimmed_text);
    $key =~ s/:$//;
    print "$key -> $val\n";
  }
  print "\n";
}

Output

CUSIP -> 912810QY7
Term and Type -> 29-Year 11-Month Bond
Offering Amount -> $13,000,000,000
Auction Date -> 12/13/2012
Issue Date -> 12/17/2012
Maturity Date -> 11/15/2042

CUSIP -> 912796DT3
Term and Type -> 3-Day Bill
Offering Amount -> $10,000,000,000
Auction Date -> 12/13/2012
Issue Date -> 12/14/2012
Maturity Date -> 12/17/2012

CUSIP -> 912828UE8
Term and Type -> 5-Year Note
Offering Amount -> $35,000,000,000
Auction Date -> 12/18/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2017

CUSIP -> 912828UD0
Term and Type -> 2-Year Note
Offering Amount -> $35,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2014

CUSIP -> 912796AM1
Term and Type -> 26-Week Bill
Offering Amount -> $28,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/20/2012
Maturity Date -> 06/20/2013

CUSIP -> 912828UF5
Term and Type -> 7-Year Note
Offering Amount -> $29,000,000,000
Auction Date -> 12/19/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2019

CUSIP -> 912828SQ4
Term and Type -> 4-Year 4-Month TIPS
Offering Amount -> $14,000,000,000
Auction Date -> 12/20/2012
Issue Date -> 12/31/2012
Maturity Date -> 04/15/2017

CUSIP -> 9127957M7
Term and Type -> 13-Week Bill
Offering Amount -> $32,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/20/2012
Maturity Date -> 03/21/2013

CUSIP -> 912828TY6
Term and Type -> 9-Year 11-Month Note
Offering Amount -> $21,000,000,000
Auction Date -> 12/12/2012
Issue Date -> 12/17/2012
Maturity Date -> 11/15/2022

CUSIP -> 912828UC2
Term and Type -> 3-Year Note
Offering Amount -> $32,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/17/2012
Maturity Date -> 12/15/2015

CUSIP -> 912796AK5
Term and Type -> 52-Week Bill
Offering Amount -> $25,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/13/2012
Maturity Date -> 12/12/2013

CUSIP -> 9127955V9
Term and Type -> 4-Week Bill
Offering Amount -> $40,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/13/2012
Maturity Date -> 01/10/2013

CUSIP -> 912796AL3
Term and Type -> 26-Week Bill
Offering Amount -> $28,000,000,000
Auction Date -> 12/10/2012
Issue Date -> 12/13/2012
Maturity Date -> 06/13/2013

CUSIP -> 9127957L9
Term and Type -> 13-Week Bill
Offering Amount -> $32,000,000,000
Auction Date -> 12/10/2012
Issue Date -> 12/13/2012
Maturity Date -> 03/14/2013

CUSIP -> 912796DT3
Term and Type -> 11-Day Bill
Offering Amount -> $25,000,000,000
Auction Date -> 12/04/2012
Issue Date -> 12/06/2012
Maturity Date -> 12/17/2012

CUSIP -> 9127956Z9
Term and Type -> 4-Week Bill
Offering Amount -> $40,000,000,000
Auction Date -> 12/04/2012
Issue Date -> 12/06/2012
Maturity Date -> 01/03/2013
answered 2012-12-15T04:24:19.583