
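I have an HTML file of about 2 MB that I need to parse; it contains roughly 500 rows and about 70 columns. I obviously need to clean it up in a way that lets me load it into a SQL Server database later. I have parsed files with Perl before, but never an HTML file, and I would like to know whether there are any modules I should look at before resorting to plain regex matching and formatting.

A small update: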

<td class="tableHeaderDarkCenter">CUSIP/ISIN</td>

<td class="tableHeaderDarkCenter">Stock Ticker</td>

<td class="tableHeaderDarkCenter">MLCC Code</td>

<td class="tableHeaderDarkCenter">Bond Ticker</td>

<td class="tableHeaderDarkCenter">Issuer Name</td>

<td class="tableHeaderDarkCenter">Convertible Price(USD)</td>

<td class="tableHeaderDarkCenter">Par Amount</td>

<td class="tableHeaderDarkCenter">Coupon</td>

<td class="tableHeaderDarkCenter">Maturity/Mandatory Conversion Date</td>

<td class="tableHeaderDarkCenter">Outstanding Amt ($MM)</td>

<td class="tableHeaderDarkCenter">Bonds/Shrs Outstanding</td>

<td class="tableHeaderDarkCenter">Market Value($MM)</td>

<td class="tableHeaderDarkCenter">Index Weight(%)</td>

<td class="tableHeaderDarkCenter">YTM(%)</td>

<td class="tableHeaderDarkCenter">YTP(%)</td>

<td class="tableHeaderDarkCenter">Greater of YTM/YTP(%)</td>

<td class="tableHeaderDarkCenter">Duration</td>

<td class="tableHeaderDarkCenter">Currency</td>

<td class="tableHeaderDarkCenter">Country</td>

<td class="tableHeaderDarkCenter">Series</td>

<td class="tableHeaderDarkCenter">Accrued Interest</td>

<td class="tableHeaderDarkCenter">Current Yield(%)</td>

<td class="tableHeaderDarkCenter">Yield Advantage(%)</td>

<td class="tableHeaderDarkCenter">Moody Rating</td>

<td class="tableHeaderDarkCenter">S&P Rating</td>

<td class="tableHeaderDarkCenter">Avg. Rating</td>

<td class="tableHeaderDarkCenter">Internal Rating</td>

<td class="tableHeaderDarkCenter">Collateral Type</td>

<td class="tableHeaderDarkCenter">Status</td>

<td class="tableHeaderDarkCenter">Security Type</td>

<td class="tableHeaderDarkCenter">Announce Date</td>

<td class="tableHeaderDarkCenter">Issue Date</td>

<td class="tableHeaderDarkCenter">At-Issue Yield</td>

<td class="tableHeaderDarkCenter">At-Issue Prem</td>

<td class="tableHeaderDarkCenter">Delta</td>

<td class="tableHeaderDarkCenter">Gamma</td>

<td class="tableHeaderDarkCenter">RHO</td>

<td class="tableHeaderDarkCenter">Theoretical Value</td>

<td class="tableHeaderDarkCenter">Theoretical Discount (%)</td>

<td class="tableHeaderDarkCenter">Cheap (%)</td>

<td class="tableHeaderDarkCenter">Conversion Ratio</td>

<td class="tableHeaderDarkCenter">Parity Cash Adjustment</td>

<td class="tableHeaderDarkCenter">Payback</td>

<td class="tableHeaderDarkCenter">Implied Volatility(%)</td>

<td class="tableHeaderDarkCenter">Implied Spread</td>

<td class="tableHeaderDarkCenter">Parity Delta</td>

<td class="tableHeaderDarkCenter">Conversion Premium(%)</td>

<td class="tableHeaderDarkCenter">Investment Value Premium(%)</td>

<td class="tableHeaderDarkCenter">Investment Value(Bond floor)</td>

<td class="tableHeaderDarkCenter">Price to Par</td>

<td class="tableHeaderDarkCenter">Next Put Date</td>

<td class="tableHeaderDarkCenter">Yrs to Put</td>

<td class="tableHeaderDarkCenter">Yrs to Mat/Mand Conv Date</td>

<td class="tableHeaderDarkCenter">Yrs to Mat/Yrs to Put</td>

<td class="tableHeaderDarkCenter">Years to Call</td>

<td class="tableHeaderDarkCenter">1-Day Total Return(%)</td>

<td class="tableHeaderDarkCenter">1-WK Total Return(%)</td>

<td class="tableHeaderDarkCenter">MTD Total Return(%)</td>

<td class="tableHeaderDarkCenter">QTD Total Return(%)</td>

<td class="tableHeaderDarkCenter">YTD Total Return(%)</td>

<td class="tableHeaderDarkCenter">Index Sector</td>

<td class="tableHeaderDarkCenter">Industry</td>

<td class="tableHeaderDarkCenter">CS Sector L1</td>

<td class="tableHeaderDarkCenter">CS Sector L2</td>

<td class="tableHeaderDarkCenter">CS Sector L3</td>

<td class="tableHeaderDarkCenter">CS Sector L4</td>

<td class="tableHeaderDarkCenter">CS Sector L5</td>

<td class="tableHeaderDarkCenter">ML Sector L1</td>

<td class="tableHeaderDarkCenter">ML Sector L2</td>

<td class="tableHeaderDarkCenter">ML Sector L3</td>

<td class="tableHeaderDarkCenter">ML Sector L4</td>

<td class="tableHeaderDarkCenter">GIC Sector</td>

<td class="tableHeaderDarkCenter">GIC Industry Group</td>

<td class="tableHeaderDarkCenter">GIC Industry</td>

<td class="tableHeaderDarkCenter">GIC Sub Industry</td>

<td class="tableHeaderDarkCenter">Bloomberg Sector</td>

<td class="tableHeaderDarkCenter">Stock Price(USD)</td>

<td class="tableHeaderDarkCenter">Stock Yield</td>

<td class="tableHeaderDarkCenter">1-Day Equity Total Return(%)</td>

<td class="tableHeaderDarkCenter">1-WK Equity Total Return(%)</td>

<td class="tableHeaderDarkCenter">MTD Equity Total Return(%)</td>

<td class="tableHeaderDarkCenter">QTD Equity Total Return(%)</td>

<td class="tableHeaderDarkCenter">YTD Equity Total Return(%)</td>

<td class="tableHeaderDarkCenter">Eq Mkt Value($MM)</td>

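This is what appears in the file; from there to the end of the file are the corresponding values that map to these column names. Obviously there are a lot of them. I am giving HTML::TableExtract a try, but I am not sure whether it is suitable for this case.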


2 Answers


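HTML::TableExtract is your friend here. It is well written (rated five out of five stars across a total of five reviews) and lets you specify the data you want to extract in a very convenient way.


Update

To demonstrate how easy it is to apply HTML::TableExtract to some HTML data, here is some code that prints every row of the first table found in the file.

If the file contains more than one table, you will have to use one of the several methods the module provides to select the one you want.

The data for each row is returned as a Perl array, and storing it in a database with DBI is straightforward.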

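use strict;
use warnings;

use HTML::TableExtract;

# Parse the whole file and collect every table in it
my $te = HTML::TableExtract->new;
$te->parse_file('data_snippet.txt');

# first_table_found returns undef when the file contains no table
my $table = $te->first_table_found
  or die "No table found in data_snippet.txt\n";

# Each row is an array reference holding the cell contents
for my $row ($table->rows) {
  no warnings 'uninitialized';  # empty cells come back as undef
  print "@$row\n";
}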
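
The two points above (picking a table by its headers, then storing the rows with DBI) could be sketched roughly as follows. This is only a sketch under assumptions: the helper names extract_rows and load_rows, the target table convertibles, and its column names are all hypothetical and not part of the question. The headers option is one of the selection methods HTML::TableExtract provides; it keeps only the named columns and, by default, leaves the header row out of the returned rows. Header strings are treated as case-insensitive patterns, so names containing characters such as ( or % would need escaping.

```perl
use strict;
use warnings;

use HTML::TableExtract;

# Return the data rows of the first table whose header row contains all
# of the given column names. Only those columns are kept, and cell text
# is trimmed. Empty cells stay undef, which DBI stores as NULL.
sub extract_rows {
    my ($html, @headers) = @_;

    my $te = HTML::TableExtract->new(headers => \@headers);
    $te->parse($html);
    $te->eof;

    my ($table) = $te->tables;
    return () unless $table;

    return map {
        [ map { defined($_) ? s/^\s+|\s+$//gr : undef } @$_ ]
    } $table->rows;
}

# Insert rows through an already connected DBI handle using placeholders.
# $table_name and @$columns are interpolated into the SQL text, so they
# must come from trusted code, never from the parsed file.
sub load_rows {
    my ($dbh, $table_name, $columns, @rows) = @_;

    my $sql = sprintf 'INSERT INTO %s (%s) VALUES (%s)',
        $table_name,
        join(', ', @$columns),
        join(', ', ('?') x @$columns);

    my $sth = $dbh->prepare($sql);
    $sth->execute(@$_) for @rows;
    return scalar @rows;
}

# Hypothetical usage (DSN, credentials, table and column names are
# assumptions for illustration only):
#
#   use DBI;
#   open my $fh, '<', 'data_snippet.txt' or die "Cannot open file: $!";
#   my $html = do { local $/; <$fh> };
#   my @rows = extract_rows($html, 'CUSIP/ISIN', 'Issuer Name', 'Coupon');
#   my $dbh  = DBI->connect('dbi:ODBC:DSN=my_sqlserver', 'user', 'secret',
#                           { RaiseError => 1, AutoCommit => 0 });
#   load_rows($dbh, 'convertibles',
#             [qw(cusip_isin issuer_name coupon)], @rows);
#   $dbh->commit;
#   $dbh->disconnect;
```

Using placeholders leaves quoting of the cell values to the driver, which matters here because fields such as issuer names can contain quotes or ampersands.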
answered 2012-07-06T16:25:40.257

Use a ready-made HTML parser such as HTML::Parser, HTML::TreeBuilder, or the like. Then simply walk the tables in the DOM.

answered 2012-07-06T16:22:14.027
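
As a rough illustration of that approach, here is a minimal sketch using HTML::TreeBuilder. The helper name table_rows is hypothetical, and the sketch assumes a single, non-nested table: with nested tables, the cells of inner tables would also show up in the rows of the outer one.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Build a DOM from the HTML text and return one array reference per <tr>,
# holding the trimmed text of each <td> or <th> cell in that row.
sub table_rows {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @rows;
    for my $tr ($tree->look_down(_tag => 'tr')) {
        my @cells = map { $_->as_trimmed_text }
            $tr->look_down(sub { $_[0]->tag =~ /^t[dh]$/ });
        push @rows, \@cells if @cells;
    }

    $tree->delete;    # release the parse tree explicitly
    return @rows;
}
```

Unlike the header-based selection in HTML::TableExtract, this returns the header row as the first element, so the caller has to map column names to positions itself.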