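I'm trying to extract all the links from locally stored HTML files and build a hash from them. I use File::Find to locate the HTML files, but I've left that part out of the code below.
- The first hash key is the title
- The second key is the mirror
- The third key is the part, and the value is the URL
Like this: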
$hash{$title}{$mirror}{$part}=$url;
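To make the intended shape concrete, here is a small sketch that fills such a hash by hand and walks it again. The URLs are taken from the sample page in this post; the variable names and the comma-separated output format are only illustrative assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-filled sample, only to show the title -> mirror -> part -> url shape.
my %hash;
$hash{'First hash key'}{Mirror1}{part_1} = 'http://mirror1.com/rvvaq1hi';
$hash{'First hash key'}{Mirror1}{part_2} = 'http://mirror1.com/w33h9ym2';
$hash{'First hash key'}{Mirror2}{part_1} = 'http://mirror2.com/t2wx9603';

# Walk the three levels in sorted order and build one row per URL.
my @rows;
for my $title (sort keys %hash) {
    for my $mirror (sort keys %{ $hash{$title} }) {
        for my $part (sort keys %{ $hash{$title}{$mirror} }) {
            push @rows, join ',', $title, $mirror, $part,
                                  $hash{$title}{$mirror}{$part};
        }
    }
}
print "$_\n" for @rows;
```

Because the part is a hash key, two links with the same title, mirror, and part name cannot both be stored; the later assignment replaces the earlier one.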
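I can get links that have a single part and a single mirror, but I can't get the ones with multiple parts, and right now I'm stuck in an endless loop. I determine the mirror by matching a pattern against the URL. But how do I pick up the part when it is present, fall back to $part = "part_1" when it isn't, and then move on to the next URL?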
#!/usr/bin/perl
my $Html = qq(
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=windows-1250">
<meta name="generator" content="PSPad editor, www.pspad.com">
<title>First hash key</title>
</head>
<body>
<div>
<br><b>Multi Links</b><br><br><!--colorstart:#FF0000-->
<span style="color:#FF0000"><!--/colorstart--><b>Mirror 1</b><!--colorend--></span><!--/colorend-->
<br><a href="http://mirror1.com/rvvaq1hi" target="_blank"><b>Part 1</b></a>
<br><a href="http://mirror1.com/w33h9ym2" target="_blank"><b>Part 2</b></a>
<br><a href="http://mirror1.com/fdnppn15" target="_blank"><b>Part 3</b></a></div>
</div>
<div>
<br><b>Single link multiple mirrors</b><br>
<br><a href="http://mirror1.com/t2wx9603" target="_blank"><!--colorstart:#FF0000--><span style="color:#FF0000"><!--/colorstart--><b>Mirror 1</b><!--colorend--></span><!--/colorend--></a></div>
<br><a href="http://mirror2.com/t2wx9603" target="_blank"><!--colorstart:#FF0000--><span style="color:#FF0000"><!--/colorstart--><b>Mirror 2</b><!--colorend--></span><!--/colorend--></a></div>
</div>
</body>
</html>
);
my @html = split(/\n/, $Html);
my $TheMain = '';
my $Title;
my @Names = qw(Mirror1 Mirror2 Mirror3);
my %hash;
foreach my $line (@html)
{
    print "Da Line [$line]\n";
    if ($line =~ m{<title>(.*?)</title>})
    {
        $Title = $1;
        print "$Title\n";
    }
    $line =~ s/\"/'/g;     # Double quotes to single
    $line =~ s{\n}{}g;     # remove \n
    $line =~ s{\s+}{ }g;   # remove excessive spaces
    $TheMain = $TheMain . $line;
}
print "$TheMain\n";
unless ($TheMain eq "") # unless empty enter the loop
{
    while ($TheMain =~ m{a href=(.*?)/a})
    {
        my $A = $1;
        print "the A $A\n"; ## stuck in a loop
        my ($url,$part);
        $A =~ s/<.*?color.*?>//ig;
        while ($A =~ m{\'(http.*?)\'.*?<b>(.*?)</b> }gi)
        {
            $url = $1;
            $part = $2;
            if ($part =~ m/part/i)
            {
                $part =~ s/ /_/;
            }
            else
            {
                $part = "part_1";
            }
        }
        foreach my $mirror (@Names) # filters out unwanted links
        {
            if ($url =~ /$mirror/i)
            {
                $hash{$Title}{$mirror}{$part} = $url;
            }
        }
    }
}
for my $Title (sort keys %hash)
{
    for my $Host (sort keys %{$hash{$Title}})
    {
        for my $part (sort keys %{$hash{$Title}{$Host}})
        {
            my $url = $hash{$Title}{$Host}{$part};
            print "$Title,$url\n";
        }
    }
}
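
For comparison, here is a minimal sketch of one way the part-or-default logic could look. A `while` over a match without the `/g` modifier restarts from the beginning of the string on every pass, so it never ends as long as the pattern matches; with `/g` each pass resumes after the previous match. The sketch assumes a trimmed copy of the sample markup, a lowercase `@names` list, and an `untitled` fallback, none of which come from the real files. Regexes against HTML are fragile, so treat this as an illustration rather than a robust parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Trimmed copy of the sample markup from the post (an assumption for brevity).
my $html = <<'HTML';
<title>First hash key</title>
<br><a href="http://mirror1.com/rvvaq1hi" target="_blank"><b>Part 1</b></a>
<br><a href="http://mirror1.com/w33h9ym2" target="_blank"><b>Part 2</b></a>
<br><a href="http://mirror1.com/fdnppn15" target="_blank"><b>Part 3</b></a>
<br><a href="http://mirror2.com/t2wx9603" target="_blank"><!--colorstart:#FF0000--><span style="color:#FF0000"><!--/colorstart--><b>Mirror 2</b><!--colorend--></span><!--/colorend--></a>
HTML

my @names = qw(Mirror1 Mirror2 Mirror3);
my %hash;

my ($title) = $html =~ m{<title>(.*?)</title>}is;
$title = 'untitled' unless defined $title;    # hypothetical fallback

# /g lets each iteration continue after the previous anchor,
# so the loop stops once the last <a ...>...</a> has been seen.
while ($html =~ m{<a\s+href=["'](http[^"']+)["'][^>]*>(.*?)</a>}gis) {
    my ($url, $label) = ($1, $2);

    $label =~ s/<!--.*?-->//gs;    # drop HTML comments
    $label =~ s/<[^>]+>//g;        # drop the remaining tags

    # Use the part number when the label names one, otherwise part_1.
    my $part = $label =~ /part\s*(\d+)/i ? "part_$1" : 'part_1';

    # Keep only links whose URL matches one of the known mirrors.
    for my $mirror (@names) {
        next unless $url =~ /\Q$mirror\E/i;
        $hash{$title}{$mirror}{$part} = $url;
        last;
    }
}

for my $t (sort keys %hash) {
    for my $mirror (sort keys %{ $hash{$t} }) {
        for my $part (sort keys %{ $hash{$t}{$mirror} }) {
            print "$t,$mirror,$part,$hash{$t}{$mirror}{$part}\n";
        }
    }
}
```

Note that a page mixing a multi-part list and a single link for the same mirror would put both under `part_1`, and the later link would overwrite the earlier one.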