
4 Answers

3

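This program does what you ask, provided the href attribute is the first attribute in every <a> tag. It also checks whether each name has been seen before and prints it only if it is new.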

use strict;
use warnings;
use 5.010;

my %seen;
while ( <DATA> ) {
  while ( m{<a\s+href="([^"]*)/"}ig ) {
    say $1 unless $seen{$1}++;
  }
}

Output

acinetobacter_baumannii_26016_2
acinetobacter_baumannii_44839_10
acinetobacter_baumannii_45002_9
acinetobacter_baumannii_45075_6
acinetobacter_baumannii_796380_1375
amycolatopsis_mediterranei_gca_000700945_1
bacillus_subtilis_e1
bdellovibrio_bacteriovorus
bifidobacterium_adolescentis
bifidobacterium_breve_31l
bordetella_bronchiseptica_00_p_2796
bordetella_bronchiseptica_980_2
bordetella_bronchiseptica_d993
bordetella_bronchiseptica_mbord665
bordetella_bronchiseptica_mbord782
borrelia_garinii_sz
brucella_pinnipedialis
burkholderia_sp_mp_1
campylobacter_jejuni_10227
campylobacter_jejuni_subsp_jejuni_81_176_drh212
candidatus_caedibacter_acanthamoebae
clostridium_botulinum_d_str_16868
criblamydia_sequanensis_crib_18
enterococcus_faecalis_atcc_29212_gca_000742975_1
enterococcus_faecalis_ga2
enterococcus_faecalis_gan13
enterococcus_faecium_t110
enterococcus_faecium_uc7251
enterococcus_faecium_uc8668
enterococcus_faecium_vre1044
erythrobacter_litoralis
escherichia_coli_1_110_08_s1_c1
escherichia_coli_2_052_05_s3_c1
escherichia_coli_2_177_06_s3_c2
escherichia_coli_2_177_06_s4_c3
escherichia_coli_2_222_05_s1_c2
escherichia_coli_3_020_07_s4_c3
escherichia_coli_3_073_06_s3_c2
escherichia_coli_3_105_05_s3_c3
escherichia_coli_6_537_08_s3_c2
escherichia_coli_6_537_08_s3_c3
escherichia_coli_8_415_05_s4_c1
escherichia_coli_bidmc_72
escherichia_coli_isc56
escherichia_coli_o111_h8_str_f6627
escherichia_coli_o121_h19_str_2011c_3108
escherichia_coli_o157_h7_str_08_3527
escherichia_coli_o157_h7_str_08_4529
escherichia_coli_o157_h7_str_k4527
escherichia_coli_o6_h16_str_f5656c1
escherichia_coli_str_st540_gca_000599685_1
escherichia_coli_str_st540_gca_000599705_1
escherichia_coli_uci_53
flavobacterium_reichenbachii
gammaproteobacteria_bacterium_mfb021
georgenia_sp_subg003
gilliamella_apicola_scgc_ab_598_i20
haemophilus_parasuis_gca_000742795_1
haemophilus_parasuis_hps9
halobacillus_karajensis
halostagnicola_sp_a56
hyphomonas_jannaschiana_vp2
hyphomonas_sp_25b14_1
klebsiella_pneumoniae_chs_43
klebsiella_pneumoniae_chs_49
lactobacillus_oryzae_jcm_18671
listeria_monocytogenes_fsl_f6_684_gca_000525815_1
listeria_monocytogenes_gca_000726305_1
listeria_monocytogenes_gca_000726325_1
listeria_monocytogenes_gca_000726695_1
listeria_monocytogenes_gca_000727065_1
listeria_monocytogenes_gca_000727735_1
listeria_monocytogenes_gca_000728125_1
listeria_monocytogenes_gca_000728365_1
listeria_monocytogenes_gca_000728805_1
listeria_monocytogenes_gca_000728845_1
listeria_monocytogenes_lm_1880
listeria_monocytogenes_wslc1042
listeria_riparia_fsl_s10_1204
morganella_sp_egd_hp17
mycobacterium_africanum_gca_000666065_1
mycobacterium_africanum_mal010074
mycobacterium_africanum_mal010081
mycobacterium_tuberculosis_btb03_108
mycobacterium_tuberculosis_btb04_416
mycobacterium_tuberculosis_btb05_285
mycobacterium_tuberculosis_btb07_323
mycobacterium_tuberculosis_btb08_022
mycobacterium_tuberculosis_btb08_309
mycobacterium_tuberculosis_btb10_357
mycobacterium_tuberculosis_btb11_027
mycobacterium_tuberculosis_btb11_207
mycobacterium_tuberculosis_btb12_001
mycobacterium_tuberculosis_btb12_046
mycobacterium_tuberculosis_gca_000736075_1
mycobacterium_tuberculosis_h2438
mycobacterium_tuberculosis_h2581
mycobacterium_tuberculosis_h3005
mycobacterium_tuberculosis_kt_0043
mycobacterium_tuberculosis_kt_0084
mycobacterium_tuberculosis_kzn_1435_gca_000669675_1
mycobacterium_tuberculosis_m1236
mycobacterium_tuberculosis_m1274
mycobacterium_tuberculosis_m1461
mycobacterium_tuberculosis_m1475
mycobacterium_tuberculosis_m1848
mycobacterium_tuberculosis_m1893
mycobacterium_tuberculosis_m2086
mycobacterium_tuberculosis_m2116
mycobacterium_tuberculosis_m2193
mycobacterium_tuberculosis_m2211
mycobacterium_tuberculosis_m2435
mycobacterium_tuberculosis_mal010078
mycobacterium_tuberculosis_mal020120
mycobacterium_tuberculosis_mal020150
mycobacterium_tuberculosis_md14844
mycobacterium_tuberculosis_md14847
mycobacterium_tuberculosis_md17647
mycobacterium_tuberculosis_md17902
mycobacterium_tuberculosis_md17973
mycobacterium_tuberculosis_nritld54
mycobacterium_tuberculosis_ofxr_11
mycobacterium_tuberculosis_ofxr_15
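
The program above reads from the DATA handle, whose contents are not shown here. As a minimal self-contained sketch, the same pattern and the same %seen idiom can be wrapped in a subroutine and run on a few sample lines. The subroutine name unique_names and the three input lines are hypothetical, modeled on the names in the output above; this is not the original input.

```perl
use strict;
use warnings;

# Return each distinct directory name once, in first-seen order.
# Same pattern as above: href must be the first attribute of the <a> tag.
sub unique_names {
    my @lines = @_;
    my ( %seen, @names );
    for my $line (@lines) {
        while ( $line =~ m{<a\s+href="([^"]*)/"}ig ) {
            # The post-increment is false only the first time a name appears.
            push @names, $1 unless $seen{$1}++;
        }
    }
    return @names;
}

# Hypothetical input lines (assumed shape of a directory listing).
my @sample = (
    '<A HREF="bacillus_subtilis_e1/">bacillus_subtilis_e1/</A>',
    '<A HREF="bdellovibrio_bacteriovorus/">bdellovibrio_bacteriovorus/</A>',
    '<A HREF="bacillus_subtilis_e1/">bacillus_subtilis_e1/</A>',
);

print "$_\n" for unique_names(@sample);
```

With these sample lines the repeated name is printed only once.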
answered 2015-05-07T08:57:40.740
2

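I would suggest using a module in the first place, because HTML does not lend itself to parsing with regular expressions. A regex may work, but it tends to produce brittle code.

So, something like this (credit: http://www.perlmonks.org/?node_id=557357):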

use strict;
use warnings;

use WWW::Mechanize;

my $mech  = WWW::Mechanize->new();

$mech->get( 'file:///C:/path/to/your_html/file.html' );

my @links = $mech->links();

foreach my $link (@links) {
    my $url = $link -> url;
    $url =~ s,/$,,g; 
    print $url,"\n";
}

For your simple data set, though, this should do the trick:

local $/;
my @links = <DATA> =~ m,<A HREF=\"(.*?)/?\">,g;
print join ( "\n", @links );
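
To illustrate the brittleness mentioned above, here is a small sketch that runs a href-first, double-quoted pattern against three ways of writing the same link. The subroutine name names_by_regex and the sample markup are hypothetical and are not taken from the question's data.

```perl
use strict;
use warnings;

# A pattern in the spirit of the regex approach: href must come first
# and must be double-quoted. Returns the captured names in list context.
sub names_by_regex {
    my ($html) = @_;
    return $html =~ m{<a\s+href="([^"]*?)/?"}ig;
}

# Hypothetical markup variations, all valid HTML for the same link.
my %variant = (
    plain         => '<a href="bacillus_subtilis_e1/">x</a>',
    single_quotes => q{<a href='bacillus_subtilis_e1/'>x</a>},
    extra_attr    => '<a class="dir" href="bacillus_subtilis_e1/">x</a>',
);

for my $name ( sort keys %variant ) {
    my @found = names_by_regex( $variant{$name} );
    printf "%-14s %s\n", $name, @found ? "@found" : '(no match)';
}
```

Only the plain variant matches. The other two are ordinary HTML as well, which is why a parsing module is the safer choice once the input is not fully under your control.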
answered 2015-05-07T08:47:33.713
0

Trying the following code with your data, I get results like this:

acinetobacter_baumannii_26016_2
acinetobacter_baumannii_44839_1
acinetobacter_baumannii_45002_9
...

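The code is:

# Open the listing file for reading; stop with the OS error if that fails.
open( my $fh, '<', "/home/httpd/cgi-bin/LDU/list1.txt" ) or die "Cannot open list1.txt: $!";
while ( my $line = <$fh> ) {
    # Capture the run of word characters that precedes a '.' or '>'.
    next unless $line =~ /([0-9A-Za-z_]*)(\s*)[\.>].*/;
    print $1 . "\n";
}
close $fh;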
answered 2015-05-07T08:31:17.830
-1
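#!/usr/bin/perl
use strict;
use warnings;

# Open the listing file; report the OS error if it cannot be read.
open( my $fh, '<', "/home/httpd/cgi-bin/LDU/list1.txt" ) or die "Cannot open list1.txt: $!";

while ( my $line = <$fh> ) {
    # Keep the first run of word characters that follows =" on the line.
    my ($match) = ( $line =~ m/(?:=")(\w+)/g );
    next unless defined $match;    # skip lines with no quoted attribute
    print "$match\n";
}
close $fh;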
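
The list assignment around the match is what makes this code print one name per line. A small sketch of that behavior, using a hypothetical input line with two quoted attribute values (the TITLE attribute is invented for illustration):

```perl
use strict;
use warnings;

# Hypothetical input line with two quoted attribute values.
my $line = '<A HREF="bacillus_subtilis_e1/" TITLE="listing">';

# In list context with /g, the match returns every capture on the line.
my @all = ( $line =~ m/(?:=")(\w+)/g );

# Assigning to a one-element list keeps only the first capture.
my ($first) = ( $line =~ m/(?:=")(\w+)/g );

print "all:   @all\n";
print "first: $first\n";
```

Because \w does not match a slash, the capture stops before the trailing / of the directory name.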
answered 2015-05-07T08:31:22.150