Take these two strings as an example:
my $line = "[cytokine]<ADJVNT-PROP-0> signaling, which have not [to]<PREP> date been shown [to]<PREP> be [[regulat]<EXP-V-0>ed]<EXP-PP-V-0>";
my $line2 = "[Human [papillomavirus]<VACC-PROP-0>]<VACC-PROP-0> genotype [31]<NUM> does not [express]<EXP-V-0> detectable [microRNA]<MIR-0> levels [during]<PREP> latent or productive virus replication.";
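What I want to do is extract every string bounded by <VAC or <ADJ on the left and <EXP on the right.
When there are several matches on the left, the string should start after the innermost (rightmost) one and extend to the farthest <EXP on the right.
For example, for the strings above I want a regular expression that returns: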
Output1: signaling, which have not [to]<PREP> date been shown [to]<PREP> be [[regulat]<EXP-V-0>ed]
Output2: genotype [31]<NUM> does not [express]
Why doesn't this code work?
my @lines = (
    "[cytokine]<ADJVNT-PROP-0> signaling, which have not [to]<PREP> date been shown [to]<PREP> be [[regulat]<EXP-V-0>ed]<EXP-PP-V-0>",
    "[Human [papillomavirus]<VACC-PROP-0>]<VACC-PROP-0> genotype [31]<NUM> does not [express]<EXP-V-0> detectable [microRNA]<MIR-0> levels [during]<PREP> latent or productive virus replication.",
);
my $count = 0;
foreach $line (@lines) {
    $count++;
    my ($sel) = $line =~ /<VAC|<ADJ.*>(.*)<EXP.*>/;
    print "Output $count: $sel\n";
}
It can be run here: https://eval.in/50772
What is the correct way to do this?
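One likely cause: alternation has the lowest precedence in a regex, so /<VAC|<ADJ.*>(.*)<EXP.*>/ is read as "<VAC" OR "<ADJ.*>(.*)<EXP.*>". For the second string the bare <VAC branch matches and captures nothing. For the first string the greedy .* after <ADJ runs past the tag it was meant to close. Below is a minimal sketch of one way to get the two outputs listed above, assuming the left boundary is the rightmost <VAC...> or <ADJ...> tag and the right boundary is the last <EXP. The helper name extract_span is made up for illustration, and trimming the surrounding whitespace is an assumption based on the expected outputs.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the text between
# the rightmost <VAC...> / <ADJ...> tag and the last "<EXP", or undef if the
# string contains no such span.
sub extract_span {
    my ($text) = @_;
    # The leading greedy .* pushes the left tag as far right as possible.
    # (?:VAC|ADJ) groups the alternation so it applies only to the tag name.
    # [^>]* stops at the end of the opening tag instead of running on.
    # The greedy (.*) then extends the capture up to the last "<EXP".
    return unless $text =~ /.*<(?:VAC|ADJ)[^>]*>(.*)<EXP/s;
    my $span = $1;
    $span =~ s/^\s+|\s+$//g;    # assumption: surrounding whitespace is not wanted
    return $span;
}

my @lines = (
    '[cytokine]<ADJVNT-PROP-0> signaling, which have not [to]<PREP> date been shown [to]<PREP> be [[regulat]<EXP-V-0>ed]<EXP-PP-V-0>',
    '[Human [papillomavirus]<VACC-PROP-0>]<VACC-PROP-0> genotype [31]<NUM> does not [express]<EXP-V-0> detectable [microRNA]<MIR-0> levels [during]<PREP> latent or productive virus replication.',
);

my $count = 0;
foreach my $line (@lines) {
    $count++;
    my $sel = extract_span($line);
    print "Output $count: ", (defined $sel ? $sel : '(no match)'), "\n";
}
```

If the span should instead stop at the first <EXP after the left tag, changing (.*) to (.*?) would be the usual adjustment.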