2

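I have strings of the following type (the quotation marks indicate that each one is on a single line):

"AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5"

"PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England"

I want to get everything after the title (the part in all caps). So I want to end up with:

"Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5"

"Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England"

I have many more strings than these two, but the basic format is that the title of the invention is always in uppercase letters and digits.

Is there a way to do this with a regular expression in Perl?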

4

5 Answers

1

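Well, if it doesn't need to be 100% accurate, I would just look for the first uppercase letter that is followed by a lowercase letter, then grab the rest of the line.

Something like this (my Perl is a bit rusty, so forgive any syntax errors):

# The parentheses on the left put the match in list context, so the captured text is assigned
my ($part_of_line) = $full_line =~ /([A-Z][a-z].*)/;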
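A note on capturing the result: in Perl, a match in scalar context returns only success or failure, while in list context it returns the captured groups. A minimal sketch of the difference, using a shortened sample line from the question (`$matched` is just an illustrative name):

```perl
use strict;
use warnings;

my $full_line = 'PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden';

# Scalar context: the match returns only a true/false success flag.
my $matched = $full_line =~ /([A-Z][a-z].*)/;

# List context (note the parentheses): the match returns the captured groups.
my ($part_of_line) = $full_line =~ /([A-Z][a-z].*)/;

print "matched: $matched\n";
print "captured: $part_of_line\n";
```

So if the variable ends up holding `1` instead of the names, a missing pair of parentheses on the left-hand side is the likely cause.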

answered 2012-05-14T06:48:58.527
0

How about:

#!/usr/bin/perl
use strict;
use warnings;
use 5.014;

my $re = qr
    /^                # Start of string
    [\p{Lu}\pN, -]+   # one or more uppercase letters, digits, commas, spaces or dashes
    (                 # start group 1
      \p{Lu}[\pL.']   # one uppercase letter followed by any letter, dot or apostrophe
    )                 # end group
    /x;
while(<DATA>) {
    chomp;
    s/$re/$1/g;       # replace match by group 1
    say;
}


__DATA__
AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS D.Clark
PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS O'Connors

Output:

Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5
Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England
D.Clark
O'Connors
answered 2012-05-14T17:27:56.290
0

I tried this and got the output you expect:

if($ip =~ m/([A-Z0-9,\- ]+)([A-Z]+[a-z]+.*)/)
{
      print "$2";
}
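To try this against both sample lines from the question, the match can be wrapped in a small function. This is only a sketch: `strip_title` is a hypothetical helper name of mine, and the pattern is the one shown in this answer.

```perl
use strict;
use warnings;

# strip_title is a hypothetical helper name, not part of the original answer.
sub strip_title {
    my ($ip) = @_;
    # Group 1: the all-caps title; group 2: everything from the first capitalised word onward.
    return $ip =~ m/([A-Z0-9,\- ]+)([A-Z]+[a-z]+.*)/ ? $2 : undef;
}

my @lines = (
    'AMINO-2,4,6-TRIIODOBENZOIC ACIDS Hugo Holtermann, Baerum, Leif Gunnar Haugen, Oslo, and Knut Wille, Baerum, Norway, assignors to Nye- 5',
    'PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England',
);

for my $line (@lines) {
    my $rest = strip_title($line);
    print defined $rest ? "$rest\n" : "no match\n";
}
```

Returning `undef` on failure lets the caller tell "no title found" apart from an empty remainder.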
answered 2012-05-14T06:53:52.803
0

The title always ends with an uppercase letter followed by a space, so this should work:

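# [^a-z]+ cannot run past the all-caps title; a leading .+ would also match a later "capital + space" such as "ImperiaI "
print $1 if /^[^a-z]+[A-Z] (.+)$/;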
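One caveat when relying on "uppercase letter plus space" to mark the end of the title: a greedy `.+` prefix backtracks from the end of the line, so it settles on the last such position, not the first. The second sample line contains "ImperiaI " (with a capital I at the end), which triggers exactly that. A small sketch comparing a `.+` prefix with one restricted to non-lowercase characters (`$greedy` and `$anchored` are illustrative names):

```perl
use strict;
use warnings;

my $line = 'PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England';

# Greedy prefix: .+ backtracks from the end of the line, so the capture
# starts after the LAST "uppercase letter + space" ("ImperiaI " here).
my ($greedy) = $line =~ /^.+[A-Z]+ (.+)$/;

# Prefix limited to non-lowercase characters: the match cannot run past
# the all-caps title, so the capture starts at the first name.
my ($anchored) = $line =~ /^[^a-z]+[A-Z] (.+)$/;

print "greedy:   $greedy\n";
print "anchored: $anchored\n";
```

The first print shows only "Chemical Industries Limited, London, England", while the second shows the full list of names starting at "Duncan Clark".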
answered 2012-05-14T10:51:48.117
0

Try this:

$text = "PROCESS FOR THE PRODUCTION OF ETHYLENIC COMPOUNDS Duncan Clark and Percy Hayden, Norton-on-Tees, Eng- 5 land, assignors to ImperiaI Chemical Industries Limited, London, England ";

if($text =~ m/(\b[A-Z0-9-, ]+)\b(.*)/) {
    print "$2";
}
answered 2012-05-14T06:48:23.103