regex - How do I capture only the surnames from a regex pattern?

Question

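Team,

I have written a Perl program that validates the formatting (punctuation and so on) of surnames, forenames, and years. If an entry does not follow the specified pattern, it is flagged so that it can be fixed.

For example, my input file contains lines of text like this: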

<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., &amp; Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>

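My program works fine; that is, if an entry does not follow the pattern, the script reports an error. The input line above does not produce any error. The line below is an example of an error, because "Rose A. J." is missing the comma after "Rose":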

NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., &amp; Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey &amp; D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>

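Is it possible to capture all of the surnames and the year from my regex search pattern, so that I can generate a text prefix for each line, as shown below?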

<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., &amp; Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>

My regex search script is as follows:

while(<$INPUT_REF_XML_FH>){
    $line_count += 1;
    chomp;
    if(/

    # bibliomixed XML ID tag and attribute----<START>
    <bibliomixed
    \s+
    id=".*?">
    # bibliomixed XML ID tag and attribute----<END>

    # --------2 OR MORE AUTHOR GROUP--------<START>
    (?:
    (?:
    # pattern for surname----<START>
    (?:(?:[\w\x{2019}\x{0027}]+\s)+)? # surnames with spaces
    (?:(?:[\w\x{2019}\x{0027}]+-)+)?  # surnames with hyphens
    (?:[A-Z](?:\x{2019}|\x{0027}))?  # surnames with closing single quote or apostrophe O’Leary
    (?:St\.\s)? # pattern for St.
    (?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
    (?:[\w\x{2019}\x{0027}]+)  # final surname pattern----REQUIRED
    # pattern for surname----<END>
    ,\s
    # pattern for forename----<START>
    (?:
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    (?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    [A-Z]\. #----REQUIRED
    # pattern for titles....<START>
    (?:,\s(?:Jr\.|Sr\.|II|III|IV))?
    # pattern for titles....<END>
    )
    # pattern for forename----<END>
    ,\s)+
    #---------------FINAL AUTHOR GROUP SEPARATOR----<START>
    &amp;\s
    #---------------FINAL AUTHOR GROUP SEPARATOR----<END>

    # --------2 OR MORE AUTHOR GROUP--------<END>
    )? 

    # --------LAST AUTHOR GROUP--------<START>

    # pattern for surname----<START>
    (?:(?:[\w\x{2019}\x{0027}]+\s)+)? # surnames with spaces
    (?:(?:[\w\x{2019}\x{0027}]+-)+)?  # surnames with hyphens
    (?:[A-Z](?:\x{2019}|\x{0027}))?  # surnames with closing single quote or apostrophe O’Leary
    (?:St\.\s)? # pattern for St.
    (?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
    (?:[\w\x{2019}\x{0027}]+)  # final surname pattern----REQUIRED
    # pattern for surname----<END>
    ,\s
    # pattern for forename----<START>
    (?:
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    (?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
    (?:(?:[A-Z]\.\s)+)?  #initials with periods
    [A-Z]\. #----REQUIRED
    # pattern for titles....<START>
    (?:,\s(?:Jr\.|Sr\.|II|III|IV))?
    # pattern for titles....<END>
    )
    # pattern for forename----<END>

    (?: # pattern for editor notation----<START>
    \s\(Ed(?:s)?\.\)\.
    )? # pattern for editor notation----<END>

    # --------LAST AUTHOR GROUP--------<END>
    \s
    \(
    # pattern for a year----<START>
    (?:[A-Za-z]+,\s)? # July, 1999
    (?:[A-Za-z]+\s)? # July 1999
    (?:[0-9]{4}\/)? # 1999\/2000
    (?:\w+\s\d+,\s)?# August 18, 2003
    (?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
    (?:[A-Za-z])? # 1999a
    (?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
    (?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
    (?:,\s[A-Za-z]+)? # 1999, Spring
    (?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
    (?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
    (?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
    # pattern for a year----<END>
    \)\.
    /six){
        print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
        $found_count += 1;
    } else{
        print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
        $not_found_count += 1;
    }
}

Thanks for your help,

Prem

score 0 · Accepted Answer

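All of your sub-patterns are non-capturing groups, i.e. they start with (?:. That makes the regex cheaper in a number of ways, one of them being that the text matched by each sub-pattern is not saved.

To capture part of the pattern, you just need to put parentheses around the part you want captured. So you can either remove the ?: from a non-capturing group, or add parentheses in the places where you need them. http://perldoc.perl.org/perlretut.html#Non-capturing-groupings

I'm not sure, but from your code I think you may be trying to get the effect of lookahead assertions: for example, you test for a surname with spaces and, failing that, for a surname with hyphens. These tests do not each start from the same point. The first sub-pattern either matches or it doesn't, and the regex then carries on testing the second surname pattern at the next position. Whether the regex will then test the second name against the first sub-pattern, I'm not sure. http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind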
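To illustrate the point about consecutive optional groups: they are not alternatives tried from the same starting position. Each one consumes what it can, and the next one picks up where the previous one stopped. A minimal sketch, using a much simplified surname pattern (not the full one from the question), a made-up name, and a hypothetical helper called split_surname:

```perl
use strict;
use warnings;

# Simplified surname pattern: optional space-separated parts, then optional
# hyphenated parts, then the required final part, followed by a comma.
# split_surname is a hypothetical helper name used only for this sketch.
sub split_surname {
    my ($text) = @_;
    if ($text =~ /^((?:\w+\s)+)?((?:\w+-)+)?(\w+),/) {
        # An optional group that took no part in the match is undef.
        return map { defined $_ ? $_ : '' } ($1, $2, $3);
    }
    return;
}

# The three groups each take a different stretch of the same name.
print join('|', split_surname('van der Berg-Smith, J.')), "\n";

# With a plain surname, only the required final group matches.
print join('|', split_surname('Smith, J.')), "\n";
```

The first line printed is van der |Berg-|Smith and the second is ||Smith, which shows the groups being applied one after another rather than as competing alternatives.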

#!/usr/bin/perl

use warnings;
use strict;

# Example 1: two capturing groups, both of which match.
my $line = '123 456 7antelope89';

$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;

my ($ay,$be) = (defined $1 ? $1:'nocapture ', defined $2 ? $2:'nocapture ');

print 'a: ',$ay,'b: ',$be,$/;


# Example 2: non-capturing groups only, so $1 and $2 are undefined.
$line = '123 456 7bealzelope89';

$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;

($ay,$be) = (defined $1 ? $1:'nocapture ', defined $2 ? $2:'nocapture ');

print 'a: ',$ay,'b: ',$be,$/;


# Example 3: capturing groups combined with non-capturing ones.
# [a-z]+ rather than \w+, because \w also matches digits and would
# capture "canteloupe8".
$line = '123 456 7canteloupe89';

$line =~ /((?:\d+\s\d+\s))?(?:\d+([a-z]+)\d+)?/;

($ay,$be) = (defined $1 ? $1:'nocapture ', defined $2 ? $2:'nocapture ');

print 'a: ',$ay,'b: ',$be,$/;

exit 0;

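For capturing a whole sub-pattern, the first sub-pattern in the third example makes no sense, because it tells the regex to capture the group and, at the same time, not to capture it. Where this is useful is the second sub-pattern, which is a fine-grained capture: the captured part sits inside a non-capturing group.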

a: 123 456 b: 7antelope89
a: nocapture b: nocapture 
a: 123 456 b: canteloupe
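Applied to the question itself, here is a minimal sketch of how the <BIB> prefix could be built. It assumes the line has already passed validation, and it uses deliberately simplified author and year patterns rather than the full regex from the question; bib_prefix is a made-up helper name and the sample line is shortened. One thing worth knowing: a capturing group inside a repeated group (such as the (...)+ around the author group) only keeps the text from its last repetition, so the surnames are collected in a second pass with /g.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the <BIB> prefix for one reference line.
# Both patterns are simplified stand-ins for the question's full regex.
sub bib_prefix {
    my ($line) = @_;

    # Group 1: the whole author list; group 2: a plain four-digit year.
    my ($authors, $year) =
        $line =~ /<bibliomixed\s+id="[^"]*">(.+?)\s\((\d{4})[a-z]?\)\./
        or return;

    # Second pass: each surname is the text in front of ", " plus initials.
    # In list context, /g returns every match of the capturing group.
    my @surnames =
        $authors =~ /((?:[\w']+[\s-])*[\w']+),\s(?:[A-Z]\.[-\s]?)+/g;

    return '<BIB>' . join(', ', @surnames, $year) . '</BIB>';
}

# Shortened version of the sample line from the question.
my $sample = '<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., '
           . 'Otani, F., &amp; Machado, A. (2008). Sexual satisfaction '
           . 'among patients.</bibliomixed>';

my $prefix = bib_prefix($sample);
print defined $prefix ? $prefix . $sample : "NOT FOUND: $sample", "\n";
```

For the sample line this prints <BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB> followed by the original line. In the real script the same two-pass idea would go inside the FOUND branch, with the year alternatives and surname variants taken from the full pattern.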

One small nitpick:

  id=".*?"

might be better as

  id="\w*?"

ID names need to be alphanumeric plus underscore, IIRC.

score 0 · Accepted Answer

Change this bit:

# pattern for surname----<END>
    ,?\s

This now means an optional comma followed by a space. It will not work if the person's surname is "Bunga Bunga".
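A small sketch of what that change does, using a much simplified author pattern (a one-word surname followed by initials) rather than the full regex from the question. author_ok is a hypothetical helper, and the separator is passed in so the strict and relaxed versions can be compared side by side:

```perl
use strict;
use warnings;

# Simplified author check: one-word surname, separator, then initials.
# $sep is the separator sub-pattern under test (a qr// object).
sub author_ok {
    my ($text, $sep) = @_;
    return $text =~ /^[\w']+$sep(?:[A-Z]\.\s?)+\z/ ? 1 : 0;
}

for my $author ('Rose, A. J.', 'Rose A. J.') {
    printf "%-12s strict=%d relaxed=%d\n",
        $author,
        author_ok($author, qr/,\s/),    # original: comma required
        author_ok($author, qr/,?\s/);   # changed: comma optional
}
```

With the comma optional, "Rose A. J." is accepted as well, so an entry with a missing comma would no longer be reported as NOT FOUND.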
