text - Separate a list of names into "FirstName {TAB} Lastname" pairs

Question

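Is there a specific library, algorithm, or technique (other than regular expressions) that I could use to convert/translate the following lines?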

"Acme Corporation Inc., John, Doe, F."
"Smith, Allen, Smith,Susan"
"Marshall, J., L., Johnson, H., Caruso, D., Jones, J."
"Stein, Harry, Joan, and Mike"

These lines should be converted into text containing the following:

Acme {TAB} Corporation
Doe {TAB} John
Smith {TAB} Allen
Smith {TAB} Susan
Marshall {TAB} J.
Johnson {TAB} H.
Caruso {TAB} D.
Jones {TAB} J.
Stein {TAB} Harry
Stein {TAB} Joan
Stein {TAB} Mike

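The original text contains only proper names and middle initials (D. or J.), apart from the occasional "and" separating siblings who share the same last name, as in the last line of the original text above.

Also, is this considered "named-entity recognition", or is there another technical name for this process?

Ideally, I would like code or an algorithm in a language such as Ruby/Python/Perl/PHP that can do this translation.

Any ideas? Thanks in advance.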

score 0 · Accepted Answer

This almost works:

#!/usr/bin/env perl
use strict;
use warnings;

my $tok = undef;
my @pairs = ();
my $looking_for = 'surname';

sub parse_line_to_words($){
    my $l = shift;
    my @words;
    my $word = '';
    my $start = 1;

    # remove trailing newlines
    chomp $l;
    # only chop when the last character really is a quotation mark
    if(length($l) > 0 && rindex($l, '"') == length($l) - 1){
            # remove trailing quotation mark.
            chop $l;
    }
    foreach my $c (split//,$l){
            if($c eq '"'){
                    if($#words == -1){
                            # skip leading quotation marks
                            next;
                    }
            }

            if($c eq ','){
                    push(@words, $word);
                    $word = '';
                    $start = 1;
            } else{
                    if($start && $c eq ' '){
                            next;
                    } else{
                            $start = 0;
                    }
                    $word .= $c;
            }
    }
    if($word ne ''){
            push(@words, $word);
    }
    return @words;
}
sub peek_and(@){
    foreach my $word (@_){
            return 1 if $word eq 'and'
    }
    return 0;
}
sub split_and(@){
    my @copy;
    foreach my $word (@_){
            if(index($word, 'and ', 0) == 0){
                    # token starts with 'and ': emit the marker 'and', then the name after it
                    push(@copy, 'and');
                    push(@copy, substr($word, 4));
            } else{
                    push(@copy, $word);
            }
    }
    return @copy;
}
sub count_spaces($){
    my $w = shift;
    my $s=0;
    for(my $p = index($w, ' ', 0); $p != -1; $p=index($w, ' ', $p+1), $s++) {}
    return $s;
}
sub found($$$){
    my $pairs = shift;
    push(@{$pairs}, {'surname' => shift, 'firstname' => shift});
}
while(<>){
    chomp;
    my $line = $_;
    my @words = parse_line_to_words($line);
    @words = split_and(@words);
    my $line_has_and = peek_and(@words);
    foreach my $word (@words){
            my $spaces = count_spaces($word);

            if($looking_for eq 'surname'){
                    if(index($word, '.', -1) != -1 && $spaces == 0){
                            # looks like an initial to me, skip it
                    } else{
                            if($spaces > 0){
                                    # multi-word token; must be corporation name
                                    my($f, $l) = split(/ /, $word);
                                    found(\@pairs, $f, $l);
                            } else{
                                    $tok = $word;
                                    $looking_for = 'firstname';
                            }
                    }
            } elsif ($looking_for eq 'firstname'){
                    if($line_has_and){
                            # lastname, first1, ..., firstn and firstn+1
                            if($word ne 'and'){
                                    found(\@pairs, $tok, $word);
                            }
                    } else{
                            # lastname, f. or lastname, firstname
                            found(\@pairs, $tok, $word);
                            $looking_for = 'surname';
                    }
            }
    }
    $looking_for = 'surname'; # reset for new line
}

foreach my $p (@pairs){
    printf("%s\t%s\n", $p->{'surname'}, $p->{'firstname'});
}

Actual output for the given sample input

Acme    Corporation
John    Doe
Smith   Allen
Smith   Susan
Marshall        J.
Johnson H.
Caruso  D.
Jones   J.
Stein   Harry
Stein   Joan
Stein   Mike

Discussion

I used the following heuristics:

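Leading and trailing quotation marks on a line should be ignored.
Each line can be tokenized into words as a series of comma-separated values.
If a word begins with space characters, those characters should be ignored.
In any pair of words, the first word is the surname and the second is the first name (except in the special cases below).
If a word on a line begins with 'and ', the whole line is treated specially: the first word is the surname and the remaining words are the corresponding first names.
If a surname contains more than 0 spaces, it is a company name.
A company name is always two space-separated words, which should be treated as the surname and the first name respectively.
Non-company names contain no spaces.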
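
The heuristics above can also be sketched in Python, one of the languages the question mentions. This is a loose port under stated assumptions, not a line-by-line translation of the Perl: the function names `tokenize` and `pairs_from_line` are made up for illustration, tokens are stripped on both sides (the Perl only skips leading spaces), and empty tokens are dropped.

```python
def tokenize(line):
    """Split one input line into comma-separated tokens."""
    line = line.strip()
    # Ignore one leading and one trailing quotation mark.
    if line.startswith('"'):
        line = line[1:]
    if line.endswith('"'):
        line = line[:-1]
    tokens = []
    for raw in line.split(","):
        token = raw.strip()  # assumption: strip both sides, not only leading spaces
        if not token:
            continue  # assumption: silently drop empty tokens
        if token.startswith("and "):
            # "and Mike" becomes the marker "and" followed by "Mike".
            tokens.append("and")
            tokens.append(token[4:].strip())
        else:
            tokens.append(token)
    return tokens


def pairs_from_line(line):
    """Return (surname, firstname) tuples for one line, following the heuristics."""
    tokens = tokenize(line)
    siblings = "and" in tokens  # the whole line shares one surname
    pairs = []
    surname = None  # None means the next token is expected to be a surname
    for token in tokens:
        if surname is None:
            if " " in token:
                # Multi-word token: treated as a company name, first two words only.
                first, second = token.split()[:2]
                pairs.append((first, second))
            elif token.endswith("."):
                continue  # a stray initial where a surname was expected
            else:
                surname = token
        elif siblings:
            if token != "and":
                pairs.append((surname, token))
        else:
            pairs.append((surname, token))
            surname = None
    return pairs


if __name__ == "__main__":
    sample = [
        '"Acme Corporation Inc., John, Doe, F."',
        '"Smith, Allen, Smith,Susan"',
        '"Marshall, J., L., Johnson, H., Caruso, D., Jones, J."',
        '"Stein, Harry, Joan, and Mike"',
    ]
    for line in sample:
        for surname, firstname in pairs_from_line(line):
            print(f"{surname}\t{firstname}")
```

On the four sample lines this sketch yields the same pairs as the Perl output shown earlier, including the reversed "John / Doe" pair, since it applies the same surname-first assumption.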

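Finally, the only place I use a "regular expression" is to split the company name on spaces; this can easily be replaced with a non-regex version.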
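
One way such a non-regex split might look is a plain scan for the space character, in the same spirit as the `index` loop in `count_spaces`. The sketch below is in Python rather than Perl, and `split_on_spaces` is a hypothetical helper name, not something from the answer.

```python
def split_on_spaces(text):
    """Split text on single spaces using only str.find, with no regex."""
    parts = []
    start = 0
    while True:
        pos = text.find(" ", start)
        if pos == -1:
            # No more spaces: the rest of the string is the last part.
            parts.append(text[start:])
            return parts
        parts.append(text[start:pos])
        start = pos + 1


# The company-name heuristic only uses the first two parts.
first, second = split_on_spaces("Acme Corporation Inc.")[:2]
```

In Python the built-in `str.split(" ")` does the same job without a pattern; the explicit loop is shown only to make the scan visible.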

Even with all of this, I still get "John Doe" wrong, because its names are reversed in the input. I could not devise a reliable way to detect this.
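
A partial mitigation, which the code above does not use and which is not reliable either, would be to check each pair against a list of known given names and swap only when the evidence points one way. The sketch below is in Python; `KNOWN_GIVEN_NAMES` and `maybe_swap` are hypothetical, and the tiny name set is a placeholder for whatever external list might be available.

```python
# Hypothetical, deliberately tiny placeholder list of given names (lowercase).
KNOWN_GIVEN_NAMES = {"john", "allen", "susan", "harry", "joan", "mike"}


def maybe_swap(surname, firstname, known=KNOWN_GIVEN_NAMES):
    """Swap a pair only when the 'surname' is a known given name
    and the 'firstname' is not; otherwise leave the pair untouched."""
    if surname.lower() in known and firstname.lower() not in known:
        return firstname, surname
    return surname, firstname
```

This would turn the pair ("John", "Doe") into ("Doe", "John") while leaving ("Smith", "Allen") and ("Marshall", "J.") unchanged. It still fails whenever a word is plausible both as a given name and as a surname, so it narrows the problem rather than solving it.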
