perl - Need to split a Unicode string

Question

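I am using the Moses toolkit for my translation system. I am training it on an Assamese-English parallel corpus, but some proper nouns are not translated because my corpus (parallel dataset) is very small. So I want to add a transliteration step to my translation system.

I am using this command for translation: echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini

This gives me the output "কানাদা is a vast country".

This is because the word "কানাদা" is not in my parallel corpus.

So I took a parallel list of Assamese and English words and broke each word into characters. Each line of the two files therefore holds a single word, with a space between each character (or each syllable). I trained a system on these 2 files as a normal translation task.

Then I used the following command: echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl

This gives me the output "ক া ন া দ া is a vast country".

I had to break the word up because the transliteration system was trained on characters.

Then I ran the transliteration system I had trained, using the command: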

echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini

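This gives me the output "Canada is a vast country", with the letters of the transliterated word still separated by spaces.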

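The characters are transliterated, but the only problem is the spaces inside the word. So I want to use a Perl script that joins the characters back into a word. My final command will be: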

echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini | ./join.pl

Please help me with this "join.pl" file.

score 4 · Accepted Answer

How about:

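use utf8;
use feature 'say';                    # say is not enabled by default
binmode(STDOUT, ':encoding(UTF-8)');  # print the Bengali-script text as UTF-8
my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
# put a space after each Bengali-block character that is followed by another one
$str =~ s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
say $str;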

Output:

ভ া ৰ ত is a famous country. দ ি ল ্ ল ী is the capital of ভ া ৰ ত

You can use it in your program by changing the while loop to:

while(<>) {
    s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
    print $_;
}

But I think what you actually want is this:

my %corresp = (
    'ভ' => 'Bh',
    'া' => 'a',
    'ৰ' => 'ra',
    'ত' => 't',
);
my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
$str =~ s/([\x{0980}-\x{09FF}])/exists($corresp{$1}) ? $corresp{$1} : $1/eg;
say $str;

Output:

Bharat is a famous country. দিল্লী is the capital of Bharat

Note: building the real correspondence hash is up to you. I know nothing about Assamese characters.
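The first substitution in this answer splits on code points, so vowel signs end up separated from their consonants. The question also mentions splitting by syllable; one possible approximation, sketched here under assumptions, is to split each Bengali-script word into grapheme clusters with Perl's \X. The helper name split_clusters is hypothetical, and how conjuncts (consonant + virama + consonant) are grouped depends on the Unicode version shipped with your Perl, so check the output on your own data.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: split one word into grapheme clusters (\X),
# so a consonant stays attached to its vowel sign.
sub split_clusters {
    my ($word) = @_;
    return join ' ', $word =~ /\X/g;
}

my $str = "ভাৰত is a famous country.";
# Only runs of characters from the Bengali block (U+0980..U+09FF) are split.
$str =~ s/([\x{0980}-\x{09FF}]+)/split_clusters($1)/ge;
print "$str\n";
# prints: ভা ৰ ত is a famous country.
```

If the transliteration model was trained on code points rather than clusters, keep the original substitution instead, since training and decoding must use the same segmentation.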

Answered 2013-12-24T15:47:12.250

score 1 · Accepted Answer

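It does exactly what you told it to do. @a=split('') splits the whole line; you never told it to split only the first word. You first need to identify the substring you want to split, and then split it: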

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Getopt::Std;
use IO::Handle;

binmode(STDIN,  ':utf8');
binmode(STDOUT, ':utf8');
binmode(STDERR, ':utf8');

while(<>)
{
    chomp;
    ## find the first word, capture it as $1 and delete it from the line;
    ## a line with no word at all is printed unchanged, so a stale $1 is never used
    unless (s/^\s*(\S+)\s*//) { print "$_\n"; next; }
    my @a = split('', $1);
    ## Print your joined string and the rest of the line
    print join(" ", @a) . " $_\n";
}

score 0 · Accepted Answer

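Add something like

$str =~ s/(\w) (?=[\w.,;:!?])/$1/g;

This is meant to remove the spaces between Latin word characters. It uses a lookahead. It is not 100% reliable: it removes every single space that sits between word characters, so separate words get joined as well.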
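For the join.pl step that the question asks for, one narrower heuristic is to merge only runs of two or more single-character tokens, so that ordinary multi-letter words keep their spaces. This is a minimal sketch, not a tested Moses post-processor: the helper name join_single_chars and the sample line are made up for illustration, and it assumes the transliterated word comes out as single characters separated by single spaces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: merge runs of two or more single-character
# tokens ("C a n a d a") into one word, leaving other tokens alone.
sub join_single_chars {
    my ($line) = @_;
    # A run is: a lone word character, then one or more
    # " <lone word character>" groups, bounded by whitespace or line ends.
    $line =~ s{(?<!\S)(\w(?: \w)+)(?!\S)}{
        (my $word = $1) =~ tr/ //d;
        $word;
    }ge;
    return $line;
}

# Demo on a hard-coded line. As a pipeline filter, set UTF-8 layers on
# STDIN and STDOUT with binmode and instead run:
#   print join_single_chars($_) while <STDIN>;
print join_single_chars("C a n a d a is a vast country ."), "\n";
# prints: Canada is a vast country .
```

A genuine one-letter word ("a", "I") directly next to a transliterated word would be merged into it. Marking word boundaries in space.pl and restoring them in join.pl would be more robust than guessing them afterwards.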
