perl - Non-English characters in Perl

Question

Please take a look at this Perl code:

#!/usr/bin/perl -w -CS

use feature 'unicode_strings';

open IN, "<", "wiki.txt";
open OUT, ">", "wikicorpus.txt";

binmode( IN,  ':utf8' );
binmode( OUT, ':utf8' );

## Condition plain text English sentences or word lists into a form suitable for constructing a vocabulary and language model

while (<IN>) {

  # Remove starting and trailing tags (e.g. <s>)
  # s/\<[a-z\/]+\>//g;

  # Remove ellipses 
  s/\.\.\./ /g;

  # Remove unicode 2500 (hex E2 94 80) used as something like an m-dash between words
  # Unicode 2026 (horizontal ellipsis)
  # Unicode 2013 and 2014 (en dash and em dash)
  s/[\x{2500}\x{2026}\x{2013}\x{2014}]/ /g;

  # Remove dashes surrounded by spaces (e.g. phrase - phrase)
  s/\s-+\s/ /g;

  # Remove dashes between words with no spaces (e.g. word--word)
  s/([A-Za-z0-9])\-\-([A-Za-z0-9])/$1 $2/g;

  # Remove dash at a word end (e.g. three- to five-year)
  s/(\w)-\s/$1 /g;

  # Remove some punctuation
  s/([\"\?,;:%???!()\[\]{}<>_\.])/ /g;

  # Remove quotes
  s/[\p{Initial_Punctuation}\p{Final_Punctuation}]/ /g;

  # Remove trailing space
  s/ $//;

  # Remove double single-quotes 
  s/'' / /g;
  s/ ''/ /g;

  # Replace accented e with normal e for consistency with the CMU pronunciation dictionary
  s/?/e/g;

  # Remove single quotes used as quotation marks (e.g. some 'phrase in quotes')
  s/\s'([\w\s]+[\w])'\s/ $1 /g;

  # Remove double spaces
  s/\s+/ /g;

  # Remove leading space
  s/^\s+//;

  chomp($_);

  print OUT uc($_) . "\n";
#  print uc($_) . " ";
} print OUT "\n";

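Line 49 appears to contain a non-English character, namely the line s/?/e/g;. So when I run the script, I get the warning Quantifier follows nothing in regex;.

How should I deal with this? How can I make Perl recognize the character? I have to run this code with Perl 5.10.

Another small question: what does the "-CS" on the first line mean?

Thanks, everyone.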

score 1 · Accepted Answer

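I think your problem is that your editor does not handle Unicode characters, so the program was mangled before it ever reached perl. Since this is evidently not your own program, it was probably mangled before it even got to you.

Until the whole tool chain handles Unicode correctly, you have to be careful to encode non-ASCII characters in a way that preserves them. It is a pain, and there is no simple solution. Consult your Perl manual for how to embed Unicode characters safely.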
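One way to keep non-ASCII characters intact is to keep the source file pure ASCII and write each such character as a \x{...} escape. The following is only a minimal sketch of that idea; the variable names and the sample word are made up for illustration and are not part of the script in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The source file stays pure ASCII, so no editor or web page can mangle it.
# U+00E9 is LATIN SMALL LETTER E WITH ACUTE.
my $e_acute = "\x{00E9}";

# Hypothetical sample word, built from ASCII letters plus the escape.
my $word = "caf" . $e_acute;

# Report the character count and the code point of the last character.
printf "%d characters, last code point U+%04X\n",
    length($word), ord(substr($word, -1));
```

An alternative is to save the source as UTF-8 and add "use utf8;" at the top, but that only helps if every editor and transfer step in between preserves the bytes.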

score 1 · Accepted Answer

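Judging by the comment line before the faulty line, the character to be replaced is an accented "e"; presumably an e with an acute accent: "é". Assuming your input is Unicode, it can be written in Perl as \x{00E9}. See also http://www.fileformat.info/info/unicode/char/e9/index.htm

My guess is that you copied and pasted this script from a web page on a server that was not properly configured to serve the required character encoding. See also http://en.wikipedia.org/wiki/Mojibake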
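Assuming the lost character really was "é", the substitution could be written as follows. This is a sketch with a made-up sample string standing in for a line read from the input file; inside the while loop of the original script the statement would simply be s/\x{00E9}/e/g;.

```perl
use strict;
use warnings;

# Hypothetical sample text; in the original script the line would come
# from <IN> after binmode(IN, ':utf8') has decoded it.
my $line = "caf\x{00E9} r\x{00E9}sum\x{00E9}";

# Replace e-acute (U+00E9) with a plain "e", using an ASCII-only escape.
(my $plain = $line) =~ s/\x{00E9}/e/g;

print uc($plain), "\n";   # prints "CAFE RESUME"
```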
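To illustrate how such damage can happen, here is one plausible mechanism (an assumption, not a confirmed account of what happened to this script): a lossy conversion to ASCII replaces "é" with "?", and a bare "?" is then read by the regex engine as a quantifier with nothing to quantify. The sketch uses the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The statement as it was presumably written, with e-acute as U+00E9.
my $original = "s/\x{00E9}/e/g;";

# Converting to ASCII substitutes "?" for characters ASCII cannot represent.
my $ascii = encode('ascii', $original);
print "$ascii\n";   # prints "s/?/e/g;"

# A pattern consisting of a lone "?" fails to compile.
my $pattern  = '?';
my $compiled = eval { qr/$pattern/ };
my $error    = $@;
print "regex error: $error" unless defined $compiled;
```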
