I want to use Ruby 1.9.3 to replace accented UTF-8 characters with their ASCII equivalents. For example,

Acsády  -->  Acsady

The traditional way to do this is with the Iconv library, which is part of the Ruby standard library. You can do this:

require 'iconv'
str = 'Acsády'
Iconv.iconv('ascii//TRANSLIT', 'utf8', str)

which yields

Acsa'dy

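The apostrophe then has to be removed. While this approach still works in Ruby 1.9.3, I get a warning that Iconv is deprecated and that String#encode should be used instead. However, String#encode does not offer exactly the same functionality. By default, undefined characters raise an exception, but you can handle them either by setting :undef => :replace (which replaces each undefined character with the default '?' character) or by passing the :fallback option a hash that maps undefined characters in the source encoding to strings in the target encoding. I'd like to know whether there is a standard :fallback hash in the standard library or in some gem, so that I don't have to write my own hash covering every possible accent.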
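The behaviours described above can be compared side by side. This is a minimal sketch: the method names `replace_undefined` and `map_with_fallback` are made up for illustration, and the one-entry hash is not a standard table.

```ruby
# Replace every character that has no ASCII equivalent with '?'.
def replace_undefined(str)
  str.encode('ASCII', undef: :replace)
end

# Map undefined characters through a caller-supplied hash.
# A character missing from the hash still raises
# Encoding::UndefinedConversionError.
def map_with_fallback(str, table)
  str.encode('ASCII', fallback: table)
end

puts replace_undefined('Acsády')                  # Acs?dy
puts map_with_fallback('Acsády', { 'á' => 'a' })  # Acsady
```

Calling `'Acsády'.encode('ASCII')` with no options raises Encoding::UndefinedConversionError, which is the default behaviour mentioned in the question.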

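@raina77ow: Thanks for the reply. That's exactly what I was looking for. However, after reading the thread you linked to, I realized that a better solution might be to simply match unaccented characters against their accented equivalents, the way databases do with character-set collations. Does Ruby have anything equivalent to collation?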
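One rough way to approximate an accent-insensitive comparison is to decompose both strings and drop the combining marks before comparing. This is only a sketch, not a real collation: it assumes Ruby 2.2 or later (for String#unicode_normalize, which is not available in 1.9.3), and the method names `strip_marks` and `same_ignoring_accents?` are hypothetical.

```ruby
# Decompose to NFD so that "á" becomes "a" plus a combining acute accent,
# then delete every nonspacing mark (Unicode general category Mn).
def strip_marks(s)
  s.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Compare two strings while ignoring combining accents.
def same_ignoring_accents?(a, b)
  strip_marks(a) == strip_marks(b)
end

puts same_ignoring_accents?('Acsády', 'Acsady')  # true
```

Letters that have no canonical decomposition, such as "ø" or "æ", pass through unchanged, so this does not cover everything a database collation would.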

3 Answers

I use this:

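# encoding: utf-8
# (Ruby 1.9 needs the magic comment above on the first line of the file
# because of the non-ASCII literals below.)
def convert_to_ascii(s)
  undefined = ''  # characters missing from the table are dropped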
  fallback = { 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A',
               'Å' => 'A', 'Æ' => 'AE', 'Ç' => 'C', 'È' => 'E', 'É' => 'E',
               'Ê' => 'E', 'Ë' => 'E', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I',
               'Ï' => 'I', 'Ñ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O',
               'Õ' => 'O', 'Ö' => 'O', 'Ø' => 'O', 'Ù' => 'U', 'Ú' => 'U',
               'Û' => 'U', 'Ü' => 'U', 'Ý' => 'Y', 'à' => 'a', 'á' => 'a',
               'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'æ' => 'ae',
               'ç' => 'c', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
               'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ñ' => 'n',
               'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
               'ø' => 'o', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
               'ý' => 'y', 'ÿ' => 'y' }
  s.encode('ASCII',
           fallback: lambda { |c| fallback.key?(c) ? fallback[c] : undefined })
end

You can check here for other symbols you might want to provide a fallback for.
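If more symbols are needed, one option is to let the caller merge extra entries into the table. The following is a sketch of that idea rather than part of the method above: `base_fallback`, `to_ascii`, and the `extra` parameter are hypothetical names, and the base table is shortened to three entries for brevity.

```ruby
# A deliberately tiny base table; the full table from the answer would go here.
def base_fallback
  { 'á' => 'a', 'é' => 'e', 'ö' => 'o' }
end

# Convert to ASCII, consulting the base table plus any caller-supplied
# entries. Characters found in neither table are replaced with ''.
def to_ascii(s, extra = {})
  table = base_fallback.merge(extra)
  s.encode('ASCII', fallback: lambda { |c| table.fetch(c, '') })
end

puts to_ascii('Acsády')                    # Acsady
puts to_ascii('Straße', { 'ß' => 'ss' })   # Strasse
```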

Answered 2014-04-01T09:02:18.620

I think what you're looking for is similar to this question. If so, you can use a Ruby port of Text::Unidecode, for example this gem (or this fork of it, which looks ready for use with 1.9).

Answered 2012-06-18T21:49:56.523

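The following code works for a fairly wide variety of European languages, including Greek, which is hard to get right and which the previous answers did not handle.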

# Code generated by code at https://stackoverflow.com/a/68338690/1142217
# See notes there on how to add characters to the list.
def remove_accents(s)
  return s.unicode_normalize(:nfc).tr("ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿΆΈΊΌΐάέήίΰϊϋόύώỏἀἁἂἃἄἅἆἈἉἊἌἍἎἐἑἒἓἔἕἘἙἜἝἠἡἢἣἤἥἦἧἨἩἫἬἭἮἯἰἱἲἳἴἵἶἷἸἹἼἽἾὀὁὂὃὄὅὈὉὊὋὌὍὐὑὓὔὕὖὗὙὝὠὡὢὣὤὥὦὧὨὩὫὬὭὮὯὰὲὴὶὸὺὼᾐᾑᾓᾔᾕᾖᾗᾠᾤᾦᾧᾰᾱᾳᾴᾶᾷᾸᾹῂῃῄῆῇῐῑῒῖῗῘῙῠῡῢῥῦῨῩῬῳῴῶῷῸ","AAAAAAÆCEEEEIIIINOOOOOOUUUUYaaaaaaæceeeeiiiinoooooouuuuyyΑΕΙΟιαεηιυιυουωoαααααααΑΑΑΑΑΑεεεεεεΕΕΕΕηηηηηηηηΗΗΗΗΗΗΗιιιιιιιιΙΙΙΙΙοοοοοοΟΟΟΟΟΟυυυυυυυΥΥωωωωωωωωΩΩΩΩΩΩΩαεηιουωηηηηηηηωωωωααααααΑΑηηηηηιιιιιΙΙυυυρυΥΥΡωωωωΟ")
end
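To see why the method normalizes to NFC before calling tr, here is a cut-down sketch with a hypothetical three-character table (`remove_accents_demo` is not the generated method). It assumes Ruby 2.2 or later, where String#unicode_normalize is available.

```ruby
# Hypothetical three-character version of the table, for illustration only.
def remove_accents_demo(s)
  s.unicode_normalize(:nfc).tr('áéἀ', 'aeα')
end

composed   = "\u00E1"   # "á" as a single code point
decomposed = "a\u0301"  # "a" followed by a combining acute accent

# NFC composes the second form into the first, so tr can match both.
puts remove_accents_demo(composed) == remove_accents_demo(decomposed)  # true
```

Note that Greek input stays Greek: only the diacritics are removed, so the result is not necessarily ASCII.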

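It was generated by the following long and slow program, which shells out to the Linux command-line utility "unicode". If you come across characters that are missing from this list, add them to the longer program and rerun it, and you will get code output that handles those characters. For example, I think the list is missing some characters that occur in Czech, such as a c with a wedge, and the Latin vowels with macrons. If those new characters carry accents that are not in the list below, the program will not strip them until you add the names of the new accents to names_of_accents.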

$stderr.print %q{
This program generates ruby code to strip accents from characters in Latin and Greek scripts.
Progress will be printed to stderr, the final result to stdout.
}

all_characters = %q{
         ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿ
         ΆΈΊΌΐάέήίϊόύώỏἀἁἃἄἅἈἐἑἒἔἕἘἙἜἡἢἣἤἥἦἨἩἫἬἮἰἱἲἴἵἶἸὀὁὂὃὄὅὊὍὐὑὓὔὕὖὗὝὡὢὣὤὥὧὨὩὰὲὴὶὸὺὼᾐᾗᾳᾴᾶῂῆῇῖῥῦῳῶῷῸᾤᾷἂἷ
         ὌᾖὉἧἷἂῃἌὬὉἷὉἷῃὦἌἠἳᾔἉᾦἠἳᾔὠᾓὫἝὈἭἼϋὯῴἆῒῄΰῢἆὙὮᾧὮᾕὋἍἹῬἽᾕἓἯἾᾠἎῗἾῗἯἊὭἍᾑᾰῐῠᾱῑῡᾸῘῨᾹῙῩ
}.gsub(/\s/,'')
# The first line is a list of accented Latin characters. The second and third lines are polytonic Greek.
# The Greek on this list includes every character occurring in the Project Gutenberg editions of Homer, except for some that seem to be
# mistakes (smooth rho, phi and theta in symbol font). Duplications and characters out of order in this list have no effect at run time.
# Also includes vowels with macron and vrachy, which occur in Project Perseus texts sometimes.

# The following code shells out to the linux command-line utility called "unicode," which is installed as the debian package
# of the same name.
# Documentation: https://github.com/garabik/unicode/blob/master/README

names_of_accents = %q{
  acute grave circ and rough smooth ypogegrammeni diar with macron vrachy tilde ring above diaeresis cedilla stroke
  tonos dialytika hook perispomeni dasia varia psili oxia
}.split(/\s+/).select { |x| x.length>0}.sort.uniq
# The longer "circumflex" will first be shortened to "circ" in later code.

def char_to_name(c)
  return `unicode --string "#{c}" --format "{name}"`.downcase
end

def name_to_char(name)
   list = `unicode "#{name}" --format "{pchar}" --max 0` # returns a string of possibilities, not just exact matches
   # Usually, but not always, the unaccented character is the first on the list.
   list.chars.each { |c|
     if char_to_name(c)==name then return c end
   }
   raise "Unable to convert name #{name} to a character, list=#{list}."
end

regex = "( (#{names_of_accents.join("|")}))+"
from = ''
to = ''
all_characters.chars.sort.uniq.each { |c|
  name = char_to_name(c).gsub(/circumflex/,'circ')
  name.gsub!(/#{regex}/,'')
  without_accent = name_to_char(name)
  from = from+c.unicode_normalize(:nfc)
  to = to+without_accent.unicode_normalize(:nfc)
  $stderr.print c
}
$stderr.print "\n"
print %Q{
# Code generated by code at https://stackoverflow.com/a/68338690/1142217
# See notes there on how to add characters to the list.
def remove_accents(s)
  return s.unicode_normalize(:nfc).tr("#{from}","#{to}")
end
}
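The name-stripping step in the program above can be tried in isolation, without the unicode utility. This is a sketch under that assumption: the Unicode character names are passed in by hand instead of being looked up, and `accent_regex` and `base_name` are hypothetical helper names.

```ruby
# Build the same regex as the generator: one or more " <accent name>" groups.
def accent_regex
  names = %w[
    acute grave circ and rough smooth ypogegrammeni diar with macron vrachy
    tilde ring above diaeresis cedilla stroke tonos dialytika hook
    perispomeni dasia varia psili oxia
  ].sort.uniq
  /( (#{names.join('|')}))+/
end

# Reduce a Unicode character name to the name of its unaccented base letter.
def base_name(unicode_name)
  unicode_name.downcase.gsub('circumflex', 'circ').gsub(accent_regex, '')
end

puts base_name('LATIN SMALL LETTER A WITH ACUTE')
puts base_name('GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA')
```

The generator then asks the unicode utility for the character whose name exactly equals the stripped name, which is why an accent word missing from the list leaves the character unconverted.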
Answered 2021-07-11T17:47:38.887