ruby - Preserving umlauts when using split in Ruby

Question

Why does this code (which contains an umlaut):

text = "Some super text with a german umlaut Wirtschaftsprüfer"
words = text.split(/\W+/)
words.each do |w|
  puts w
end

return this result (which does not preserve the umlaut given above):

=> Some
=> super
=> text
=> with
=> a
=> german
=> umlaut
=> Wirtschaftspr
=> fer

Is there a way to preserve umlauts when splitting a string in Ruby 1.9+?

Edit: I am using ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin11.4.2]

score 5 · Accepted Answer

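\W matches only non-word characters, i.e. it is equivalent to [^a-zA-Z0-9_]. Special characters and umlauts are therefore not treated as word characters, and the string is split on them. You can use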

words = text.split(/[^[:word:]]/)

to treat every Unicode "word" character as part of a word, or

words = text.split(/[^\p{Latin}]/)

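to treat every character in the Unicode Latin script as part of a word.
Note that both patterns will also accept special characters from other languages, not just German.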

See http://www.ruby-doc.org/core-1.9.3/Regexp.html and look for (1) "Character Classes" and (2) "Character Properties".
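
As a quick check of the two patterns above against the text from the question, here is a minimal sketch. The + quantifier is an addition that is not in the answer's patterns (it collapses runs of separators so that no empty strings appear between them), and the magic encoding comment should only matter on Ruby 1.9.

```ruby
# encoding: utf-8
text = "Some super text with a german umlaut Wirtschaftsprüfer"

# \W is ASCII-only in Ruby, so "ü" counts as a separator.
ascii_words = text.split(/\W+/)

# [[:word:]] is Unicode-aware, so "ü" stays inside the word.
unicode_words = text.split(/[^[:word:]]+/)

# \p{Latin} keeps any Latin-script letter (digits become separators).
latin_words = text.split(/[^\p{Latin}]+/)

p ascii_words.last(2)   # => ["Wirtschaftspr", "fer"]
p unicode_words.last    # => "Wirtschaftsprüfer"
p latin_words.last      # => "Wirtschaftsprüfer"
```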

score 2 · Accepted Answer


You could replace /\W+/ with /\s+/ (\s matches whitespace characters: spaces, tabs, and newlines).
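
A small sketch of what that looks like in practice. The sample sentence is made up for illustration; note that splitting on whitespace alone leaves punctuation attached to the words.

```ruby
# encoding: utf-8
# Made-up sample containing a comma, an exclamation mark and a tab.
sample = "Hallo, Herr Wirtschaftsprüfer!\tGuten Tag."

# Split on runs of whitespace only; umlauts and punctuation are untouched.
words = sample.split(/\s+/)

p words  # => ["Hallo,", "Herr", "Wirtschaftsprüfer!", "Guten", "Tag."]
```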

answered 2013-04-11T15:06:02.020

score 2 · Accepted Answer

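Why does this code [...] not preserve the umlaut given above

Because \W matches any character that is not an ASCII word character (i.e. not a-z, not A-Z, not 0-9 and not _), and ü is such a character.

Is there a way to preserve umlauts when splitting a string in Ruby 1.9+?

Sure. For example, you can split on whitespace, which is the default when no pattern is given: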

"Müllmann Straßenverkehr Wirtschaftsprüfer".split
=> ["Müllmann", "Straßenverkehr", "Wirtschaftsprüfer"]

score 1 · Accepted Answer

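/\W/ - A non-word character ([^a-zA-Z0-9_])

ü is not a word character, so \W matches there and the string is split. \p{Lu} and \p{Ll} are Ruby shorthand for Unicode uppercase and lowercase characters, so you can do this: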

text.split /[^\p{Ll}\p{Lu}]/

...and that should split even the most exotic strings.
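
A minimal sketch of how this pattern behaves, with two caveats worth knowing: digits and letters from scripts without case (which are neither \p{Ll} nor \p{Lu}) are treated as separators. The + quantifier, the sample strings, and the broader \p{L} variant are additions for illustration and are not part of the answer.

```ruby
# encoding: utf-8
cased_only = /[^\p{Ll}\p{Lu}]+/   # pattern from the answer, plus "+"
any_letter = /[^\p{L}]+/          # broader alternative, not from the answer

german = "Wirtschaftsprüfer 2013 Straße"
mixed  = "Müller 日本語"

# Digits are not letters, so " 2013 " acts as a single separator.
p german.split(cased_only)  # => ["Wirtschaftsprüfer", "Straße"]

# Ideographs have no case, so they are swallowed by the separator.
p mixed.split(cased_only)   # => ["Müller"]

# \p{L} matches any letter, cased or not.
p mixed.split(any_letter)   # => ["Müller", "日本語"]
```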

score 0 · Accepted Answer

Because you used /\W/ to split the text, which matches anything not in this list: a-zA-Z0-9_

Try splitting on

[^\w\ü]

which is

^ not in \w a-zA-Z0-9 \ü

(alternatively look at creating your own pattern which you can reuse)
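
One way to read the "create your own pattern" suggestion is to store the regexp in a constant and reuse it. This is only a sketch: the names GERMAN_WORD_SEPARATOR and german_words are hypothetical, and the list of extra letters covers German only.

```ruby
# encoding: utf-8
# Hypothetical reusable pattern: anything that is neither an ASCII word
# character nor one of the German letters listed here is a separator.
GERMAN_WORD_SEPARATOR = /[^\wäöüÄÖÜß]+/

# Hypothetical helper that splits a string with the shared pattern.
def german_words(text)
  text.split(GERMAN_WORD_SEPARATOR)
end

p german_words("Müllmann, Straßenverkehr & Wirtschaftsprüfer")
# => ["Müllmann", "Straßenverkehr", "Wirtschaftsprüfer"]
```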
