"d̪".chars.to_a
gives me
["d"," ̪"]
How do I get Ruby to split it by graphemes?
["d̪"]
Edit: As @michau's answer notes, Ruby 2.5 introduced the grapheme_clusters method, as well as each_grapheme_cluster if you only want to iterate/enumerate without necessarily creating an array.
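A minimal sketch of both methods, assuming Ruby 2.5 or later. The sample string is an assumption for illustration; it is written with the escape "\u032A" (COMBINING BRIDGE BELOW) so the combining mark is unambiguous.

```ruby
# "d\u032A" is "d" followed by U+032A COMBINING BRIDGE BELOW.
str = "d\u032Aa"

# grapheme_clusters builds the whole array at once.
clusters = str.grapheme_clusters

# each_grapheme_cluster without a block returns an Enumerator,
# so it can be chained without building the array first.
sizes = str.each_grapheme_cluster.map { |g| g.codepoints.length }

puts clusters.inspect
puts sizes.inspect
```

Here the first cluster is made of two code points and the second of one.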
In Ruby 2.0 or later you can use str.scan /\X/:
> "d̪".scan /\X/
=> ["d̪"]
> "d̪d̪d̪".scan /\X/
=> ["d̪", "d̪", "d̪"]
# Let's get crazy:
> str = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'
> str.length
=> 75
> str.scan(/\X/).length
=> 6
If you want to match grapheme boundaries for any reason, you can use (?=\X) in your regex, for instance:
> "d̪".split /(?=\X)/
=> ["d̪"]
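As a rough illustration of why splitting on \X matters, here is a sketch of grapheme-aware reversal and truncation. The helper names reverse_graphemes and first_graphemes are hypothetical, not part of Ruby.

```ruby
# Hypothetical helper: reverse a string without detaching combining marks.
def reverse_graphemes(str)
  str.scan(/\X/).reverse.join
end

# Hypothetical helper: keep only the first n graphemes.
def first_graphemes(str, n)
  str.scan(/\X/).first(n).join
end

word = "ad\u032A" # "a", then "d" + U+032A COMBINING BRIDGE BELOW

naive  = word.reverse              # the mark ends up first, detached from "d"
proper = reverse_graphemes(word)   # "d" keeps its mark
head   = first_graphemes(word, 1)

puts naive.codepoints.inspect
puts proper.codepoints.inspect
puts head
```

String#reverse works on code points, so the combining mark lands at the start of the result; the scan-based version keeps each base character together with its marks.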
ActiveSupport (which comes with Rails) also has a way, if you can't use \X for some reason:
ActiveSupport::Multibyte::Unicode.unpack_graphemes("d̪").map { |codes| codes.pack("U*") }
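If unpack_graphemes is not available in the ActiveSupport version at hand (an assumption worth checking against its changelog), the same code-point groups can be sketched in plain Ruby 2.0+ with \X. The variable names below are only for illustration.

```ruby
str = "d\u032Aa" # "d" + U+032A COMBINING BRIDGE BELOW, then "a"

# One array of code points per grapheme.
code_groups = str.scan(/\X/).map(&:codepoints)

# Packing each group with "U*" turns it back into a UTF-8 string.
round_trip = code_groups.map { |codes| codes.pack("U*") }

puts code_groups.inspect
puts round_trip.inspect
```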
Use Unicode::text_elements from unicode.gem, which is documented at http://www.yoshidam.net/unicode.txt.
irb(main):001:0> require 'unicode'
=> true
irb(main):006:0> s = "abčd̪é"
=> "abčd̪é"
irb(main):007:0> s.chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):009:0> Unicode.nfc(s).chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):010:0> Unicode.nfd(s).chars.to_a
=> ["a", "b", "c", "̌", "d", "̪", "e", "́"]
irb(main):017:0> Unicode.text_elements(s)
=> ["a", "b", "č", "d̪", "é"]
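The session above shows that normalization alone does not help here: "d̪" has no precomposed form, so even NFC leaves it as two code points. The same observation can be sketched without the gem, assuming Ruby 2.2 or later, where String#unicode_normalize is in the standard library. This swaps in the built-in method for Unicode.nfc / Unicode.nfd and \X for Unicode.text_elements.

```ruby
# "abčd̪é" written with escapes: precomposed č (U+010D) and é (U+00E9),
# and "d" followed by U+032A COMBINING BRIDGE BELOW.
s = "ab\u010Dd\u032A\u00E9"

nfd = s.unicode_normalize(:nfd)   # decomposes č and é
nfc = nfd.unicode_normalize(:nfc) # recomposes them; d + U+032A stays split

nfd_count      = nfd.chars.length
nfc_count      = nfc.chars.length
grapheme_count = nfc.scan(/\X/).length

puts [nfd_count, nfc_count, grapheme_count].inspect
```

The grapheme count is lower than the NFC character count, which is why a grapheme-aware split is needed in addition to normalization.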
The following code should work in Ruby 2.5 and later:
"d̪".grapheme_clusters # => ["d̪"]
Ruby 2.0:
str = "d̪"
char = str[/\p{M}/]
other = str[/\w/]
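In the snippet above, \p{M} matches a combining mark and \w matches the base letter. Building on that, here is a sketch that separates marks from base characters and approximates graphemes as "one non-mark followed by any marks". This pattern is a simplification and an assumption of this example; it is not the full Unicode grapheme cluster rule that \X implements.

```ruby
str = "d\u032Aa" # "d" + U+032A COMBINING BRIDGE BELOW, then "a"

marks  = str.scan(/\p{M}/)        # every combining mark
bases  = str.gsub(/\p{M}/, "")    # the string with marks stripped
approx = str.scan(/\P{M}\p{M}*/)  # base character plus its trailing marks

puts marks.length
puts bases
puts approx.inspect
```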