regex - Matching words in UTF-8 encoded string with Ruby 1.9.1

Question

I want to match every individual word in a given UTF-8 encoded string and then spellcheck each word. My code works as long as the text is English-only, but if it contains, say, German characters, words are split in two at those characters. How can I match whole words in text that contains both Latin and non-Latin characters?

What I do now is:

text.gsub(/[\w\']+/) do |word| "replacement" end

but for text containing "oooäuuu" this produces "replacementäreplacement", i.e. German characters are not treated as part of a word.

score 2 · Accepted Answer

According to the Pickaxe book, the \w character class is exactly equivalent to [A-Za-z0-9_], which does not include accented characters. Depending on your locale, the POSIX class [:alpha:] may be what you want. I think you would write /[[:alpha:]']+/, but I may be wrong about the exact syntax of the regexp.
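As a minimal sketch of the difference, assuming a Ruby version where \w is ASCII-only and the source file and string are UTF-8 (the variable names are just for illustration):

```ruby
# encoding: utf-8

text = "oooäuuu"

# \w only covers [A-Za-z0-9_] here, so the word is split around "ä".
ascii_only = text.gsub(/[\w']+/) { |word| "replacement" }

# [[:alpha:]] also matches non-ASCII letters in a UTF-8 string.
# Note that the POSIX class has to sit inside an outer bracket expression.
unicode_aware = text.gsub(/[[:alpha:]']+/) { |word| "replacement" }

puts ascii_only     # replacementäreplacement
puts unicode_aware  # replacement
```

Unlike \w, [[:alpha:]] does not match digits or underscores, which is probably what you want for spellchecking anyway.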

score 2 · Accepted Answer


This seems to work well:

/[[:word:]]+/

That was too easy ;)

answered 2010-01-12T22:58:56.080
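A short sketch of that pattern in use, assuming UTF-8 input. The sample sentence is made up, and the apostrophe is added to the class here only to mirror the pattern from the question:

```ruby
# encoding: utf-8

text = "Grüße aus München, it's 2010"

# [[:word:]] covers letters, digits and underscore, including non-ASCII letters.
words = text.scan(/[[:word:]']+/)

p words
```

Because [[:word:]] also matches digits, "2010" comes back as a word; filter such tokens out before spellchecking if that matters.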

score 0 · Accepted Answer

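Do you need an English|German|... tokenizer? Tokenization in natural languages is not as simple as looking for whitespace. For example, suppose you want to tokenize the sentence "Los Angeles is a beautiful city." If you want to look up Los Angeles in a dictionary, you should treat it as one word rather than two.

You should also handle punctuation (.;?!:), abbreviations, separators, quotation marks, contractions, etc.

Tokenization is much harder for languages such as Chinese or Japanese.

Jurafsky and Martin have a simple Perl script for English tokenization in section 3.9.1 of "Speech and Language Processing".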
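To illustrate the "Los Angeles" point, here is a rough sketch in Ruby (not the Perl script from the book). The `tokenize` helper and the `MULTIWORD_ENTRIES` list are hypothetical names invented for this example, and it ignores abbreviations, quotation marks and the other cases mentioned above:

```ruby
# encoding: utf-8

# Hypothetical multi-word dictionary entries, for illustration only; a real
# spellchecker would take these from its own dictionary.
MULTIWORD_ENTRIES = ["Los Angeles", "New York"].freeze

# Returns the tokens of text, preferring multi-word entries over single words.
def tokenize(text, multiword = MULTIWORD_ENTRIES)
  # Regexp.union escapes the strings and tries alternatives left to right,
  # so the multi-word entries win over the generic single-word pattern.
  pattern = Regexp.union(multiword + [/[[:alpha:]']+/])
  text.scan(pattern)
end

p tokenize("Los Angeles is a beautiful city.")
# ["Los Angeles", "is", "a", "beautiful", "city"]
```

This only keeps known entries together; it is nowhere near a full tokenizer.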
