1

I want to match every individual word in a given UTF-8 encoded string and then spellcheck each word. My code works for English-only text, but if the text contains, say, German characters, words are split in two at those characters. How can I match single words that contain both plain ASCII letters and non-ASCII letters such as ä?

What I do now is:

text.gsub(/[\w\']+/) do |word| "replacement" end

but for text containing "oooäuuu" this ends up as "replacementäreplacement", i.e., German characters are not treated as part of a word.

4

3 Answers

2

According to Pickaxe, the \w character class is exactly equivalent to [A-Za-z0-9_], which obviously won't include accented characters. Depending on your locale, you may find the POSIX class [:alpha:] to be what you want (I think you would use /[[:alpha:]']+/, but I may be wrong on the exact formatting of the regexp there).
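A minimal sketch of that suggestion, assuming Ruby 1.9 or later and a UTF-8 encoded string (there the POSIX bracket classes are Unicode-aware, so the locale should not matter). The helper name `mark_words` is made up for illustration and simply wraps each matched word in angle brackets:

```ruby
# Hypothetical helper: wraps every matched word in <...> so the
# word boundaries chosen by the regexp are easy to see.
def mark_words(text)
  # [[:alpha:]] matches any Unicode letter; the apostrophe is added
  # to the class so contractions such as "isn't" stay in one piece.
  text.gsub(/[[:alpha:]']+/) { |word| "<#{word}>" }
end

puts mark_words("oooäuuu isn't split")
```

With `\w` in place of `[[:alpha:]]`, the same call would break "oooäuuu" at the "ä", which is the behaviour described in the question.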

Answered 2010-01-12T12:06:15.477
2

It looks like this works just fine:

/[[:word:]]+/

That was too easy ;)
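One caveat, sketched here under the assumption of Ruby 1.9 or later with UTF-8 strings: `[[:word:]]` also matches digits and underscores, and it does not match the apostrophe, so contractions are split unless the apostrophe is added to the class as in the question's original pattern. The helper names below are invented for the example:

```ruby
# Hypothetical helpers comparing the two character classes.
def words_plain(text)
  # Unicode letters, marks, digits and underscore; no apostrophe.
  text.scan(/[[:word:]]+/)
end

def words_with_apostrophe(text)
  # Same class plus the apostrophe, mirroring /[\w\']+/ from the question.
  text.scan(/[[:word:]']+/)
end

p words_plain("oooäuuu isn't")            # "isn't" comes back as "isn" and "t"
p words_with_apostrophe("oooäuuu isn't")  # "isn't" stays whole
```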

Answered 2010-01-12T22:58:56.080
0

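Is what you need an English|German|... tokenizer? Tokenization of natural language is not as simple as looking for spaces. For example, suppose you want to tokenize this sentence: "Los Angeles is a beautiful city". If you want to look up Los Angeles in a dictionary, it should be treated as one word rather than two.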
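As a rough illustration of the Los Angeles case (not a real tokenizer), one option is to try known multi-word entries before falling back to single words. The `MULTIWORD` list and the `tokenize` helper are hypothetical; an actual spellchecker would draw such entries from its own dictionary:

```ruby
# Hypothetical list of multi-word dictionary entries.
MULTIWORD = ["Los Angeles", "New York"].freeze

def tokenize(text)
  # Regexp.union escapes the strings and tries the alternatives in
  # order, so multi-word entries win over the single-word fallback.
  pattern = Regexp.union(*MULTIWORD, /[[:alpha:]']+/)
  text.scan(pattern)
end

p tokenize("Los Angeles is a beautiful city.")
```

This sketch does not check word boundaries around the multi-word entries and ignores everything that is not a letter or an apostrophe.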

You should also handle punctuation (.;?!:), abbreviations, separators, quotation marks, contractions, etc.

Tokenization for languages such as Chinese or Japanese is much harder.

Jurafsky and Martin give a simple Perl script for English tokenization in section 3.9.1 of "Speech and Language Processing".

Answered 2010-01-12T13:43:57.053