
I'm new to Ruby, regex and Stack Overflow. xD Here's my problem:

I want to use a regex to extract phrases made up of consecutive words that contain only standard ASCII characters, separating them from the rest of a Vietnamese text.

In other words, phrases with \w characters only, for example:

Mình rất thích con Sharp này (mặc dù chưa xài bao h nhưng chỉ nghe các pác nói mình đã thấy phê lòi mắt rồi). Các bạn cho mình hỏi 1 câu (các bạn đừng chê mình ngu nhé tội nghiệp mình) : cái máy này đem sang Anh dùng mạng Vodafone là dùng vô tư ah`? Nếu dùng được bên Anh mà không phải chọc ngoáy j thì mình mua một cái

Never mind its meaning. What I want is an array of hashes holding the results, each with 2 pairs: value => the extracted phrase, starting_position => the position of its first character.

According to the example above, it should look like this: [{:value=>"con Sharp", :starting_position => 16}, {:value=>"bao h", :starting_position => blah blah}...]

This means that all words containing \W characters, such as "mình", "rất", "thích", etc. are rejected.

I tried the example above with this regex on rubular.com for Ruby 1.9.2:

\b[\w|\s]+\b

I nearly got the phrases I wanted (apart from space-only matches), but it doesn't seem to work on my own Ruby, which is also 1.9.2p290, on Windows 7 64-bit.

Any ideas would be highly appreciated. Thank you in advance.


1 Answer


According to Rubular, it looks like \w matches all ASCII letters and digits (and the underscore), but \b works for all Unicode letters. That is a bit confusing.
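A small sketch of that mismatch, assuming a UTF-8 string and Ruby's default regex options (the constant name WORD is only for illustration):

```ruby
# "này" written with an escape so that "à" is the single code point U+00E0.
WORD = "n\u00e0y"

# \w only matches ASCII word characters, so "à" is skipped.
p WORD.scan(/\w/)

# \b treats "à" as a letter, so there is no word boundary between
# "n" and "à" or between "à" and "y": nothing matches here.
p WORD.scan(/\b[a-z]+\b/i)
```

The first call should give only "n" and "y", and the second an empty array, which is why the ASCII-only letters inside a Vietnamese word are not picked up as separate words.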

What you want, however, is sequences of all-ASCII words. This should match them:

/\b[a-z]+\b(?:\s+[a-z]+)*\b/i

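Working example: http://www.rubular.com/r/1iewl7MpJe

Quick explanation:

  • \b[a-z]+\b - the first ASCII word.
  • (?:\s+[a-z]+)* - any number of spaces and words, each time at least one space and one letter.
  • \b - makes sure the last word doesn't end in the middle of another word, for example the n in "con Sharp này"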
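To see what the trailing \b buys you, here is a minimal comparison of the pattern with and without it. The constant names are made up for this sketch, and it assumes "à" is stored as the single code point U+00E0:

```ruby
# The pattern from above, and the same pattern without the final \b.
WITH_TRAILING_BOUNDARY    = /\b[a-z]+\b(?:\s+[a-z]+)*\b/i
WITHOUT_TRAILING_BOUNDARY = /\b[a-z]+\b(?:\s+[a-z]+)*/i

phrase = "con Sharp n\u00e0y"  # "con Sharp này"

# Without the final \b the match runs into the first letter of "này".
p phrase[WITHOUT_TRAILING_BOUNDARY]  # "con Sharp n"

# With it, the engine backtracks to the last complete ASCII word.
p phrase[WITH_TRAILING_BOUNDARY]     # "con Sharp"
```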

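I'm not sure about getting hashes, but you can get all the MatchData objects, similar to:
How do I get the match data for all occurrences of a Ruby regular expression in a string?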

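s = "hello !@# world how a9e you"
r = /\b[a-z]+\b(?:\s+[a-z]+)*\b/i

# Enumerate scan's matches so Regexp.last_match yields each MatchData,
# then keep the matched text and its starting offset.
matches = s.to_enum(:scan, r).map { Regexp.last_match }
           .map { |match| [match.to_s, match.begin(0)] }
p matches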

Here is an example on ideone: http://ideone.com/YRZE5
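If you do want the array of hashes from the question, the same idea can be wrapped up roughly like this. It is only a sketch: ascii_phrases and ASCII_PHRASE are names I made up, and the positions are 0-based character offsets as returned by MatchData#begin:

```ruby
# Hypothetical helper, not part of any library.
ASCII_PHRASE = /\b[a-z]+\b(?:\s+[a-z]+)*\b/i

def ascii_phrases(text)
  # to_enum(:scan, ...) lets map walk the matches while
  # Regexp.last_match exposes the MatchData of the current one.
  text.to_enum(:scan, ASCII_PHRASE).map do
    match = Regexp.last_match
    { :value => match.to_s, :starting_position => match.begin(0) }
  end
end

# Gives "hello" at 0, "world how" at 10 and "you" at 24.
p ascii_phrases("hello !@# world how a9e you")

# "Mình rất thích con Sharp này", written with escapes for the accented letters.
p ascii_phrases("M\u00ecnh r\u1ea5t th\u00edch con Sharp n\u00e0y")
```

For the Vietnamese fragment this should give "con Sharp" at position 15, counting from 0. Add 1 to the offset if you need 1-based positions like the 16 in the question.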

Answered 2012-03-30T12:17:53.063