
Let us take a Ruby array of sentences. Within the array we have:

  1. Sentences containing only words
  2. Sentences containing phone numbers
  3. Sentences containing numeric values with units of measurement
    • In this case we may have things that look like this: 1mL, 55mL, 1 mL, etc.
  4. Sentences containing quantities denoted as 1x or 5 x.

I am trying to construct a Ruby regexp for gsub or scan that cleans up the sentences array above so that each sentence keeps only its words (1), units of measurement (3), and quantities (4), while removing everything else, such as phone numbers (2) and delimiter characters such as \t.

I've got this so far:

sentences.map do |sentence|
  sentence.gsub!(/(?:(\d+)(?:[xX])|([xX])(?:\d+)[^a-zA-Z ])/, "")
end

Unfortunately, that replaces the exact opposite of what I want to replace, and it doesn't account at all for the units of measurement I want to preserve.

Example inputs and outputs:

input:     Lavender top (6 mL size preferred)
output:   Lavender top (6 mL size preferred)

input:   Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, 415-123-4567.
output: Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, .

input:   Gold top x1, Lt. Green top x 1, Lavender top x1
output: Gold top x1, Lt. Green top x 1, Lavender top x1

So, effectively: replace numbers and other non-alpha characters, but only when the numbers don't denote measurements or quantities.

I've been playing on Rubular for about 3 hours to no avail. I think I might be misunderstanding look-aheads completely, or just missing one key gotcha.

Looking forward to the regexp experts chiming in!


2 Answers


This might be a start:

input.map!{|x| x.gsub(/(?<!x\s|x)[\d-]+(?!\s?\w\w?)/i, '')}

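# /(?<!x\s|x)[\d-]+(?!\s?\w\w?)/i

# (?<!x\s|x)    Don't match if preceded by an x, or by an x and a space.
# [\d-]+        Match runs of digits and hyphens (phone numbers and similar junk).
# (?!\s?\w\w?)  Make sure the run is not followed by one or two word characters, optionally after a whitespace character (a unit such as mL, or another digit). You could be more specific here if it causes trouble.
# /expression/i Make the whole thing case-insensitive.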
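To check the expression against the three examples from the question, something like the following can be run as a script. This is only a sketch: the constant name CLEANUP and the sentences array are made up for the illustration, and it uses map with a non-destructive gsub rather than map!.

```ruby
# Pattern from this answer; CLEANUP is a name chosen for this sketch.
CLEANUP = /(?<!x\s|x)[\d-]+(?!\s?\w\w?)/i

# The three example inputs from the question.
sentences = [
  "Lavender top (6 mL size preferred)",
  "Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, 415-123-4567.",
  "Gold top x1, Lt. Green top x 1, Lavender top x1"
]

# gsub returns a new string, so the original array is left untouched.
cleaned = sentences.map { |sentence| sentence.gsub(CLEANUP, "") }

cleaned.each { |sentence| puts sentence }
```

On these inputs only the phone number is removed; the first and third sentences come back unchanged. Other data (dates, prices, part numbers) may well need a more specific lookahead.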
answered 2013-11-12T22:07:49.607

This works for your sample data, but there may be other cases it does not handle:

(?<!x\s?)\b[-.\d]+\b(?!\s*?ml)

The regex matches only 415-123-4567 in your sample data.
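To try this in Ruby itself, the pattern may need two adjustments, both of which are assumptions on my part rather than something stated in the answer: as far as I can tell Ruby's regexp engine does not accept a quantifier such as \s? inside a lookbehind, so the sketch below spells it out as the alternation x\s|x, and it adds the i flag so that the literal ml also covers mL. NUMBER_JUNK and samples are names invented for the sketch.

```ruby
# Hypothetical Ruby adaptation of the pattern above; NUMBER_JUNK is a made-up name.
# (?<!x\s|x) replaces (?<!x\s?), and /i lets "ml" match "mL" as well.
NUMBER_JUNK = /(?<!x\s|x)\b[-.\d]+\b(?!\s*ml)/i

# The three example inputs from the question.
samples = [
  "Lavender top (6 mL size preferred)",
  "Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, 415-123-4567.",
  "Gold top x1, Lt. Green top x 1, Lavender top x1"
]

# scan returns every match, so this lists everything gsub would remove.
matches = samples.flat_map { |sample| sample.scan(NUMBER_JUNK) }
p matches  # => ["415-123-4567"]
```

Because the lookahead only protects ml, other units (g, oz, cc and so on) would still lose their numbers unless they are added to it.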

answered 2013-11-13T11:23:58.363