ruby - How can I replace all non-words in a phrase, with the exception of numbers followed or preceded by characters?

Question

Let us take a Ruby array of sentences. Within the array we have:

1. Sentences containing only words
2. Sentences containing phone numbers
3. Sentences containing numeric values with units of measurement
   - In this case we may have things that look like this: 1mL, 55mL, 1 mL, etc.
4. Sentences containing quantities denoted as 1x or 5 x

I am trying to construct a Ruby regexp for gsub or scan that cleans up the sentences array above, so that each sentence keeps only its words (1), units of measurement (3), and quantities (4), while phone numbers (2) and other delimiting characters such as \t are removed.

I've got this so far:

sentences.map do |sentence|
  sentence.gsub!(/(?:(\d+)(?:[xX])|([xX])(?:\d+)[^a-zA-Z ])/, "")
end

Unfortunately, that removes the exact opposite of what I want to remove, and it doesn't account at all for the units of measurement I want to preserve.

Example inputs and outputs:

input: Lavender top (6 mL size preferred)
output: Lavender top (6 mL size preferred)

input: Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, 415-123-4567.
output: Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, .

input: Gold top x1, Lt. Green top x 1, Lavender top x1
output: Gold top x1, Lt. Green top x 1, Lavender top x1

So, effectively: remove numbers and other non-alpha characters, but only when the numbers don't denote measurements or quantities.

I've been playing on rubular for about 3 hours to no avail. I think I might be misunderstanding look-aheads completely or just missing one key gotcha moment.

Looking forward to the regexp experts chiming in!

score 0 · Accepted Answer

This might be a start:

input.map!{|x| x.gsub(/(?<!x\s|x)[\d-]+(?!\s?\w\w?)/i, '')}

# /(?<!x\s|x)[\d-]+(?!\s?\w\w?)/i

# (?<!x\s|x)     Don't match if preceded by an x or by x + space.
# [\d-]+         Match a run of digits and hyphens.
# (?!\s?\w\w?)   Make sure it is not followed by one or two word characters (e.g. a unit), optionally after a space. Here you could be more specific if it causes trouble.
# /expression/i  Make the whole thing case insensitive.
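As a quick check, here is a minimal sketch that runs the pattern above against the three sample inputs from the question. The variable names (`sentences`, `stray_number`, `cleaned`) are made up for illustration, and the case-insensitive flag is spelled out as `[xX]` so the regex can be written in extended mode with comments; otherwise it is meant to behave like the one-liner above.

```ruby
# Sample sentences copied from the question.
sentences = [
  "Lavender top (6 mL size preferred)",
  "Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, 415-123-4567.",
  "Gold top x1, Lt. Green top x 1, Lavender top x1"
]

# Extended mode: whitespace and comments inside the literal are ignored.
stray_number = /
  (?<![xX]\s|[xX])   # not preceded by "x" or by "x" plus one space
  [\d-]+             # a run of digits and hyphens
  (?!\s?\w\w?)       # not followed by a word character, optionally after one space
/x

# Non-destructive gsub always returns a string, so map yields the cleaned copies.
cleaned = sentences.map { |sentence| sentence.gsub(stray_number, "") }

puts cleaned
```

On these inputs only the phone number is removed; `6 mL`, `15 mL`, `x1`, and `x 1` are left alone. Inputs outside the sample set (for example a multi-digit quantity such as `x12`) may still need a more specific pattern.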

score 0 · Accepted Answer

This works for your sample data, but there may be other cases it doesn't handle:

(?<!x\s?)\b[-.\d]+\b(?!\s*?ml)

In your sample data, the regex matches only 415-123-4567.
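Two caveats when trying this in Ruby: the pattern as written carries no case-insensitive flag, so `ml` would not match `mL`, and as far as I know Ruby's regex engine rejects a quantifier such as `\s?` inside a lookbehind. The sketch below is an adapted variant, not the exact regex above: it assumes case-insensitive matching was intended and rewrites the lookbehind as two fixed-width alternatives. The variable names are illustrative.

```ruby
# Adapted variant (an assumption about the intent, not the original pattern):
# - the lookbehind lists "x " and "x" as separate fixed-width alternatives
# - "ml" is matched in either case via explicit character classes
number = /(?<![xX]\s|[xX])\b[-.\d]+\b(?!\s*[mM][lL])/

text = "Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. " \
       "Available from Cytogenetics, 415-123-4567."

matches = text.scan(number)      # every substring the pattern would remove
cleaned = text.gsub(number, "")  # the sentence with those substrings removed

p matches
puts cleaned
```

With this variant, `scan` finds only the phone number in the sample sentence, which agrees with the claim above.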
