Let us take a Ruby array of sentences. Within the array we have:
1. Sentences containing only words
2. Sentences containing phone numbers
3. Sentences containing numeric values with units of measurement
   - In this case we may have things that look like this: 1mL, 55mL, 1 mL, etc.
4. Sentences containing quantities denoted as 1x or 5 x.
I am trying to construct a Ruby regexp for gsub or scan that cleans up the sentences array above, so that each sentence keeps only its words (1), units of measurement (3), and quantities (4), while phone numbers (2) and delimiting characters such as \t are stripped out.
I've got this so far:
sentences.map do |sentence|
  sentence.gsub!(/(?:(\d+)(?:[xX])|([xX])(?:\d+)[^a-zA-Z ])/, "")
end
Unfortunately, that replaces the exact opposite of what I want: it strips the quantities I want to keep. It also doesn't account at all for the units of measurement I want to preserve.
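To make the problem concrete, here is a small check of what the pattern above does to two of the example sentences. The constant and variable names are only for illustration, and it uses gsub rather than gsub! so that every result is a string.

```ruby
# The pattern from the attempt above, unchanged.
ATTEMPT = /(?:(\d+)(?:[xX])|([xX])(?:\d+)[^a-zA-Z ])/

quantities = "Gold top x1, Lt. Green top x 1, Lavender top x1"
phone = "Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, 415-123-4567."

# gsub (no bang) always returns a string, so the results are easy to compare.
after_quantities = quantities.gsub(ATTEMPT, "")
after_phone = phone.gsub(ATTEMPT, "")

puts after_quantities
puts after_phone

# gsub! returns nil when nothing was replaced, so map + gsub! can yield nil entries.
bang_result = phone.dup.gsub!(ATTEMPT, "")
p bang_result
```

The first "x1," is removed together with its comma, "x 1" and the trailing "x1" survive, and the phone number is left untouched, which is the reverse of the desired result.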
Example inputs and outputs:
input: Lavender top (6 mL size preferred)
output: Lavender top (6 mL size preferred)
input: Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, 415-123-4567.
output: Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, .
input: Gold top x1, Lt. Green top x 1, Lavender top x1
output: Gold top x1, Lt. Green top x 1, Lavender top x1
So, effectively: remove numbers and other non-alpha characters, but only when the numbers don't denote measurements or quantities.
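One way to express that rule, sketched here under some assumptions, is to match the things to keep first in a capture group and let a broader alternative catch every other digit run. The unit list (mL, L, mg, g) and the names CLEANER and clean_sentence are hypothetical and only for illustration. Tabs are simply deleted, and punctuation is left alone because the example outputs keep it.

```ruby
# Assumption: the set of units is known up front. This list is a placeholder.
CLEANER = /
  (                                    # group 1: text to keep as is
    \d+(?:\.\d+)?\s?(?:mL|L|mg|g)\b    # measurement, e.g. 1mL, 55mL, 1 mL
    | \b[xX]\s?\d+\b                   # quantity, e.g. x1, x 1
    | \b\d+\s?[xX]\b                   # quantity, e.g. 1x, 5 x
  )
  | \d+(?:-\d+)*                       # any other digit run, incl. 415-123-4567
  | \t                                 # tab delimiters
/x

# Put back whatever group 1 captured; everything else matched is dropped.
def clean_sentence(sentence)
  sentence.gsub(CLEANER) { Regexp.last_match(1).to_s }
end

sentences = [
  "Lavender top (6 mL size preferred)",
  "Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, 415-123-4567.",
  "Gold top x1, Lt. Green top x 1, Lavender top x1"
]

cleaned = sentences.map { |sentence| clean_sentence(sentence) }
puts cleaned
```

Because the keep-alternatives are tried before the generic digit run at each position, no lookahead is needed. The trade-off is that any unit missing from the list has its number stripped.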
I've been playing on Rubular for about 3 hours to no avail. I think I might be misunderstanding lookaheads completely, or just missing one key insight.
Looking forward to the regexp experts chiming in!