I'm trying to come up with a simple way using Ruby to scramble (or mask) some numeric data, in order to create a dummy data set from live data. I want to keep the data as close to the original format as possible (that is, preserve all non-numeric characters). The numbers in the data correspond to individual identification numbers, which (sometimes) are keys used in a relational database. So, if the numeric string occurs more than once, I want to map it consistently to the same (ideally unique) value. Once the data has been scrambled, I don't need to be able to reverse the scrambling.
I've created a scramble function that takes a string and builds a simple hash that maps each digit to a new digit (the function only maps the numeric digits and leaves everything else as is). For added security, the key is regenerated each time the function is called, so calling it twice with the same phrase will generally produce two different results.
module HashModule
  def self.scramble(str)
    numHash = {}
    0.upto(9) do |i|
      numHash[i.to_s] = rand(10).to_s
    end
    output = String.new(str)
    output.gsub!(/\d/) do |d|
      d.replace numHash[d]
    end
    puts "Input: " + str
    puts "Hash Key: " + numHash.to_s
    puts "Output: " + output
  end
end
HashModule.scramble("56609-8 NO PCT 001")
HashModule.scramble("56609-8 NO PCT 001")
This produces the following output:
Input: 56609-8 NO PCT 001
Hash Key: {"0"=>"9", "1"=>"4", "2"=>"8",
"3"=>"9", "4"=>"4", "5"=>"8",
"6"=>"4", "7"=>"0", "8"=>"2",
"9"=>"1"}
Output: 84491-2 NO PCT 994
Input: 56609-8 NO PCT 001
Hash Key: {"0"=>"2", "1"=>"0", "2"=>"9",
"3"=>"8", "4"=>"4", "5"=>"5",
"6"=>"7", "7"=>"4", "8"=>"2",
"9"=>"0"}
Output: 57720-2 NO PCT 220
Given the data set:
PTO NO PC
R5632893423 IP
R566788882-001
NO PCT AMB PTO
NO AMB/CALL IP
A566788882
1655543AACHM IP
56664320000000
00566333-1
I first extract all the numbers to an array. Then I use the scramble function I created to create a replacement hash map, e.g.
{"5632893423"=>"5467106076", "566788882"=>"888299995",
"001"=>"225", "1655543"=>"2466605",
"56664320000000"=>"70007629999999",
"00566333"=>"00699999", "1"=>"3"}
[Incidentally, in my example, I haven't found a way to insist that the hash values are all unique, which matters when the string being mapped corresponds to a unique ID in a relational database, as described above.]
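One possible way to enforce unique values, offered as a minimal sketch rather than a definitive fix: remember every value already handed out and rescramble on a collision. The names `unique_scramble_map` and `max_attempts` are hypothetical (they are not part of the code in this post), and `String#tr` stands in for the `gsub!` loop. The attempt limit is there because short numbers have only a few possible replacements.

```ruby
require 'set'

# Hypothetical helper: build a number => scrambled-number hash in which
# no two numbers share the same scrambled value.
def unique_scramble_map(numbers, max_attempts: 1000)
  used = Set.new
  numbers.uniq.each_with_object({}) do |number, mapping|
    candidate = nil
    max_attempts.times do
      # Fresh random digit key per attempt, e.g. "9489484021"
      key = Array.new(10) { rand(10).to_s }.join
      candidate = number.tr('0123456789', key)
      # Set#add? returns nil when the value is already present
      break if used.add?(candidate)
      candidate = nil
    end
    raise "no unused value found for #{number}" if candidate.nil?
    mapping[number] = candidate
  end
end

numbers = %w[5632893423 566788882 001 1655543 56664320000000 00566333 1]
mapping = unique_scramble_map(numbers)
puts mapping
```

Each scrambled value keeps the length of the original number, so the surrounding format is preserved.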
I use gsub on my original string and replace the hash keys with the scrambled values. The code I have works, but I'm curious to learn how I can make it more concise. I realize that by regenerating the key each time the function is called, I create extra work. (Otherwise, I could just create one key to replace all digits.)
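For comparison, the single-key alternative mentioned above can be sketched as follows, assuming the key is a random permutation of the ten digits. `DIGITS` and `scramble_with_key` are hypothetical names used only for illustration. Because a permutation is one-to-one, distinct numbers always map to distinct values and repeated numbers map consistently; the trade-off is that one fixed substitution hides less than a key regenerated per number.

```ruby
DIGITS = '0123456789'.freeze

# Hypothetical helper: replace every digit in text using one fixed key,
# where key[i] is the replacement for digit i. Non-digits are untouched.
def scramble_with_key(text, key)
  text.tr(DIGITS, key)
end

# One random permutation of the digits, generated once and reused
key = DIGITS.chars.shuffle.join

sample = "R566788882-001\nA566788882\n00566333-1\n"
puts scramble_with_key(sample, key)
```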
Does anyone have suggestions for how I can accomplish this another way? (I'm new to Ruby, so suggestions for improving my code are also gratefully received.)
input = <<EOS
PTO NO PC
R5632893423 IP
R566788882-001
NO PCT AMB PTO
NO AMB/CALL IP
A566788882
1655543AACHM IP
56664320000000
00566333-1
EOS
module HashModule
  def self.scramble(str)
    numHash = {}
    0.upto(9) do |i|
      numHash[i.to_s] = rand(10).to_s
    end
    output = String.new(str)
    output.gsub!(/\d/) do |d|
      d.replace numHash[d]
    end
    return output
  end
end
# Extract unique non-null numbers from the input file
numbers = input.split(/[^\d]/).uniq.reject{ |e| e.empty? }
# Create a hash that maps each number to a scrambled value
# Using the function defined above
mapper = {}
numbers.each { |x| mapper[x] = HashModule.scramble(x) }
# Create a regexp to find all numbers in input file
re = Regexp.new(mapper.keys.map { |x| Regexp.escape(x) }.join('|'))
# Replace numbers with scrambled values
puts input.gsub(re, mapper)
The above code produces the following output:
PTO NO PC
R7834913043 IP
R799922223-772
NO PCT AMB PTO
NO AMB/CALL IP
A799922223
6955509AACHM IP
13330271111111
66166777-6
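The extract, map, and replace steps above could also be collapsed into a single `gsub` over each run of digits, with a hash whose default block scrambles a number the first time it is seen. This is a sketch under assumptions: `scramble` below is a hypothetical stand-in for `HashModule.scramble` written with `String#tr`, and the input is a shortened sample of the data set.

```ruby
# Hypothetical stand-in for HashModule.scramble: a fresh random digit
# key on every call, applied to the digits of one number.
def scramble(number)
  key = Array.new(10) { rand(10).to_s }.join
  number.tr('0123456789', key)
end

# The default block runs once per unseen number and caches the result,
# so a repeated number always gets the same scrambled value.
mapper = Hash.new { |hash, number| hash[number] = scramble(number) }

input = "R566788882-001\nA566788882\n00566333-1\n"

# /\d+/ matches each maximal run of digits; everything else is kept as is
output = input.gsub(/\d+/) { |number| mapper[number] }
puts output
```

Matching with `/\d+/` also removes the need to build an alternation regexp from the hash keys.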