I'm trying to come up with a simple way using Ruby to scramble (or mask) some numeric data, in order to create a dummy data set from live data. I want to keep the data as close to the original format as possible (that is, preserve all non-numeric characters). The numbers in the data correspond to individual identification numbers, which (sometimes) are keys used in a relational database. So, if the numeric string occurs more than once, I want to map it consistently to the same (ideally unique) value. Once the data has been scrambled, I don't need to be able to reverse the scrambling.

I've created a scramble function that takes a string and generates a simple hash to map digits to new values (the function only maps the numeric digits and leaves everything else as is). For added security, the key is regenerated each time the function is called. Thus, the same phrase will produce a different result on each call.

module HashModule
  def self.scramble(str)
    numHash ={}
    0.upto(9) do |i|
      numHash[i.to_s]=rand(10).to_s
    end

    output= String.new(str)
    output.gsub!(/\d/) do|d|
      d.replace numHash[d]
    end

    puts "Input: " + str
    puts "Hash Key: " + numHash.to_s
    puts "Output: " + output
  end
end

HashModule.scramble("56609-8 NO PCT 001")
HashModule.scramble("56609-8 NO PCT 001")

This produces the following output:

Input: 56609-8 NO PCT 001
Hash Key: {"0"=>"9", "1"=>"4", "2"=>"8", 
           "3"=>"9", "4"=>"4", "5"=>"8", 
           "6"=>"4", "7"=>"0", "8"=>"2", 
           "9"=>"1"}
Output: 84491-2 NO PCT 994

Input: 56609-8 NO PCT 001
Hash Key: {"0"=>"2", "1"=>"0", "2"=>"9", 
           "3"=>"8", "4"=>"4", "5"=>"5", 
           "6"=>"7", "7"=>"4", "8"=>"2", 
           "9"=>"0"}
Output: 57720-2 NO PCT 220

Given the data set:

PTO NO PC
R5632893423 IP
R566788882-001
NO PCT AMB PTO
NO AMB/CALL IP
A566788882
1655543AACHM IP
56664320000000
00566333-1

I first extract all the numbers to an array. Then I use the scramble function I created to create a replacement hash map, e.g.

 {"5632893423"=>"5467106076", "566788882"=>"888299995", 
  "001"=>"225", "1655543"=>"2466605", 
  "56664320000000"=>"70007629999999", 
  "00566333"=>"00699999", "1"=>"3"}

[Incidentally, in my example, I haven't found a way to guarantee that the hash values are all unique, which matters when the string being mapped corresponds to a unique ID in a relational database, as described above.]

I use gsub on my original string and replace the hash keys with the scrambled value. The code I have works, but I'm curious to learn how I can make it more concise. I realize by regenerating the key each time the function is called, I create extra work. (Otherwise, I could just create one key to replace all digits).

Does anyone have suggestions for how I can accomplish this another way? (I'm new to Ruby, so suggestions for improving my code are also gratefully received.)

input = <<EOS
PTO NO PC
R5632893423 IP
R566788882-001
NO PCT AMB PTO
NO AMB/CALL IP
A566788882
1655543AACHM IP
56664320000000
00566333-1
EOS

module HashModule
  def self.scramble(str)
    numHash ={}
    0.upto(9) do |i|
      numHash[i.to_s]=rand(10).to_s
    end

    output= String.new(str)
    output.gsub!(/\d/) do|d|
      d.replace numHash[d]
    end
    return output
  end
end

# Extract unique non-null numbers from the input file
numbers = input.split(/[^\d]/).uniq.reject{ |e| e.empty? }

# Create a hash that maps each number to a scrambled value
# Using the function defined above

mapper ={}
numbers.map(&:to_s).each {|x| mapper[x]=HashModule.scramble(x)}

# Create a regexp to find all numbers in input file
re = Regexp.new(mapper.keys.map { |x| Regexp.escape(x) }.join('|'))

# Replace numbers with scrambled values
puts input.gsub(re, mapper)

The above code produces the following output:

PTO NO PC
R7834913043 IP
R799922223-772
NO PCT AMB PTO
NO AMB/CALL IP
A799922223
6955509AACHM IP
13330271111111
66166777-6
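
On the uniqueness concern in the bracketed note above, one possible approach is to map each whole digit run (rather than each digit) to a random digit string of the same length, and to retry whenever a value has already been handed out. The following is only a sketch under that assumption; `build_unique_mapper` and its `rng` parameter are hypothetical names, not part of the code above. Note that a number may, by chance, be mapped to itself.

```ruby
require 'set'

# Hypothetical helper: maps every distinct digit run in `text` to a random,
# not-yet-used digit string of the same length.
def build_unique_mapper(text, rng: Random.new)
  used = Set.new
  text.scan(/\d+/).uniq.each_with_object({}) do |number, mapper|
    candidate = nil
    loop do
      candidate = Array.new(number.length) { rng.rand(10).to_s }.join
      # Set#add? returns nil when the value is already present, so we retry.
      break if used.add?(candidate)
    end
    mapper[number] = candidate
  end
end

sample = "R566788882-001\nA566788882\n00566333-1\n"
mapper = build_unique_mapper(sample)

# Every occurrence of the same number is replaced by the same value.
puts sample.gsub(/\d+/) { |n| mapper[n] }
```

Because the replacement is looked up per digit run with `/\d+/`, there is no need to build an alternation regexp from the hash keys.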

2 Answers

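In addition to @sawa's excellent answer, I would also suggest "injecting" this scramble method directly into the String class (making str.scramble available project-wide with no extra ceremony):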

class String
  @@ScrambleKey = Hash[(0..9).map(&:to_s).zip((0..9).to_a.shuffle)]
  def scramble ; self.gsub(/\d/) { @@ScrambleKey[$&] } end
end

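This implementation introduces a class variable rather than an instance variable. If you need the ScrambleKey to differ from string to string, use an instance variable instead.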

With that in place:

input = <<EOS
PTO NO PC
R5632893423 IP
R566788882-001
NO PCT AMB PTO
NO AMB/CALL IP
A566788882
1655543AACHM IP
56664320000000
00566333-1
EOS

puts input.scramble

gives:

PTO NO PC
R1548024784 IP
R155600008-339
NO PCT AMB PTO
NO AMB/CALL IP
A155600008
9511174AACHM IP
15557483333333
33155444-9
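
On the remark above about using an instance variable when the key should vary, one way this might look is a small object that owns its key, which also avoids reopening String. This is a sketch only: the `Scrambler` class and its `random:` argument are hypothetical names introduced for illustration.

```ruby
# Hypothetical Scrambler class: each instance holds its own digit permutation
# in an instance variable, so different instances scramble differently.
class Scrambler
  DIGITS = ('0'..'9').to_a.freeze

  def initialize(random: Random.new)
    # Shuffling the digits gives a one-to-one mapping of digits to digits.
    @key = DIGITS.zip(DIGITS.shuffle(random: random)).to_h
  end

  def scramble(str)
    # The hash form of gsub looks each matched digit up in @key.
    str.gsub(/\d/, @key)
  end
end

scrambler = Scrambler.new
puts scrambler.scramble("R566788882-001")
puts scrambler.scramble("A566788882")   # same digits map the same way
```

A single instance should be reused for a whole data set so that repeated IDs stay consistent; a new instance gives a new key.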
answered 2013-02-11T09:27:41.500

Maybe something like this:

module HashModule
  ScrambleKey = Hash[(0..9).map(&:to_s).zip((0..9).to_a.shuffle)]
  def self.scramble(str); str.gsub(/\d/){ScrambleKey[$&]} end
end

puts HashModule.scramble(input)

This gives:

PTO NO PC
R6907580170 IP
R699455557-223
NO PCT AMB PTO
NO AMB/CALL IP
A699455557
3966610AACHM IP
69991072222222
22699000-3
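
Since the key is built by shuffling the ten digits, it is a permutation, so two different numbers can never be scrambled to the same value and lengths are preserved. A quick check of that property, as a sketch (the `ids` list reuses the numbers from the question; `scramble` here is a local lambda rather than the module method above):

```ruby
# Same construction as the key above, but with string values on both sides.
digits = ('0'..'9').to_a
key = digits.zip(digits.shuffle).to_h

scramble = ->(str) { str.gsub(/\d/, key) }

ids = %w[5632893423 566788882 001 1655543 56664320000000 00566333 1]
scrambled = ids.map(&scramble)

# Distinct inputs stay distinct because the digit mapping is one-to-one.
puts scrambled.uniq.size == ids.size
```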
answered 2013-02-11T08:50:27.877