Two files with the same structure (first field = unique field/index)

File X 
1,'a1','b1'
2,'a2','b20'
3,'a3','b3'
4,'a4','b4'

File Y
1,'a1','b1'
2,'a2','b2'
3,'a30','b3'
5,'a5','b5'

Goal: identify differences between these files. There are a lot of fields to compare in each file.

Requested output (maybe there is a better way to present it):

Index   X:a   X:b   Y:a   Y:b   Result
=====   ====  ====  ====  ====  ======
1       a1    b1    a1    b1    No diff
2       a2    b20   a2    b2    Diff in field b (Xb=b20, Yb=b2)
3       a3    b3    a30   b3    Diff in field a (Xa=a3, Ya=a30)
4       a4    b4    null  null  missing entries in file Y
5       null  null  a5    b5    missing entries in file X

Ruby code - what I have so far:

x = [[1,'a1','b1'], [2,'a2','b20'], [3,  'a3', 'b3'], [4, 'a4', 'b4']]
y = [[1,'a1','b1'], [2,'a2','b2'],  [3, 'a30', 'b3'], [5, 'a5', 'b5']]

h = Hash.new(0)

x.each {|e|
  h[e[0]] = 1
  }
y.each {|e|
  h[e[0]] = 1
  }

x.each {|e|
  p e[0]
}

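I already have all the keys (indexes) from both arrays in the hash h. It looks like some kind of SQL join using the index as the common key. Can you give me some direction on how to iterate over both arrays to find the differences?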


1 Answer


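Comparing two files is an old problem. Back in the punch-card days, 40 years ago, we already had to solve it to print the invoices for the items sold each day. One file was the customer file (the primary file); the second was the deck of cards punched from the delivery notes (the secondary file). Each record (card) in this secondary file contained the customer number and the item number. Both files were sorted by customer number, and the algorithm was called matching. It consists of reading one record from each file, comparing the common key, and selecting one of three possible cases: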

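  1. primary key < secondary key: skip this customer (normal: the customer file holds more customers than there are sales today)
    read the next primary record
  2. primary key = secondary key: print the invoice
    read the next customer record
    read and print items from the secondary file until the customer number changes
  3. primary key > secondary key: a typo in the secondary file, or a new customer not yet added to the customer file
    print an error message (not a valid customer)
    read the next secondary record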

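The read loop continues as long as there are records to read, i.e. until both files are at EOF (end of file). The core part of a bigger matching module I wrote in Ruby is: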

def matching(p_actionSmaller, p_actionEqual, p_actionGreater)
    read_primary
    read_secondary

    while ! @eof_primary || ! @eof_secondary
        case
        when @primary_key < @secondary_key
            p_actionSmaller.call(self)
            read_primary
        when @primary_key == @secondary_key
            p_actionEqual.call(self)
            read_primary
            read_secondary
        when @primary_key > @secondary_key
            p_actionGreater.call(self)
            read_secondary
        end
    end
end
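
To make the method above easier to try out, here is a minimal self-contained sketch. The class name `Matcher`, its constructor, the `attr_reader`s and the use of `Float::INFINITY` as the end-of-file sentinel are assumptions made for illustration (the bigger module is not shown here), and the keys are assumed to be integers.

```ruby
# Hypothetical wrapper: only matching, read_primary and read_secondary
# are named in the excerpt above; everything else is assumed.
class Matcher
  HIGH_VALUE = Float::INFINITY # sentinel that sorts after every integer key

  attr_reader :primary_key, :secondary_key

  def initialize(primary_keys, secondary_keys)
    @primary   = primary_keys.sort
    @secondary = secondary_keys.sort
    @p_index   = -1
    @s_index   = -1
  end

  # Advance in the primary "file"; at EOF the key becomes HIGH_VALUE.
  def read_primary
    @p_index += 1
    @eof_primary = @p_index >= @primary.length
    @primary_key = @eof_primary ? HIGH_VALUE : @primary[@p_index]
  end

  # Same for the secondary "file".
  def read_secondary
    @s_index += 1
    @eof_secondary = @s_index >= @secondary.length
    @secondary_key = @eof_secondary ? HIGH_VALUE : @secondary[@s_index]
  end

  def matching(p_action_smaller, p_action_equal, p_action_greater)
    read_primary
    read_secondary

    # Loop until both files are at EOF.
    while !@eof_primary || !@eof_secondary
      if @primary_key < @secondary_key
        p_action_smaller.call(self)
        read_primary
      elsif @primary_key == @secondary_key
        p_action_equal.call(self)
        read_primary
        read_secondary
      else
        p_action_greater.call(self)
        read_secondary
      end
    end
  end
end

# Usage: each action receives the matcher and records what it saw.
events = []
matcher = Matcher.new([2, 3, 4], [1, 2, 3, 5])
matcher.matching(
  ->(m) { events << [:only_in_primary, m.primary_key] },
  ->(m) { events << [:in_both, m.primary_key] },
  ->(m) { events << [:only_in_secondary, m.secondary_key] }
)
p events
```

Passing the three actions as lambdas keeps the loop itself free of any knowledge about what "smaller", "equal" and "greater" should do for a given pair of files.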

Here is a simplified version adapted to your array problem:

# input "files" :
x = [               [2,'a2','b20'], [3,  'a3', 'b3'], [4,'a4','b4']                 ]
y = [[1,'a1','b1'], [2,'a2','b2' ], [3, 'a30', 'b3'],                [5, 'a5', 'b5']]
puts '--- input --- :'
print 'x='; p x
print 'y='; p y

xh = Hash.new
yh = Hash.new

# converted to hash for easy extraction of data :
x.each do |a|
    key, *value = a
    xh[key] = value
end

y.each do |a|
    key, *value = a
    yh[key] = value
end

puts '--- as hash --- :'
print 'xh='; p xh
print 'yh='; p yh

# sort keys for matching both "files" on the same key :
@xkeys = xh.keys.sort
@ykeys = yh.keys.sort

print '@xkeys='; p @xkeys
print '@ykeys='; p @ykeys

# simplified algorithm, where EOF is replaced by HIGH_VALUE :
@x_index = -1
@y_index = -1
HIGH_VALUE = 255 # sentinel; must be greater than every real key

def read_primary
    @x_index += 1 # read next record
        # The primary key is extracted from the record.
        # At EOF it is replaced by HIGH_VALUE, usually x'FFFFFF'
    @primary_key = @xkeys[@x_index] || HIGH_VALUE
        # @xkeys[@x_index] returns nil if key does not exist, nil || H returns H
end

def read_secondary
    @y_index += 1
    @secondary_key = @ykeys[@y_index] || HIGH_VALUE
end

puts '--- matching --- :'
read_primary
read_secondary

while @x_index < @xkeys.length || @y_index < @ykeys.length
    case
    when @primary_key < @secondary_key
        puts "case < : #{@primary_key} < #{@secondary_key}"
        puts "x #{xh[@primary_key].inspect} has no equivalent in y"
        read_primary
    when @primary_key == @secondary_key
        puts "case = : #{@primary_key} = #{@secondary_key}"
        puts "compare #{xh[@primary_key].inspect} with #{yh[@primary_key].inspect}"
        read_primary
        read_secondary
    when @primary_key > @secondary_key
        puts "case > : #{@primary_key} > #{@secondary_key}"
        puts "y #{yh[@secondary_key].inspect} has no equivalent in x"
        read_secondary
    end
end

Execution:

$ ruby -w t.rb
--- input --- :
x=[[2, "a2", "b20"], [3, "a3", "b3"], [4, "a4", "b4"]]
y=[[1, "a1", "b1"], [2, "a2", "b2"], [3, "a30", "b3"], [5, "a5", "b5"]]
--- as hash --- :
xh={2=>["a2", "b20"], 3=>["a3", "b3"], 4=>["a4", "b4"]}
yh={5=>["a5", "b5"], 1=>["a1", "b1"], 2=>["a2", "b2"], 3=>["a30", "b3"]}
@xkeys=[2, 3, 4]
@ykeys=[1, 2, 3, 5]
--- matching --- :
case > : 2 > 1
y ["a1", "b1"] has no equivalent in x
case = : 2 = 2
compare ["a2", "b20"] with ["a2", "b2"]
case = : 3 = 3
compare ["a3", "b3"] with ["a30", "b3"]
case < : 4 < 5
x ["a4", "b4"] has no equivalent in y
case > : 255 > 5
y ["a5", "b5"] has no equivalent in x

I leave the presentation of the differences to you.
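
As one possible starting point for that presentation, here is a rough sketch that produces the text of the Result column from the question. The helper name `describe` and the `FIELDS` labels are assumptions taken from the question's table header, not part of the code above. For brevity it walks the sorted union of the hash keys instead of the matching loop; `describe` could equally be called from the three branches of that loop.

```ruby
# Field labels, assumed from the header of the requested output.
FIELDS = %w[a b]

# Builds the Result text from the value arrays of x and y.
# Either argument may be nil when the key is missing in that file.
def describe(x_values, y_values)
  return 'missing entries in file X' if x_values.nil?
  return 'missing entries in file Y' if y_values.nil?

  # Positions of the fields whose values differ.
  diffs = FIELDS.each_index.select { |i| x_values[i] != y_values[i] }
  return 'No diff' if diffs.empty?

  diffs.map { |i|
    f = FIELDS[i]
    "Diff in field #{f} (X#{f}=#{x_values[i]}, Y#{f}=#{y_values[i]})"
  }.join('; ')
end

x = [[1, 'a1', 'b1'], [2, 'a2', 'b20'], [3, 'a3', 'b3'], [4, 'a4', 'b4']]
y = [[1, 'a1', 'b1'], [2, 'a2', 'b2'], [3, 'a30', 'b3'], [5, 'a5', 'b5']]

# Same conversion as above: first element is the key, the rest the values.
xh = x.map { |key, *values| [key, values] }.to_h
yh = y.map { |key, *values| [key, values] }.to_h

report = (xh.keys | yh.keys).sort.map do |key|
  [key, describe(xh[key], yh[key])]
end

report.each { |key, result| puts format('%-7s %s', key, result) }
```

With many fields per record, only `FIELDS` needs to grow; the comparison itself stays the same.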
HTH

Answered 2013-01-31T15:01:25.890