ruby - Remove duplicate text from multiple strings

Question

I have:

a = "This is Product A with property B and propery C. Buy it now!"
b = "This is Product B with property X and propery Y. Buy it now!"
c = "This is Product C having no properties. Buy it now!"

I'm looking for an algorithm that can do this:

> magic(a, b, c)
=> ['A with property B and propery C', 
    'B with property X and propery Y', 
    'C having no properties']

I have to find the duplicated parts in more than 1000 texts. Top performance is not a must, but it would be nice.

- Update

I'm looking for sequences of words. So if:

d = 'This is Product D with text engraving: "Buy". Buy it now!'

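The first "Buy" should not be treated as a duplicate. I guess I have to use a threshold of n words before a sequence counts as a duplicate.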

score 3 · Accepted Answer

def common_prefix_length(*args)
  first = args.shift
  (0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } }
end

def magic(*args)
  i = common_prefix_length(*args)
  args = args.map { |a| a[i..-1].reverse }
  i = common_prefix_length(*args)
  args.map { |a| a[i..-1].reverse }
end

a = "This is Product A with property B and propery C. Buy it now!"
b = "This is Product B with property X and propery Y. Buy it now!"
c = "This is Product C having no properties. Buy it now!"

magic(a,b,c)
# => ["A with property B and propery C",
#     "B with property X and propery Y",
#     "C having no properties"]
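
The code above compares the strings character by character. For the word-sequence requirement in the question's update, a minimal word-level sketch of the same prefix/suffix idea could look like the following. The names `common_word_prefix_length` and `magic_words` and the `min_words` threshold parameter are assumptions made for illustration, not part of the answer. Because words are split on whitespace, punctuation stays attached to its word (so the trailing "." survives) and runs of whitespace are normalized to single spaces.

```ruby
# Hypothetical word-level variant; names and `min_words` are illustrative assumptions.
def common_word_prefix_length(word_lists)
  shortest = word_lists.map(&:size).min || 0
  i = 0
  # Advance while every list has the same word at position i.
  i += 1 while i < shortest && word_lists.map { |w| w[i] }.uniq.size == 1
  i
end

def magic_words(texts, min_words: 1)
  lists = texts.map(&:split)
  pl = common_word_prefix_length(lists)
  pl = 0 if pl < min_words              # shared run too short to count as a duplicate
  lists = lists.map { |w| w.drop(pl) }
  sl = common_word_prefix_length(lists.map(&:reverse))
  sl = 0 if sl < min_words
  lists.map { |w| w.take(w.size - sl).join(" ") }
end

a = "This is Product A with property B and propery C. Buy it now!"
d = 'This is Product D with text engraving: "Buy". Buy it now!'
magic_words([a, d])
# => ["A with property B and propery C.", "D with text engraving: \"Buy\"."]
```

With `min_words: 2`, a shared run of a single word at either end is left in place, which is one possible reading of the "threshold of n words" mentioned in the question.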

score 3 · Accepted Answer

Your data

sentences = [ 
  "This is Product A with property B and propery C. Buy it now!",
  "This is Product B with property X and propery Y. Buy it now!",
  "This is Product C having no properties. Buy it now!"
]

Your magic

def magic(data)
  prefix, postfix = 0, -1
  data.map{ |d| d[prefix] }.uniq.compact.size == 1 && prefix += 1 or break  while true
  data.map{ |d| d[postfix] }.uniq.compact.size == 1 && prefix > -postfix && postfix -= 1 or break  while true
  data.map{ |d| d[prefix..postfix] }
end

Your output

magic(sentences)
#=> [
#=>   "A with property B and propery C",
#=>   "B with property X and propery Y",
#=>   "C having no properties"
#=> ]

Or you can use loop instead of while true

def magic(data)
  prefix, postfix = 0, -1
  loop{ data.map{ |d| d[prefix] }.uniq.compact.size == 1 && prefix += 1 or break }
  loop{ data.map{ |d| d[postfix] }.uniq.compact.size == 1 && prefix > -postfix && postfix -= 1 or break }
  data.map{ |d| d[prefix..postfix] }
end

score 0 · Accepted Answer

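Edit: This code has a bug. I'm leaving my answer up for reference, because I don't like it when people delete their answers after being downvoted. Everybody makes mistakes :-)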

I like @falsetru's approach, but found the code overly complex. Here's my attempt:

def common_prefix_length(strings)
  i = 0
  i += 1 while strings.map{|s| s[i] }.uniq.size == 1
  i
end

def common_suffix_length(strings)
  common_prefix_length(strings.map(&:reverse))
end

def uncommon_infixes(strings)
  pl = common_prefix_length(strings)
  sl = common_suffix_length(strings)
  strings.map{|s| s[pl...-sl] }
end
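
The edit note does not say which bug is meant. Two failure modes can be reproduced with the code as written: when the strings share no common suffix, `sl` is 0 and `s[pl...-sl]` becomes `s[pl...0]`, which returns an empty string; and when all strings are identical, the `while` loop never terminates because every `s[i]` eventually becomes `nil`. A minimal sketch of a guarded variant follows. The names `safe_common_prefix_length` and `safe_uncommon_infixes` are hypothetical and not from the answer.

```ruby
# Hypothetical guarded variant of the methods above (names are illustrative).
def safe_common_prefix_length(strings)
  return 0 if strings.empty?
  shortest = strings.map(&:size).min
  i = 0
  # Stop at the end of the shortest string so the loop always terminates.
  i += 1 while i < shortest && strings.map { |s| s[i] }.uniq.size == 1
  i
end

def safe_uncommon_infixes(strings)
  pl = safe_common_prefix_length(strings)
  # Measure the suffix on what remains after the prefix, so the two cannot overlap.
  rests = strings.map { |s| s[pl..-1] }
  sl = safe_common_prefix_length(rests.map(&:reverse))
  # An explicit length avoids the negative end index, which breaks when sl == 0.
  rests.map { |s| s[0, s.size - sl] }
end

safe_uncommon_infixes(["Product A", "Product B"])
# => ["A", "B"]   (the original uncommon_infixes returns ["", ""] here)
```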

Since the OP may care about performance, I ran a quick benchmark:

require 'fruity'
require 'securerandom'

prefix = 'PREFIX '
suffix = ' SUFFIX'
test_data = Array.new(1000) do
  prefix + SecureRandom.hex + suffix
end

def fl00r_meth(data)
  prefix, postfix = 0, -1
  data.map{ |d| d[prefix] }.uniq.size == 1 && prefix += 1 or break  while true
  data.map{ |d| d[postfix] }.uniq.size == 1 && postfix -= 1 or break  while true
  data.map{ |d| d[prefix..postfix] }
end

def falsetru_common_prefix_length(*args)
  first = args.shift
  (0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } }
end

def falsetru_meth(*args)
  i = falsetru_common_prefix_length(*args)
  args = args.map { |a| a[i..-1].reverse }
  i = falsetru_common_prefix_length(*args)
  args.map { |a| a[i..-1].reverse }
end

def padde_common_prefix_length(strings)
  i = 0
  i += 1 while strings.map{|s| s[i] }.uniq.size == 1
  i
end

def padde_common_suffix_length(strings)
  padde_common_prefix_length(strings.map(&:reverse))
end

def padde_meth(strings)
  pl = padde_common_prefix_length(strings)
  sl = padde_common_suffix_length(strings)
  strings.map{|s| s[pl...-sl] }
end

compare do
  fl00r do
    fl00r_meth(test_data.dup)
  end

  falsetru do
    falsetru_meth(*test_data.dup)
  end

  padde do
    padde_meth(test_data.dup)
  end
end

These are the results:

Running each test once. Test will take about 1 second.
fl00r is similar to padde
padde is faster than falsetru by 30.000000000000004% ± 10.0%
