ruby-on-rails - Given 2 strings, get an array of all the separate substring phrases longer than x

Question

Suppose we have an array of strings:

["carflam fizz peanut butter", "fizz foo", "carflam foo peanut butter"]

The output of the function get_array_of_substrings_larger_than(min), called as get_array_of_substrings_larger_than(3), should be ["peanut butter", "carflam", "fizz"], because each of these elements is shared by at least 2 strings.

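I'm not sure how to write this. Note that this is different from simply comparing each string against the others and taking the largest substring: in the example above, carflam is always the second-largest substring.

"peanut butter" stays together because, when you compare "carflam fizz peanut butter" and "carflam foo peanut butter", the largest common substring is "peanut butter". The second-largest substring is "carflam", and both should appear independently in the output. However, neither "peanut" nor "butter" should be in the output, because both are contained in a larger substring.

Thanks for your help.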

score 1 · Accepted Answer

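So, first of all, I think it should be made clear that what you are asking for is the largest phrases, for lack of a better word. The largest substrings I see in the sample array are actually "carflam f" and " peanut butter". Feel free to change the parameters if ary is a known quantity in whatever class you're using: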

def get_array_of_phrases_larger_than(ary, min)
  all = []

  # Ugly, but this will span the range of possible phrases for each item in the
  # array, building them into a one-dimensional array if they meet the minimum
  # length requirements
  ary.each do |phrase|
    words = phrase.split
    last = words.length - 1
    (0..last).each do |from|
      (from..last).each do |to|
        p = words[from..to].join(" ")
        all << p if p.size > min
      end
    end
  end

  # Get a list of all repeated keys
  repeated = all.group_by(&:to_s).select { |_, v| v.size > 1 }
  keys = repeated.keys

  # Get a list of the longest keys, such that we exclude "peanut" and "butter"
  # if "peanut butter" exists
  longest = repeated.select do |key, _|
    keys.select { |k| k.include?(key) }.size == 1
  end

  # Sort in reverse order by length
  longest.keys.sort_by { |k| -k.size }
end

@ary = ["carflam fizz peanut butter", "fizz foo", "carflam foo peanut butter"]

get_array_of_phrases_larger_than @ary, 3
# => ["peanut butter", "carflam", "fizz"]

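Note that this is agnostic to which string each phrase came from, so you may get false positives such as ["butter butter", "foo", "baz"] returning ["butter"], but I'll leave that as an exercise for the reader.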
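One possible way to approach that exercise is to record which input string each phrase came from and count distinct sources instead of raw occurrences. The following is only a sketch under that assumption, not the answerer's code; the method name phrases_shared_by_strings and the sources hash are hypothetical names introduced for illustration.

```ruby
require 'set'

# Hypothetical variant: a phrase only counts as shared when it appears in at
# least two *different* input strings.
def phrases_shared_by_strings(ary, min)
  # Map each phrase to the set of indexes of the strings that contain it
  sources = Hash.new { |hash, key| hash[key] = Set.new }

  ary.each_with_index do |string, index|
    words = string.split
    last = words.length - 1
    (0..last).each do |from|
      (from..last).each do |to|
        phrase = words[from..to].join(" ")
        sources[phrase] << index if phrase.size > min
      end
    end
  end

  # Keep phrases seen in two or more distinct strings
  shared = sources.select { |_, indexes| indexes.size > 1 }.keys

  # Drop phrases contained in a longer shared phrase, then sort longest first
  longest = shared.select do |phrase|
    shared.count { |other| other.include?(phrase) } == 1
  end
  longest.sort_by { |phrase| -phrase.size }
end

phrases_shared_by_strings(["butter butter", "foo", "baz"], 3)
# => []
```

With the array from the question and a minimum of 3, this sketch still returns ["peanut butter", "carflam", "fizz"], since each of those phrases occurs in two different strings.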

score 0 · Accepted Answer

Shouldn't get_array_of_substrings_larger_than(3) also output fizz, since it is longer than 3 and appears in 2 strings?

To solve this, you can compare the strings pairwise and find the longest common subsequence: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

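To find the second-longest common subsequence, you can remove the longest common subsequence from the strings, so if peanut butter is found, the new strings to compare are "carflam fizz" and "carflam foo".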
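The pairwise idea above might be sketched roughly as follows. Two assumptions differ from the description and are worth flagging: the sketch looks for the longest contiguous run of whole words (a word-level longest common substring, not a true subsequence), because the question asks for contiguous phrases, and it masks matched words with placeholders instead of deleting them, so that words on either side of a removed phrase are not joined into a new phrase. The names longest_common_run and shared_runs are hypothetical.

```ruby
# Hypothetical helper: longest contiguous run of words shared by two word
# arrays, using the usual dynamic-programming table for longest common
# substring, applied to words instead of characters.
# Returns [start index in a, start index in b, length].
def longest_common_run(a, b)
  best = [0, 0, 0]
  prev = Array.new(b.length + 1, 0)

  a.each_index do |i|
    curr = Array.new(b.length + 1, 0)
    b.each_index do |j|
      next unless a[i] == b[j]
      curr[j + 1] = prev[j] + 1
      len = curr[j + 1]
      best = [i - len + 1, j - len + 1, len] if len > best[2]
    end
    prev = curr
  end

  best
end

# Hypothetical driver: repeatedly take the longest shared run from a pair of
# strings, mask it in both, and look again until nothing is shared.
def shared_runs(first, second, min)
  a = first.split
  b = second.split
  found = []

  loop do
    start_a, start_b, len = longest_common_run(a, b)
    break if len.zero?

    phrase = a[start_a, len].join(" ")
    found << phrase if phrase.size > min

    # Unique placeholder objects never compare equal to anything else,
    # so masked words cannot match again
    a[start_a, len] = Array.new(len) { Object.new }
    b[start_b, len] = Array.new(len) { Object.new }
  end

  found
end

strings = ["carflam fizz peanut butter", "fizz foo", "carflam foo peanut butter"]
strings.combination(2).flat_map { |x, y| shared_runs(x, y, 3) }.uniq
# => ["fizz", "peanut butter", "carflam"]
```

Each pair is handled independently here, so a phrase found for one pair could still be part of a longer phrase found for another pair; a final pass that filters those out may be needed for other inputs.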

score 0 · Accepted Answer

Start from here:

    a = ["carflam fizz peanut butter", "fizz foo", "carflam foo peanut butter"]

    # Map each phrase to the list of strings (tagged with their index) containing it
    phrases = Hash.new { |hash, key| hash[key] = [] }

    a.each_with_index do |s, index|
      words = s.split(' ')
      len = words.length
      # Enumerate every run of consecutive words, from 1 word up to the whole string
      (1..len).each do |phrase_len|
        (0..(len - phrase_len)).each do |start_word|
          phrase = words[start_word, phrase_len].join(' ')
          phrases[phrase] << "(#{index}) #{s}"
        end
      end
    end

    phrases.each_pair do |phrase, indexes|
      puts "found *#{phrase}* of #{indexes.size} elements"
    end

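This gives you a map from each phrase to the indexes of the strings where it occurs; I'm sure you can make it do what you want from there~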

score 0 · Accepted Answer

def get_array_of_phrases_larger_than(s, min)
  arr = []
  # Compare every pair of strings; y and z are block-local variables
  s.combination(2).to_a.each do |x; y, z|
    z = x.first.split(' ').select { |el| el.length > min }
    y = x.last.split(' ').select { |el| el.length > min }
    if (z & y).length == 1
      arr << (z & y).join(" ")
    else
      m = (z & y)
      # Try the longest combinations of shared words first
      (2..(m.length - 1)).to_a.reverse.each do |l|
        m.combination(l).each do |com|
          break arr << com.join(" "), arr << (m - com).join(" ") if (x.first.include? com.join(" ")) && (x.last.include? com.join(" "))
        end
      end
    end
  end
  p arr
end

s = ["carflam fizz peanut butter", "fizz foo", "carflam foo peanut butter"]
get_array_of_phrases_larger_than(s,2)
get_array_of_phrases_larger_than(s,3)
get_array_of_phrases_larger_than(s,4)

Output:

["fizz", "peanut butter", "carflam", "foo"]
["fizz", "peanut butter", "carflam"]
["peanut butter", "carflam"]
