ruby - Prevent or remove duplicates in a text scraper?

Question

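I have a script that parses the text files in a folder and saves a predefined number of words around certain search terms.

For example, it looks for words such as "date" and "year". If it finds both in the same sentence, it saves that sentence twice. Likewise, if the same word occurs several times in one sentence, the sentence is saved once per occurrence.

As a result, the scraper saves a lot of unnecessary duplicate text.

I see two possible solutions:

If the next search match lies within the padding of the previous group of words, it is not saved.
If a group of, say, seven words around a search match is also part of the previous group, it is not saved (or is removed afterwards).

Everything I have tried so far has failed completely: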

#helper
# Returns the start and end indices of the window around a match, clamped to the text.
def indices(text, index, word)
    padding = 200
    bottom_i = index - padding < 0 ? 0 : index - padding
    top_i = index + word.length + padding > text.length ? text.length : index + word.length + padding
    return bottom_i, top_i
end

#script
base_text = File.open("base.txt", 'w')
Dir.mkdir("summaries") unless Dir.exist?("summaries")
Dir.chdir("summaries")

Dir.glob("*.txt").each do |textfile|
    whole_file = File.read(textfile)
    puts "Currently summarizing " + textfile + "..."
    curr_i = 0
    str = nil
    whole_file.scan(Regexp.union(/firstword/, /secondword/)).each do |match|
      if i_match = whole_file.index(match, curr_i)
        top_bottom = indices(whole_file, i_match, match)
        base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
        curr_i += i_match
      end
    end
    puts "Done summarizing " + textfile + "."
end
base_text.close

score 0 · Accepted Answer

Something like this would be better:

whole_file.scan(Regexp.union(/firstword/, /secondword/)).each do |match|
  if i_match = whole_file.index(match, curr_i)
    top_bottom = indices(whole_file, i_match, match)
    base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
    # Advance a further 50 characters so that nearby matches are skipped.
    curr_i += i_match + 50
  end
end
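
The first idea from the question (do not save a match that falls inside the padding of the previous one) can also be sketched by tracking where the last saved excerpt ended. This is only a minimal sketch: the method name `excerpts` and the variable `covered_until` are made up for illustration, and it relies on `String#scan` setting `Regexp.last_match` inside its block.

```ruby
# Hypothetical helper (not from the original post): returns padded excerpts
# around each match, skipping matches already covered by the previous excerpt.
def excerpts(text, pattern, padding = 200)
  results = []
  covered_until = 0 # index just past the end of the last saved excerpt
  text.scan(pattern) do
    m = Regexp.last_match
    # A match that starts inside the previous excerpt is a duplicate; skip it.
    next if m.begin(0) < covered_until

    bottom = [m.begin(0) - padding, 0].max
    top = [m.end(0) + padding, text.length].min
    results << text[bottom...top]
    covered_until = top
  end
  results
end
```

In the scraper loop this could replace the `index`/`curr_i` bookkeeping, for example `excerpts(whole_file, Regexp.union(/firstword/, /secondword/)).each { |e| base_text.puts(e + " : " + textfile) }`.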

score 0 · Accepted Answer

Why not keep track of what you are looking for:

search_words = %w( year date etc )

Then downcase the string to be searched and start an index.

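# search_words and delta (padding in characters) are passed in so they are in scope inside the method.
def summarize(str, search_words, delta = 200)
  search_str = str.downcase
  ind = 0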

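Then find the smallest index offset of any of your search words in search_str, push the text from (ind + offset - delta) to (ind + offset + delta) into matches, drop everything in search_str up to that point, and continue in a while loop. Something like: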

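  matches = []
  # compact drops words that no longer occur; min of an empty array is nil, which ends the loop
  while (offset = search_words.map { |w| search_str.index(w) }.compact.min)
    ind += offset
    start = [ind - delta, 0].max # avoid a negative index wrapping around to the end of str
    matches.push str[start, delta * 2]
    # drop everything up to the end of this window so overlapping hits are not saved again
    search_str = search_str[(offset + delta)..] || ""
    ind += delta # keep ind aligned with the new start of search_str
  end
  matches
end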
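
Because the answer gives the method in pieces, here is a hedged, self-contained assembly of it together with a usage line. The constant `SEARCH_WORDS`, the extra parameters, and the sample sentence are assumptions for illustration and are not part of the original answer.

```ruby
# Assumed list of search terms (illustrative only).
SEARCH_WORDS = %w[year date].freeze

# Returns excerpts of roughly 2 * delta characters around each search word,
# skipping words that fall inside the excerpt that was just saved.
def summarize(str, search_words = SEARCH_WORDS, delta = 200)
  search_str = str.downcase
  ind = 0 # position in str that corresponds to the start of search_str
  matches = []
  # compact removes nil for words that are absent; min of [] is nil, ending the loop
  while (offset = search_words.map { |w| search_str.index(w) }.compact.min)
    ind += offset
    start = [ind - delta, 0].max
    matches << str[start...(ind + delta)]
    # Continue searching only after the saved window.
    search_str = search_str[(offset + delta)..] || ""
    ind += delta
  end
  matches
end

# "date" and "year" are close together, so only one excerpt is saved.
puts summarize("The date and the year appear in one sentence.", SEARCH_WORDS, 20)
```

Note that each window is measured from the start of the matched word, as in the answer, so the padding after the word is slightly shorter than the padding before it.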
