ruby - Parse all text files in a folder, saving the text surrounding regex matches

Question

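I'm trying to write code that goes through all the text files in a directory, parses them while searching for occurrences of certain regexes, and saves the 20 or so words before and after each match.

I use Dir.glob to select all the .txt files, and then want to loop over all of them (with each), use a regex to search for occurrences of a word (line.match? File.find_all?), and then print the word and its surrounding context to a base file.

I've tried to piece this all together, but I don't believe I've gotten very far. Any help is greatly appreciated.

This is what I have: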

    Dir::mkdir("summaries") unless File.exists?("summaries")
    Dir.chdir("summaries")
    all_text_files = Dir.glob("*.txt")

    all_text_files.each do |textfile|
        puts "currently summarizing " + textfile + "..."
        File.readlines(#{textfile}, "r").each do |line|
            if line.match /trail/ #does line.match work?
            if line =~ /trail/ #would this work?
                return true
                #save line to base textfile while referencing name of searchfile
            end
        end
    end

score 2 · Accepted Answer

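The code below goes through every .txt file in the directory and prints every occurrence of whatever regex you decide to use to the file base.txt, along with the name of the file it was found in. I chose the scan method, another of the available regex methods, which returns an array of the matches. See the Ruby documentation for String#scan. You can also change the code if you only want one occurrence per file.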

##
# This method takes a string, an integer and a string as arguments.
# It returns the indices padded on either side of the passed-in index
# by 20 characters (in our case), but never padded past the start or
# end of the passed-in text. The word parameter is used to decide the
# top index, as we do not want to include the word in our padding
# calculation.
#
# = Example
#
#  indices("hello bob how are you?", 6, "bob")
#      # => [0, 22] since the padding is clamped to the text bounds (length 22)
#
#  indices("this is a string of text that is long enough for a good example", 30, "is")
#      # => [10, 52] The extra 2 account for the length of the word 'is'.
#
def indices text, index, word
    #here's where you get the text from around the word you are interested in.
    #I have set the padding to 20 but you can change that as you see fit.
    padding = 20
    #Here we are getting the lowest point at which we can retrieve a substring.
    #We don't want to try and get an index before the beginning of our string.
    bottom_i = index - padding < 0 ? 0 : index - padding

    #Same concept as bottom except at the top end of the string.
    top_i = index + word.length + padding > text.length ? text.length : index + word.length + padding
    return bottom_i, top_i
end

#Script start.
base_text = File.open("base.txt", 'w')
Dir.mkdir("summaries") unless File.exist?("summaries") # File.exists? was removed in Ruby 3.2
Dir.chdir("summaries")

Dir.glob("*.txt").each do |textfile|
    whole_file = File.read(textfile) # reads the file and closes the handle
    puts "Currently summarizing " + textfile + "..."
    #This is a placeholder for the 'current' index we are looking at.
    curr_i = 0
    str = nil
    #This will go through the entire file and find each occurrence of the specified regex.
    whole_file.scan(/trail/).each do |match|
      #This is the index of the matching string looking from the curr_i index onward.
      #We do this so that we don't find and report things twice.
      if i_match = whole_file.index(match, curr_i)
        top_bottom = indices(whole_file, i_match, match)
        base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
        #We move our current index just past the match we found, so when
        #we ask for the matching index from curr_i onward, we don't get the same index
        #again.
        curr_i = i_match + match.length
        #If you only want one occurrence break here
      end
    end
    puts "Done summarizing " + textfile + "."
end
base_text.close
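
The script above pads by characters, while the question asks for roughly 20 words on either side. A minimal sketch of a word-based variant follows; the helper name `word_contexts` and the sample sentence are made up for illustration, and it assumes the pattern matches whole words (a match in the middle of a word would be split off from the rest of that word). It relies on scan setting `Regexp.last_match` inside its block, so no second search with index is needed.

```ruby
# Hypothetical helper: returns each match of `pattern` in `text` together
# with up to `padding` whitespace-separated words before and after it.
def word_contexts(text, pattern, padding = 20)
  contexts = []
  text.scan(pattern) do
    m = Regexp.last_match                       # MatchData for the current match
    before = m.pre_match.split.last(padding)    # words preceding the match
    after  = m.post_match.split.first(padding)  # words following the match
    contexts << (before + [m[0]] + after).join(" ")
  end
  contexts
end

# Made-up sample text, with a padding of 2 words to keep the output short.
sample = "we walked the trail at dawn and found the trail empty"
word_contexts(sample, /\btrail\b/, 2).each { |context| puts context }
```

Each returned string could be written to base.txt in place of the character slice used above.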

score 2 · Accepted Answer

Your code looks sloppy. It is full of errors. Here are some (there may be more):

You're missing a + here:

puts "currently summarizing " textfile + "..."

It should be:

puts "currently summarizing " + textfile + "..."

You can only use #{} inside double quotes, so instead of:

File.open(#{textfile}, "r")

just do:

File.open(textfile, "r")
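
To illustrate the point about #{}, here is a small sketch with a made-up file name. Interpolation happens only inside double-quoted strings; inside single quotes the characters are kept literally.

```ruby
textfile = "notes.txt" # hypothetical file name, for illustration only

double = "currently summarizing #{textfile}..." # interpolated
single = 'currently summarizing #{textfile}...' # kept literally

puts double
puts single
```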

This makes no sense at all:

File.open(#{textfile}, "r")
textfile.each do line

It should be:

File.open(textfile, "r").each do |line|
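
As an alternative to the line above, `File.foreach` iterates over the lines of a file without leaving an open handle behind. This is only a sketch: the helper name `matching_lines` and the temporary sample file are assumptions for the sake of a runnable example.

```ruby
require "tempfile"

# Hypothetical helper: returns the lines of the file at `path` that match `pattern`.
def matching_lines(path, pattern)
  File.foreach(path).select { |line| line.match?(pattern) }
end

# Write a small throwaway file so the example can run on its own.
Tempfile.create(["sample", ".txt"]) do |file|
  file.write("a quiet road\nthe trail begins here\nno match on this line\n")
  file.flush
  puts matching_lines(file.path, /trail/)
end
```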

This doesn't make sense either:

return true
print line

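line will never be printed, because it comes after return true.

Edit:

As for your new question: either works, but match and =~ have different return values. It depends on what you want to do.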

foo = "foo trail bar"
foo.match /trail/ # => #<MatchData "trail">
foo =~ /trail/ # => 4
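
A slightly longer sketch of the same comparison, using the same sample string. The MatchData returned by match also exposes the text around the match, which may be useful for saving context, and both forms return nil when nothing matches. The variable names are illustrative only.

```ruby
foo = "foo trail bar"

m = foo.match(/trail/)  # MatchData object
puts m[0]               # the matched text
puts m.begin(0)         # offset where the match starts
puts m.pre_match        # text before the match
puts m.post_match       # text after the match

pos = foo =~ /trail/    # just the offset of the match
puts pos

# Both forms return nil when the pattern is not found.
no_match = foo.match(/road/)
no_pos = foo =~ /road/
p no_match, no_pos
```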
