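I'm trying to write code that goes through all the text files in a directory, parses them while searching for occurrences of a certain regex, and saves the 20 or so words before and after each match.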

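I'm using Dir.glob to select all the .txt files, and then I want to loop over them (with each), use a regex to search for occurrences of a word (line.match? File.find_all?), and then print the word and its surrounding text to a base file.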

I've tried to piece this together, but I don't think I've gotten very far, and I'm not making any more progress. Any help is much appreciated.

Here is what I have:

    Dir::mkdir("summaries") unless File.exists?("summaries")
    Dir.chdir("summaries")
    all_text_files = Dir.glob("*.txt")

    all_text_files.each do |textfile|
        puts "currently summarizing " + textfile + "..."
        File.readlines(#{textfile}, "r").each do |line|
            if line.match /trail/ #does line.match work?
            if line =~ /trail/ #would this work?
                return true
                #save line to base textfile while referencing name of searchfile
            end
        end
    end

2 Answers

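The code below goes through every .txt file in the directory and prints every occurrence of whatever regex you decide to use to the file base.txt, along with the name of the file in which it was found. I chose to use scan, another method available for regexes, which returns an array of the match results. See the Ruby documentation for scan. You can also change the code if you only want one occurrence per file.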
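
As a quick illustration of what scan returns, here is a minimal sketch. The sample sentence is invented for this example, and the block form with Regexp.last_match is just one possible way to recover the position of each match.

```ruby
# Sample text invented for this illustration.
text = "The trail forks here; the left trail climbs, the right trail descends."

# String#scan returns an array with one entry per match.
matches = text.scan(/trail/)
puts matches.inspect   # => ["trail", "trail", "trail"]

# With a block, Regexp.last_match exposes the MatchData of each match,
# so the starting offset can be collected as well.
offsets = []
text.scan(/trail/) { offsets << Regexp.last_match.begin(0) }
puts offsets.inspect
```

Collecting the offsets this way avoids a second search through the string to find out where each match was.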

    ##
    # This method takes a string, an integer and a string as arguments.
    # It returns the indices that are padded on either side of the
    # passed-in index by 20 (in our case), but never beyond the bounds
    # of the passed-in text. The word parameter is used to decide the
    # top index, as we do not want to include the word in our padding
    # calculation.
    #
    # = Example
    #
    #  indices("hello bob how are you?", 6, "bob")
    #      # => [0, 22] since the text length is less than 40
    #
    #  indices("this is a string of text that is long enough for a good example", 30, "is")
    #      # => [10, 52] The extra 2 account for the length of the word 'is'.
    #
    def indices(text, index, word)
      # Here's where you get the text from around the word you are interested in.
      # I have set the padding to 20 but you can change that as you see fit.
      padding = 20
      # Here we are getting the lowest point at which we can retrieve a substring.
      # We don't want to try and get an index before the beginning of our string.
      bottom_i = [index - padding, 0].max

      # Same concept as bottom except at the top end of the string.
      top_i = [index + word.length + padding, text.length].min
      return bottom_i, top_i
    end

    # Script start.
    base_text = File.open("base.txt", 'w')
    Dir.mkdir("summaries") unless File.exist?("summaries")
    Dir.chdir("summaries")

    Dir.glob("*.txt").each do |textfile|
      whole_file = File.read(textfile)
      puts "Currently summarizing " + textfile + "..."
      # This is a placeholder for the 'current' index we are looking at.
      curr_i = 0
      # This will go through the entire file and find each occurrence of the specified regex.
      whole_file.scan(/trail/).each do |match|
        # This is the index of the matching string looking from the curr_i index onward.
        # We do this so that we don't find and report things twice.
        if i_match = whole_file.index(match, curr_i)
          top_bottom = indices(whole_file, i_match, match)
          # top_bottom[1] is an exclusive upper bound, hence the three-dot range.
          base_text.puts(whole_file[top_bottom[0]...top_bottom[1]] + " : " + File.path(textfile))
          # Move the current index just past this match so that the next search
          # from curr_i onward does not return the same index again.
          curr_i = i_match + match.length
          # If you only want one occurrence, break here.
        end
      end
      puts "Done summarizing " + textfile + "."
    end
    base_text.close
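
The script above pads by 20 characters, while the question asks for roughly 20 words on each side. A minimal sketch of a word-based variant follows; word_contexts is a hypothetical helper written for this illustration, and it assumes that splitting on whitespace is an acceptable definition of a word.

```ruby
# Hypothetical helper: for every whitespace-separated word that matches
# `pattern`, returns that word together with up to `radius` words on each
# side, joined back into a single string.
def word_contexts(text, pattern, radius = 20)
  words = text.split
  words.each_index.select { |i| words[i] =~ pattern }.map do |i|
    first = [i - radius, 0].max
    # A range that runs past the end of the array is clipped by Ruby.
    words[first..(i + radius)].join(" ")
  end
end

sample = "we followed the old trail up the ridge before dark"
word_contexts(sample, /trail/, 2).each { |snippet| puts snippet }
# prints "the old trail up the"
```

Each returned snippet could then be written to base.txt in place of the character-based slice.
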
answered 2013-03-11T14:05:08.743

Your code looks sloppy. It is full of errors. Here are some (there may be more):

You are missing a + here:

    puts "currently summarizing " textfile + "..."

It should be:

    puts "currently summarizing " + textfile + "..."

You can only use #{} inside a double-quoted string, so instead of:

    File.open(#{textfile}, "r")

just do:

    File.open(textfile, "r")

This doesn't make any sense at all:

    File.open(#{textfile}, "r")
    textfile.each do line

It should be:

    File.open(textfile, "r").each do |line|

This doesn't make sense either:

    return true
    print line

line is never printed, because the print comes after return true.
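
Putting the fixes above together, the loop could look roughly like the sketch below. matching_lines is a hypothetical helper introduced for this illustration, File.foreach is swapped in for File.open(...).each so the file is closed automatically, and the summaries directory name is taken from the question.

```ruby
# Hypothetical helper: collects [file name, line] pairs for every line in
# the .txt files under `dir` that matches `pattern`.
def matching_lines(dir, pattern)
  results = []
  Dir.glob(File.join(dir, "*.txt")).sort.each do |textfile|
    puts "currently summarizing " + textfile + "..."
    File.foreach(textfile) do |line|
      # `next` moves on to the following line; a `return` here would
      # leave the whole method instead.
      next unless line =~ pattern
      results << [File.basename(textfile), line.chomp]
    end
  end
  results
end

# Prints nothing if the directory does not exist or has no matching lines.
matching_lines("summaries", /trail/).each do |name, line|
  puts "#{name}: #{line}"
end
```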

Edit:

As for your new question: either works, but match and =~ return different values. It depends on what you want to do.

    foo = "foo trail bar"
    foo.match(/trail/) # => #<MatchData "trail">
    foo =~ /trail/ # => 4
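
To make the difference concrete, here is a small sketch of what each return value gives you. It also shows String#match?, which is not mentioned above and is available from Ruby 2.4 onward, for the case where only a yes/no answer is needed.

```ruby
foo = "foo trail bar"

# match returns a MatchData object on success and nil on failure.
m = foo.match(/trail/)
puts m[0]           # the matched text
puts m.pre_match    # everything before the match
puts m.post_match   # everything after the match

# =~ returns the integer index of the match on success and nil on failure.
pos = foo =~ /trail/
puts pos            # => 4

# match? returns only true or false.
puts foo.match?(/trail/)
puts foo.match?(/path/)
```

In an if condition all three behave the same way, since nil and false are the only falsy values in Ruby.
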
answered 2013-03-11T12:44:22.930