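The code below goes through every .txt file in a directory and writes every occurrence of whatever regular expression you choose, together with the name of the file it was found in, to the file base.txt. I chose to use scan, another of the available regex methods, which returns an array of the matches. See the rubydoc for scan for details. You can also change the code if you only want one occurrence per file.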
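To make the behaviour of scan concrete, here is a minimal sketch. The sample string and variable names are made up for illustration and are not part of the script that follows.

```ruby
# Hypothetical sample text, used only for illustration.
sample = "the trail forks; one trail climbs, the other trail descends"

# String#scan returns an array holding every non-overlapping match.
matches = sample.scan(/trail/)
puts matches.inspect   # => ["trail", "trail", "trail"]

# String#index accepts a start offset, which lets us step from one
# match to the next instead of finding the first one again.
first  = sample.index("trail")
second = sample.index("trail", first + "trail".length)
puts first
puts second
```

Note that scan only reports the matched text, not where it was found, which is why the script pairs it with index to locate each match.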
##
# This method takes a string, int and string as an argument.
# The method will return the indices that are padded on either side
# of the passed in index by 20 (in our case) but not padded by more
# than the size of the passed in text. The word parameter is used to
# decide the top index as we do not want to include the word in our
# padding calculation.
#
# = Example
#
# indices("hello bob how are you?", 6, "bob")
# # => [0, 22] since the text is only 22 characters long
#
# indices("this is a string of text that is long enough for a good example", 31, "is")
# # => [11, 53] The extra 2 account for the length of the word 'is'.
#
def indices(text, index, word)
  # Here's where you get the text from around the word you are interested in.
  # I have set the padding to 20 but you can change that as you see fit.
  padding = 20
  # Here we are getting the lowest point at which we can retrieve a substring.
  # We don't want to try and get an index before the beginning of our string.
  bottom_i = index - padding < 0 ? 0 : index - padding
  # Same concept as bottom except at the top end of the string.
  top_i = index + word.length + padding > text.length ? text.length : index + word.length + padding
  return bottom_i, top_i
end
# Script start.
base_text = File.open("base.txt", 'w')
Dir.mkdir("summaries") unless Dir.exist?("summaries")
Dir.chdir("summaries")
Dir.glob("*.txt").each do |textfile|
  # File.read opens, reads and closes the file in one call.
  whole_file = File.read(textfile)
  puts "Currently summarizing " + textfile + "..."
  # This is a placeholder for the 'current' index we are looking at.
  curr_i = 0
  # This will go through the entire file and find each occurrence of the specified regex.
  whole_file.scan(/trail/).each do |match|
    # This is the index of the matching string looking from the curr_i index onward.
    # We do this so that we don't find and report things twice.
    if i_match = whole_file.index(match, curr_i)
      top_bottom = indices(whole_file, i_match, match)
      base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
      # We move our current index just past the match we found so that when
      # we ask for the matching index from curr_i onward, we don't get the same index
      # again.
      curr_i = i_match + match.length
      # If you only want one occurrence, break here.
    end
  end
  puts "Done summarizing " + textfile + "."
end
base_text.close
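As a possible alternative to tracking curr_i by hand, scan can be given a block, in which case Regexp.last_match exposes the offsets of the current match directly. The sketch below is only an illustration under that assumption: the snippets helper, its padding parameter and the sample text are hypothetical names, not part of the script above.

```ruby
# Hypothetical helper: returns [snippet, offset] pairs for each match of
# pattern in text, with up to `padding` characters of context on each side.
def snippets(text, pattern, padding = 20)
  results = []
  text.scan(pattern) do
    m = Regexp.last_match
    # Clamp the window so it never runs past either end of the text.
    start = [m.begin(0) - padding, 0].max
    stop  = [m.end(0) + padding, text.length].min
    results << [text[start...stop], m.begin(0)]
  end
  results
end

# Hypothetical sample text, used only for illustration.
sample = "this is a string of text that is long enough for a good example"
found = snippets(sample, /long/, 5)
found.each { |snippet, offset| puts "#{offset}: #{snippet}" }
```

Because the offsets come from the match itself, there is no second search with index and therefore no risk of reporting the same position twice.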