python - Iterate over an array and search a file for each item in the array

Question

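I don't know whether I'm asking this the right way, but I want to search a log file for every word in an array. At this point, I ask the user to drag the file in question into the terminal, then build an array from their input. The program should print every line in which a word is found.

Once I get this working I'll add formatting, a counter, or a short summary of what I found in the file, and so on.

Here is what I have so far, but when I run it, it doesn't actually find any of the words. I have been looking at re usage examples, but I think that may be overly complicated for what I have in mind: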

def wordsToFind():
    needsWords = True
    searchArray = []
    print "Add words to search ('done') to save/continue."
    while needsWords == True:
        word = raw_input("Enter a search word: ")
        if word.lower() == "done":
            needsWords = False
            break
        else:
            searchArray.append(word)
            print word + " added"
    return searchArray

def getFile():
    file_to_read = raw_input("Drag file here:").strip()
    return file_to_read

def main():
    filePath = getFile()
    searchArray = wordsToFind()
    print "Words searched for: ", searchArray
    searchCount = []

    with open(filePath, "r") as inFile:
        for line in inFile:
            for item in searchArray:
                if item in line:
                    print item


main()

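Obviously, any optimization suggestions or advice on writing better Python are very welcome here. I only know what I know, and I appreciate all the help!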

score 2 · Accepted Answer

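This is exactly the kind of problem map-reduce is meant to solve. In case you're not familiar, map-reduce is a simple two-step process. Suppose you have a list storing the words you're interested in finding in a text. Your mapper function can iterate over this list of words for each line of the text, and whenever a word appears in the line it returns a value such as ['word', lineNum], which is stored in a results list. The mapper is essentially a wrapper around a for loop. You can then take your results list and "reduce" it by writing a reducer function, which in this case takes a results list that looks like [['word1', 1] ... ['word1', n] ...] and turns it into an object that looks like {'word1': [1, 2, 5], 'word3': [7], ...}.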

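This approach is advantageous because you abstract away the process of iterating over lists while performing a common operation on each item, and if your analysis needs change (as they often do), you only need to change your mapper/reducer functions without touching the rest of the code. Additionally, this approach is highly parallelizable, should that ever become an issue (just ask Google!).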

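Python has built-in tools for this: map() is a built-in function, and reduce() is a built-in in Python 2 that moved to functools.reduce() in Python 3; look them up in the Python documentation. So that you can see how they work, I implemented a version of map/reduce based on your problem without using the built-in libraries. Since you didn't specify how your data is stored, I made a few assumptions about it, namely that the list of words of interest is given as a comma-separated file. To read the text file, I used readlines() to get an array of lines, and a regular expression pattern to split the lines into words (i.e., split on any non-alphanumeric character). Of course, this may not suit your needs, so you can change it to whatever makes sense for the files you're looking at.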

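I tried to stay away from esoteric Python features (no lambdas!), so hopefully the implementation is clear. One last note: I used a loop to iterate over the lines of the text file, and a map function to iterate over the list of words of interest. You could use nested map functions instead, but I wanted to keep track of the loop index (since you care about line numbers). If you really want to nest the map functions, you can store the lines as tuples of line and line number when you read the file, or you can modify the map function to return the index; your choice.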
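
As a rough illustration of the tuple idea, here is a minimal sketch that pairs each line with its number via enumerate() so the mapper no longer depends on a loop index. The names map_line, lines, and words, as well as the sample data, are made up for this example and are not part of the implementation below.

```python
import re

# Hypothetical sample data standing in for the text file and the word list
lines = ["ERROR disk full", "all good", "Warning: disk slow, error ignored"]
words = ["error", "disk"]

def map_line(numbered_line):
    """Return a [word, line_number] pair for each search word found in one line."""
    line_number, text = numbered_line
    tokens = set(token.lower() for token in re.split(r'\W+', text))
    return [[word, line_number] for word in words if word in tokens]

# enumerate(..., start=1) yields (line_number, text) tuples, so the mapper
# receives the line number along with the line and no loop index is needed
per_line = map(map_line, enumerate(lines, start=1))

# Flatten the per-line results into one list of pairs
pairs = [pair for line_pairs in per_line for pair in line_pairs]
print(pairs)
```

Because each mapper call only sees its own (line_number, text) tuple, the lines could in principle be processed in any order or in separate workers.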

I hope this helps!

    #!/usr/bin/env python

    #Regexp library
    import re

    #Map
    #This function returns a new array containing
    #the elements after they have been modified by whatever function we passed in.
    def mapper(function, sequence):

        #List to store the results of the map operation
        result = []

        #Iterate over each item in sequence, append the values to the results list
        #after they have been modified by the "function" supplied as an argument in the
        #mapper function call.
        for item in sequence:
            result.append(function(item))

        return result

    #Reduce
    #The purpose of the reduce function is to go through an array, and combine the items 
    #according to a specified function - this specified function should combine an element 
    #with a base value
    def reducer(function, sequence, base_value):

        #Need to get a base value to serve as the starting point for the construction of 
        #the result
        #I will assume one is given, but in most cases you should include extra validation 
        #here to either ensure one is given, or some sensible default is chosen

        #Initialize our accumulative value object with the base value
        accum_value = base_value

        #Iterate through the sequence items, applying the "function" provided, and 
        #storing the results in the accum_value object
        for item in sequence:
            accum_value = function(item, accum_value)

        return accum_value

    #With these functions it should be sufficient to address your problem, what remains 
    #is simply to get the data from the text files, and keep track of the lines in 
    #which words appear
    if __name__ == '__main__':

        word_list_file = 'FILEPATH GOES HERE'

        #Read in a file containing the words that will be searched in the text file 
        #(assumes words are given as a comma separated list)
        infile = open(word_list_file, 'rt')    #Open file
        content = infile.read()     #read the whole file as a single string
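        #split the string into an array of words, dropping surrounding whitespace and
        #lowercasing so they compare equal to the lowercased words of each line
        word_list = [word.strip().lower() for word in content.split(',') if word.strip()]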
        infile.close()

        target_text_file = 'FILEPATH GOES HERE'

        #Read in the text to analyze
        infile = open(target_text_file, 'rt')   #Open file
        target_text_lines = infile.readlines()    #Read the whole file as an array of lines
        infile.close()

        #With the data loaded, the overall strategy will be to loop over the text lines, and 
        #we will use the map function to loop over the word_list and see if they are in 
        #the current text file line

        #First, define the my_mapper function that will process your data, and will be passed to
        #the map function
        def my_mapper(item):

            #Split the current sentence into words
            #Will split on any non alpha-numeric character. This strategy can be revised 
            #to find matches to a regular expression pattern based on the words in the 
            #words list. Either way, make sure you choose a sensible strategy to do this.
            current_line_words = re.split(r'\W+', target_text_lines[k])

            #lowercase the words
            current_line_words = [word.lower() for word in current_line_words]

            #Check if the current item (word) is in the current_line_words list, and if so,
            #return the word and the line number
            if item in current_line_words:
                return [item, k+1]    #Return k+1 because k begins at 0, but I assume line
                                      #counting begins with 1?
            else:
                return []   #Technically, this does not need to be added, it can simply 
                            #return None by default, but that requires manually handling iterator 
                            #objects so the loop doesn't crash when seeing the None values, 
                            #and I am being lazy :D

        #With the mapper function established, we can proceed to  loop over the text lines of the 
        #array, and use our map function to process the lines against the list of words.

        #This array will store the results of the map operation
        map_output = []

        #Loop over text file lines, use mapper to find which words are in which lines, store 
        #in map_output list. This is the exciting stuff!
        for k in range(len(target_text_lines)):
            map_output.extend(mapper(my_mapper, word_list))

        #At this point, we should have a list of lists containing the words and the lines they 
        #appeared in, and it should look like, [['word1', 1] ... ['word25', 5] ... [] ...]
        #As you can see, the post-map array will have an entry for each word that appeared in 
        #each line, and if a particular word did not appear in a particular line, there will be an
        #empty list instead.

        #Now all that remains is to summarize our data, and that is what the reduce function is 
        #for. We will iterate over the map_output list, and collect the words and which lines 
        #they appear at in an object that will have the format { 'word': [n1, n2, ...] },where 
        #n1, n2, ... are the lines the word appears in. As in the case for the mapper
        #function, the output of the reduce function can be modified in the my_reducer function 
        #you supply to it. If you'd rather it return something else (like say, word count), this
        #is the function to modify.

        def my_reducer(item, accum_value):
            #First, verify item is not empty
            if item != []:
                #If the word already exists in the output object, append the current line
                #value to it (unless that line is already recorded); if not, add it to the
                #object and create a list holding the current line value
                if item[0] not in accum_value:
                    accum_value[item[0]] = [item[1]]
                elif item[1] not in accum_value[item[0]]:
                    accum_value[item[0]].append(item[1])

            return accum_value

        #Now we can call the reduce function, save its output, print it to screen, and we're  
        #done!
        #(Note that for base value we are just passing in an empty object, {})
        reduce_results = reducer(my_reducer, map_output, {})

        #Print results to screen
        for result in reduce_results:
            print('word: {}, lines: {}'.format(result, reduce_results[result]))
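
For comparison with the hand-written reducer above, here is a minimal sketch of the same fold using the standard library's functools.reduce. One difference to watch for: functools.reduce calls the function as function(accumulator, item), whereas the reducer above calls function(item, accum_value). The name collect_lines and the sample map_output are assumptions made for this example.

```python
from functools import reduce  # a builtin on Python 2; must be imported on Python 3

def collect_lines(accum, pair):
    """Fold one [word, line_number] pair into the {word: [line numbers]} dict."""
    if pair:  # skip the empty lists produced for non-matches
        word, line_number = pair
        numbers = accum.setdefault(word, [])
        if line_number not in numbers:
            numbers.append(line_number)
    return accum

# Hypothetical map output, in the same shape the code above produces
map_output = [['error', 1], [], ['disk', 1], ['error', 3], ['error', 3]]

summary = reduce(collect_lines, map_output, {})
print(summary)
```

To get a count per word instead of the line numbers, it should be enough to change what collect_lines stores in the dict.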

score 1 · Accepted Answer

You can do it like this:

a = ['foo', 'bar', 'cox', 'less', 'more']
b = ['foo', 'cox', 'complex', 'list']
c = list(set(a).intersection(set(b)))

This way c will be (the order may vary, since sets are unordered):

['cox', 'foo']

Another way to achieve this is with a Python list comprehension:

c = [x for x in a if x in b]

I haven't tested which way is fastest, but I think it's the one using sets...
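
On the speed question: a membership test against a set takes roughly constant time on average, while x in b on a list scans the list, so the set version tends to win as the lists grow. The list comprehension, on the other hand, keeps the order (and any duplicates) of a. Here is a small sketch of how the intersection could be applied to one line of a log file; words_found, search_words, and the sample line are made up for illustration.

```python
import re

search_words = {"error", "timeout", "disk"}  # hypothetical search terms

def words_found(line, wanted):
    """Return the search words that occur as whole words in the line."""
    tokens = set(re.findall(r'\w+', line.lower()))
    return wanted & tokens  # set intersection

line = "2024-01-01 ERROR: disk quota exceeded"
found = sorted(words_found(line, search_words))
print(found)
```

Calling words_found once per line while reading the file would give the matching words for each line.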
