
I have some code to parse an Apache log file (start_search and end_search are date strings in the format found in an Apache log):

import csv
import re
from itertools import takewhile, dropwhile

with open("/var/log/apache2/access.log", 'r') as log:
    # Skip lines until start_search appears, then stop at end_search.
    s_log = dropwhile(lambda L: start_search not in L, log)
    e_log = takewhile(lambda L: end_search not in L, s_log)
    query = [line for line in e_log if re.search(r'GET /(.+veggies|.+fruits)', line)]

    query_dict = csv.DictReader(query, fieldnames=('ip','na-1','na-2','time', 'zone', 'url', 'refer', 'client'), quotechar='"', delimiter=" ")

    veggies = [ x for x in query_dict if re.search('veggies',x['url']) ]
    fruits = [ x for x in query_dict if re.search('fruits',x['url']) ]

Whichever list comprehension runs second always produces an empty list. For example, if I switch the order of the last two lines:

    fruits = [ x for x in query_dict if re.search('fruits',x['url']) ]
    veggies = [ x for x in query_dict if re.search('veggies',x['url']) ]

the second list is always empty.

Why? (and how can I populate the fruits and veggies lists?)


1 Answer


You can only loop over an iterator once. query_dict is an iterator, and once it has been scanned for veggies it cannot be iterated again to search for fruits.
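A minimal sketch of that behaviour, using two made-up log-like lines in place of the real access log (the sample lines and the shortened fieldnames are assumptions for illustration only):

```python
import csv

# Hypothetical stand-in for the filtered `query` list from the question.
sample = [
    '1.2.3.4 "GET /shop/veggies HTTP/1.1"',
    '5.6.7.8 "GET /shop/fruits HTTP/1.1"',
]

# Shortened field list for the sample lines; the real log has more columns.
reader = csv.DictReader(sample, fieldnames=('ip', 'url'), quotechar='"', delimiter=' ')

first_pass = [row['url'] for row in reader]   # consumes every row
second_pass = [row['url'] for row in reader]  # the reader has nothing left to yield

print(first_pass)
print(second_pass)
```

The second comprehension gets an empty list regardless of its filter, which is the symptom described in the question.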

Don't use list comprehensions here. Loop over query_dict once, checking each entry for veggies and for fruits:

veggies = []
fruits = []

for x in query_dict:
    if re.search('veggies',x['url']):
        veggies.append(x)
    if re.search('fruits',x['url']):
        fruits.append(x)

The alternatives are:

  • Re-create the csv.DictReader() object for the fruits list:

    query_dict = csv.DictReader(query,fieldnames=('ip','na-1','na-2','time', 'zone', 'url', 'refer', 'client'),quotechar='"',delimiter=" ")
    veggies = [ x for x in query_dict if re.search('veggies',x['url']) ]
    query_dict = csv.DictReader(query,fieldnames=('ip','na-1','na-2','time', 'zone', 'url', 'refer', 'client'),quotechar='"',delimiter=" ")
    fruits = [ x for x in query_dict if re.search('fruits',x['url']) ]
    

    This does do double work; you loop over the whole dataset twice.

  • Use itertools.tee() to "clone" the iterator:

    from itertools import tee
    veggies_query_dict, fruits_query_dict = tee(query_dict)
    veggies = [ x for x in veggies_query_dict if re.search('veggies',x['url']) ]
    fruits = [ x for x in fruits_query_dict if re.search('fruits',x['url']) ]
    

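    This ends up caching all of query_dict in the tee buffer, requiring twice the memory for the same task, until the fruits comprehension empties the buffer again.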
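Both alternatives can be tried side by side on a small in-memory sample. This is only a sketch: the sample lines, the shortened fieldnames, and the make_reader helper are hypothetical and not part of the original code. Note that re-creating the reader works only because query is a list, which can be iterated repeatedly; passing the open file object instead would leave the second reader with nothing to read.

```python
import csv
import re
from itertools import tee

# Hypothetical stand-in for the filtered `query` list from the question.
query = [
    '1.2.3.4 "GET /shop/veggies HTTP/1.1"',
    '5.6.7.8 "GET /shop/fruits HTTP/1.1"',
    '9.9.9.9 "GET /shop/veggies?page=2 HTTP/1.1"',
]

def make_reader(lines):
    # Illustrative helper; shortened field list to match the sample lines.
    return csv.DictReader(lines, fieldnames=('ip', 'url'), quotechar='"', delimiter=' ')

# Alternative 1: a fresh reader per pass (two full passes over `query`).
veggies_a = [r for r in make_reader(query) if re.search('veggies', r['url'])]
fruits_a = [r for r in make_reader(query) if re.search('fruits', r['url'])]

# Alternative 2: one reader split into two independent iterators.
# tee() buffers rows that one iterator has consumed and the other has not.
veggies_it, fruits_it = tee(make_reader(query))
veggies_b = [r for r in veggies_it if re.search('veggies', r['url'])]
fruits_b = [r for r in fruits_it if re.search('fruits', r['url'])]

print(len(veggies_a), len(fruits_a))
print(veggies_a == veggies_b, fruits_a == fruits_b)
```

Either way both lists are populated; the single loop shown earlier remains the cheapest option because it reads each row exactly once and buffers nothing.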

answered 2013-10-26T00:23:36.363