python - 过滤海量linux日志文件

Question

我让用户在列表中输入要过滤掉的项目。从那里它过滤使用：

while knownIssuesCounter != len(newLogFile):
    for line in knownIssues:
        if line in newLogFile[knownIssuesCounter]:
            if line not in issuesFound:
                issuesFoundCounter[line]=1
                issuesFound.append(line)
                issuesFound.append(knownIssues[line])
            else:
                issuesFoundCounter[line]=issuesFoundCounter[line] + 1
    knownIssuesCounter +=1

我遇到了一百兆的日志文件，而且它需要永远.....有没有更好的方法来用 Python 做到这一点？

score 1 · Accepted Answer

尝试issuesFound从列表更改为设置：

issuesFound = set()

并使用add而不是append：

issuesFound.add(line)

score 0 · Accepted Answer

A big part of the reason your code is so slow is that if line not in issuesFound:. This requires a linear search through a huge list.

You can fix that by adding a set of seen issues (which is effectively free to search). That reduces your time from O(NM) to O(N).

But really, you can make this even simpler by removing the if entirely.

First, you can generate the issuesFound list after the fact from the keys of issuesFoundCounter. For each line in issuesFoundCounter, you want that line, and then its knownIssues[line]. So:

issuesFound = list(flatten((line, knownIssues[line]) for line in issuesFoundCounter))

(I'm using the flatten recipe from the itertools docs. You can copy that into your code, or you can just write this with itertools.chain.from_iterable instead of flatten.)

And that means you can just search if line not in issuesFoundCounter: instead of in issuesFound:, which is already a dict (and therefore effectively free to search). But if you just use setdefault—or, even simpler, use a defaultdict or a Counter instead of a dict—you can make that automatic.

So, if issuesFoundCounter is a Counter, the whole thing reduces to this:

for newLogLine in newLogFile:
    for line in knownIssues:
        if line in newLogLine:
            issuesFoundCounter[line] += 1

And you can turn that into a generator expression, eliminating the slow-ish explicit looping in Python with faster looping inside the interpreter guts. This is only going to be, say, a fixed 5:1 speedup, as opposed to the linear-to-constant speedup from the first half, but it's still worth considering:

issuesFoundCounter = collections.Counter(line 
                                         for newLogLine in newLogFile
                                         for line in knownIssues
                                         if line in newLogLine)

The only problem with this is that the issuesFound list is now in arbitrary order, instead of in the order that the issues are found. If that's important, just use an OrderedCounter instead of a Counter. There's a simple recipe in the collections docs, but for your case, it can be as simple as:

class OrderedCounter(Counter, OrderedDict):
    pass

python - 过滤海量linux日志文件

2 回答 2

Related

Reference