A big part of the reason your code is so slow is that if line not in issuesFound:
. This requires a linear search through a huge list.
You can fix that by adding a set
of seen issues (which is effectively free to search). That reduces your time from O(NM) to O(N).
But really, you can make this even simpler by removing the if
entirely.
First, you can generate the issuesFound
list after the fact from the keys of issuesFoundCounter
. For each line in issuesFoundCounter
, you want that line, and then its knownIssues[line]
. So:
issuesFound = list(flatten((line, knownIssues[line]) for line in issuesFoundCounter))
(I'm using the flatten
recipe from the itertools
docs. You can copy that into your code, or you can just write this with itertools.chain.from_iterable
instead of flatten
.)
And that means you can just search if line not in issuesFoundCounter:
instead of in issuesFound:
, which is already a dict
(and therefore effectively free to search). But if you just use setdefault
—or, even simpler, use a defaultdict
or a Counter
instead of a dict
—you can make that automatic.
So, if issuesFoundCounter
is a Counter
, the whole thing reduces to this:
for newLogLine in newLogFile:
for line in knownIssues:
if line in newLogLine:
issuesFoundCounter[line] += 1
And you can turn that into a generator expression, eliminating the slow-ish explicit looping in Python with faster looping inside the interpreter guts. This is only going to be, say, a fixed 5:1 speedup, as opposed to the linear-to-constant speedup from the first half, but it's still worth considering:
issuesFoundCounter = collections.Counter(line
for newLogLine in newLogFile
for line in knownIssues
if line in newLogLine)
The only problem with this is that the issuesFound
list is now in arbitrary order, instead of in the order that the issues are found. If that's important, just use an OrderedCounter
instead of a Counter
. There's a simple recipe in the collections
docs, but for your case, it can be as simple as:
class OrderedCounter(Counter, OrderedDict):
pass