I have a fuzzy string matching script that looks for roughly 30K needles in a haystack of 4 million company names. While the script works fine, my attempt at speeding it up via parallel processing on an AWS h1.xlarge failed because I ran out of memory.
Rather than trying to get more memory, as was explained in the answer to my previous question, I'd like to find out how to optimize the workflow - I'm fairly new to this, so there should be plenty of room. By the way, I've already tried queues (they worked too, but hit the same MemoryError), and I've gone through a bunch of very helpful SO contributions, but I'm not quite there yet.
Here is what seems to be the most relevant part of the code. I hope it sufficiently clarifies the logic - happy to provide more information as needed:
from collections import defaultdict
from multiprocessing import Pool
import itertools
# assuming python-Levenshtein provides the setratio scoring and pandas provides isnull
import Levenshtein as levi
from pandas import isnull

def getHayStack():
    ## loads a few million company names into an id: name dict
    return hayCompanies

def getNeedles(*args):
    ## loads a subset of the 30K companies into an id: name dict (for allocation to workers)
    return needleCompanies

def findNeedle(needle, haystack):
    """ Identify the best match and return the result with its score """
    results = {}
    for hayID, hayCompany in haystack.iteritems():
        if not isnull(hayCompany):
            results[hayID] = levi.setratio(needle.split(' '),
                                           hayCompany.split(' '))
    scores = list(results.values())
    resultIDs = list(results.keys())
    needleID = resultIDs[scores.index(max(scores))]
    return [needleID, haystack[needleID], max(scores)]

def runMatch(args):
    """ Execute findNeedle and process results for a poolWorker batch """
    batch, first = args
    last = first + batch
    hayCompanies = getHayStack()
    needleCompanies = getNeedles(first, last)
    needles = defaultdict(list)
    current = first
    for needleID, needleCompany in needleCompanies.iteritems():
        current += 1
        needles[needleID] = findNeedle(needleCompany, hayCompanies)
    ## Then store results

if __name__ == '__main__':
    numProcesses = 4  # placeholder; set elsewhere in the real script
    pool = Pool(processes=numProcesses)
    totalTargets = len(getNeedles('all'))
    targetsPerBatch = totalTargets / numProcesses
    pool.map_async(runMatch,
                   itertools.izip(itertools.repeat(targetsPerBatch),
                                  xrange(0, totalTargets, targetsPerBatch))).get(99999999)
    pool.close()
    pool.join()
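For reference, the scoring call assumes python-Levenshtein's setratio on whitespace-split tokens (the import alias above reflects that assumption). As far as I understand it, a single comparison looks roughly like this, with made-up company names:

import Levenshtein as levi

# made-up names, just to illustrate the token-set comparison
needle = "acme holdings ltd"
hayCompany = "acme holdings limited"
score = levi.setratio(needle.split(' '), hayCompany.split(' '))
# setratio returns a float between 0 and 1; higher means more similar token sets
print score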
So I guess the questions are: How can I avoid loading the haystack for every worker - e.g. by sharing the data, or by taking a different approach such as distributing chunks of the (much larger) haystack to the workers rather than the needles? How else can I improve memory usage by avoiding or eliminating clutter?
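To make the "sharing data" part concrete, this is the kind of restructuring I have in mind - just a minimal, untested sketch: it assumes a Linux fork-based Pool (so the parent's haystack dict is inherited copy-on-write by the workers instead of being reloaded in each one), it reuses getHayStack/getNeedles/findNeedle from the snippet above, and the numProcesses/batchSize values are made up:

from multiprocessing import Pool

## module-level haystack, so forked workers inherit it instead of each
## calling getHayStack() again (relies on fork, i.e. Linux)
haystack = {}

def matchBatch(needleItems):
    """ Match one batch of (needleID, needleCompany) pairs against the shared haystack """
    ## findNeedle is the function from the snippet above
    out = {}
    for needleID, needleCompany in needleItems:
        out[needleID] = findNeedle(needleCompany, haystack)
    return out

if __name__ == '__main__':
    numProcesses = 4   # made-up value
    batchSize = 500    # made-up value: needles per task
    haystack = getHayStack()              # load once, before the Pool forks
    needleItems = getNeedles('all').items()
    batches = [needleItems[i:i + batchSize]
               for i in xrange(0, len(needleItems), batchSize)]
    pool = Pool(processes=numProcesses)
    for batchResult in pool.map(matchBatch, batches):
        pass  ## store batchResult here
    pool.close()
    pool.join()

If that turns out to be a dead end (e.g. because copy-on-write doesn't hold up once the workers start touching the dict), the alternative I'm considering is the reverse split: give each worker a slice of the haystack, score all 30K needles against that slice, and reduce the partial best scores at the end.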