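I'm trying to do some text processing on roughly 200,000 entries in a SQLite database that I access through SQLAlchemy. I'd like to parallelize it (I'm looking at Parallel Python), but I'm not sure how to go about it.
I want to commit the session each time an entry is processed, so that if I have to stop the script I don't lose the work it has already done. However, when I try to pass the session.commit() command to the callback function, it doesn't seem to work.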
from assignDB import *
from sqlalchemy.orm import sessionmaker
import pp, sys, fuzzy_substring
def matchIng(rawIng, ingreds):
    maxScore = 0
    choice = ""
    for (ingred, parentIng) in ingreds.iteritems():
        score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
        if score > maxScore:
            maxScore = score
            choice = ingred
            refIng = parentIng
    return (refIng, choice, maxScore)

def callbackFunc(match, session, inputTuple):
    print inputTuple
    match.refIng_id = inputTuple[0]
    match.refIng_name = inputTuple[1]
    match.matchScore = inputTuple[2]
    session.commit()

# tuple of all parallel python servers to connect with
ppservers = ()
#ppservers = ("10.0.0.1",)
if len(sys.argv) > 1:
    ncpus = int(sys.argv[1])
    # Creates jobserver with ncpus workers
    job_server = pp.Server(ncpus, ppservers=ppservers)
else:
    # Creates jobserver with automatically detected number of workers
    job_server = pp.Server(ppservers=ppservers)
print "Starting pp with", job_server.get_ncpus(), "workers"

ingreds = {}
for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng):
    ingreds[synonym] = parentIng

jobs = []
for match in session.query(Ingredient).filter(Ingredient.refIng_id == None):
    rawIng = match.ingredient
    jobs.append((match, job_server.submit(matchIng, (rawIng, ingreds), (fuzzy_substring,),
                                          callback=callbackFunc, callbackargs=(match, session))))
The session is imported from assignDB. I'm not getting any errors; the database just isn't being updated.
Thanks for your help.
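A minimal sketch of the "commit after every processed entry" behaviour described above, so that finished work survives an interruption. It is not the code from this question: it is written for Python 3, it uses the standard-library sqlite3 and concurrent.futures modules instead of SQLAlchemy and Parallel Python, and the table layout, the score function and process_all are hypothetical names chosen for illustration. The one idea it shows is that only the main thread touches the database connection while the workers do nothing but compute.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def score(text):
    # Hypothetical stand-in for the expensive matching step.
    return len(text)


def process_all(conn, workers=4):
    """Score every unprocessed row, committing as each result arrives."""
    rows = conn.execute(
        "SELECT id, ingredient FROM ingredient WHERE match_score IS NULL"
    ).fetchall()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each pending future back to the row it belongs to.
        futures = {pool.submit(score, text): row_id for row_id, text in rows}
        for future in as_completed(futures):
            # This loop runs in the main thread, the only one using conn.
            conn.execute(
                "UPDATE ingredient SET match_score = ? WHERE id = ?",
                (future.result(), futures[future]),
            )
            conn.commit()  # rows finished so far are kept if the script stops
    return len(rows)


# Assumed schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ingredient "
    "(id INTEGER PRIMARY KEY, ingredient TEXT, match_score REAL)"
)
conn.executemany(
    "INSERT INTO ingredient (ingredient) VALUES (?)",
    [("2 cups sugar",), ("1 tbsp flour",), ("salt",)],
)
conn.commit()

processed = process_all(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM ingredient WHERE match_score IS NULL"
).fetchone()[0]
```

Because the query selects only rows that are still unscored, rerunning the script after an interruption picks up where it left off, much like the refIng_id == None filter in the question. Threads keep the sketch self-contained; CPU-bound scoring would likely need a process pool to gain any speed.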
Update: here is the code for fuzzy_substring.
def fuzzy_substring(needle, haystack):
    """Calculates the fuzzy match of needle in haystack,
    using a modified version of the Levenshtein distance
    algorithm.
    The function is modified from the levenshtein function
    in the bktree module by Adam Hupp"""
    m, n = len(needle), len(haystack)

    # base cases
    if m == 1:
        return not needle in haystack
    if not n:
        return m

    row1 = [0] * (n+1)
    for i in range(0,m):
        row2 = [i+1]
        for j in range(0,n):
            cost = ( needle[i] != haystack[j] )
            row2.append( min(row1[j+1]+1, # deletion
                             row2[j]+1, #insertion
                             row1[j]+cost) #substitution
                       )
        row1 = row2
    return min(row1)

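I got it from here: Fuzzy Substring. In my case, the "needle" is one of about 8,000 possible choices, and the haystack is the raw string I'm trying to match. I loop over all of the possible "needles" and pick the one with the highest score.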
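The loop over all "needles" can be sketched on its own, without the database or the job server. This is a Python 3 restatement for illustration, not the code above: best_match and the sample synonyms are hypothetical, the fuzzy_substring copy is trimmed to the core recurrence, and the score uses true division, whereas `/` on two integers in Python 2 truncates the result.

```python
def fuzzy_substring(needle, haystack):
    """Smallest edit distance between needle and any substring of haystack."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if n == 0:
        return m
    # A zero first row lets a match start anywhere in haystack at no cost.
    row1 = [0] * (n + 1)
    for i in range(m):
        row2 = [i + 1]
        for j in range(n):
            cost = needle[i] != haystack[j]
            row2.append(min(row1[j + 1] + 1,   # deletion
                            row2[j] + 1,       # insertion
                            row1[j] + cost))   # substitution
        row1 = row2
    # Taking the minimum lets the match end anywhere in haystack.
    return min(row1)


def best_match(raw, candidates):
    """Return (parent, synonym, score) for the best synonym, or None."""
    best = None
    for synonym, parent in candidates.items():
        # Longer synonyms with fewer edits score higher.
        score = len(synonym) / (fuzzy_substring(synonym, raw) + 1)
        if best is None or score > best[2]:
            best = (parent, synonym, score)
    return best


# Hypothetical synonym -> parent ingredient mapping.
candidates = {"sugar": "sugar", "brown sugar": "sugar", "flour": "flour"}
result = best_match("2 cups brown suger, packed", candidates)
# result == ("sugar", "brown sugar", 5.5)
```

Returning None for an empty candidate set avoids the unbound refIng that matchIng would hit if no synonym scored above zero.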