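Question first
How can I search my SQLite database as quickly as possible?
Should I parse the address data from all 60,000 rows in Excel, load it into a list, and then search for all of it at once?
Switching from simply scanning a plain-text file to SQLite made my script 3x faster, but I still think it could be faster.
Thanks in advance!
Database
I have a SQLite database of city names, postal codes, coordinates, etc. that I created from Geonames' postal code data dump: Geonames Postal Codes
The database has one table per country (Germany, US, UK, etc.; 72 in total), and each table has anywhere from a few dozen to tens of thousands of rows in the following format: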
country code : iso country code, 2 characters
postal code : varchar(20)
place name : varchar(180)
admin name1 : 1. order subdivision (state) varchar(100)
admin code1 : 1. order subdivision (state) varchar(20)
admin name2 : 2. order subdivision (county/province) varchar(100)
admin code2 : 2. order subdivision (county/province) varchar(20)
admin name3 : 3. order subdivision (community) varchar(100)
admin code3 : 3. order subdivision (community) varchar(20)
latitude : estimated latitude (wgs84)
longitude : estimated longitude (wgs84)
accuracy : accuracy of lat/lng from 1=estimated to 6=centroid
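To make the layout concrete, a per-country table along these lines could be declared as in the sketch below. The table name `de_` follows the `country_code.lower()+"_"` naming used in the script further down, and `postal_code` and `place_name` are the columns its queries reference; the other column names, the column types, and the sample row are assumptions made for illustration.

```python
import sqlite3

# Assumed layout for one country table. Only postal_code and place_name
# are confirmed by the queries in the script; the rest are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS de_ (
    country_code TEXT,
    postal_code  TEXT,
    place_name   TEXT,
    admin_name1  TEXT,
    admin_code1  TEXT,
    admin_name2  TEXT,
    admin_code2  TEXT,
    admin_name3  TEXT,
    admin_code3  TEXT,
    latitude     REAL,
    longitude    REAL,
    accuracy     INTEGER
)
"""

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(SCHEMA)

# Hypothetical sample row, not copied from the real dump.
conn.execute(
    "INSERT INTO de_ VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("DE", "10115", "Berlin", "Berlin", "BE", "", "00", "", "", 52.53, 13.38, 4),
)

row = conn.execute(
    "SELECT place_name, latitude, longitude FROM de_ WHERE postal_code = ?",
    ("10115",),
).fetchone()
print(row)
```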
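Workflow
My current Python script does the following:
- Reads a row from the Excel file
- Parses the address and location data (with a lot of unrelated stuff in between)
- Searches the SQLite database for a match
- Writes the information from the matching SQLite row to a .CSV file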
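The four steps above can be outlined as a single loop. This is only a sketch under assumptions: `rows` stands in for whatever the Excel reader yields, `table` stands in for the database, and `parse_address`, `find_match`, and `process` are hypothetical placeholder names rather than functions from the actual script.

```python
import csv
import io


def parse_address(row):
    """Hypothetical parser: pull the zipcode and city out of one input row."""
    return row.get("zipcode"), row.get("city")


def find_match(zipcode, city, table):
    """Hypothetical lookup standing in for the SQLite search."""
    for record in table:
        if record["postal_code"] == zipcode:
            return record
    return None


def process(rows, table, out_file):
    """Run every input row through parse -> search -> write; count matches."""
    writer = csv.writer(out_file)
    matched = 0
    for row in rows:
        zipcode, city = parse_address(row)
        record = find_match(zipcode, city, table)
        if record is not None:
            writer.writerow([record["postal_code"], record["place_name"]])
            matched += 1
    return matched


# Tiny made-up inputs in place of the Excel file and the database.
rows = [{"zipcode": "10115", "city": "Berlin"}, {"zipcode": "99999", "city": "?"}]
table = [{"postal_code": "10115", "place_name": "Berlin"}]
out = io.StringIO()
matched = process(rows, table, out)
print(matched, repr(out.getvalue()))
```

Structured this way, the search runs once per input row, which is why its cost dominates for a 60,000-row file.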
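The Excel file has about 60,000 rows, and each row goes through my entire Python script (the process above).
My address data is very inconsistent: it contains a mix of postal codes, city names, and country names. Sometimes all of these are present in an Excel row, sometimes not. It also comes with many typos and alternate names.
Because the data is so inconsistent, and because people sometimes enter a postal code and a city that don't match, I currently have my Python script try a bunch of different search queries, such as: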
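- Check whether [postal code] matches the column exactly AND [place name] matches the column exactly
- Check whether [postal code] matches the column exactly AND [place name] is contained in the column
- Check whether [postal code] matches the column exactly AND [place name] (split into words) is in the column
- Check whether just [postal code] matches the column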
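One way to express this fallback order is as a list of query conditions tried in sequence until one returns a row. The sketch below is an illustration under assumptions, not the script itself: `first_match`, the strategy labels, the in-memory `de_` table, the separator set used for splitting, and the sample data are all made up. It is written for Python 3.

```python
import re
import sqlite3


def first_match(conn, table, zipcode, city):
    """Try progressively looser queries; return (strategy, row) or None.

    `table` is interpolated into the SQL text, so it must come from a
    trusted list of table names, never from the input data.
    """
    # Assumed separators: semicolon, slash, parentheses, space, comma, hyphen.
    words = [w for w in re.split(r"[;/() ,-]+", city) if w]
    strategies = [
        ("exact", "postal_code = ? AND place_name = ? COLLATE NOCASE",
         [zipcode, city]),
        # LIKE is already case-insensitive for ASCII in SQLite by default.
        ("contains", "postal_code = ? AND place_name LIKE ?",
         [zipcode, "%" + city + "%"]),
    ]
    if words:
        marks = ",".join("?" for _ in words)
        strategies.append(
            ("words",
             "postal_code = ? AND place_name COLLATE NOCASE IN (%s)" % marks,
             [zipcode] + words))
    strategies.append(("zip_only", "postal_code = ?", [zipcode]))

    for name, where, params in strategies:
        sql = "SELECT * FROM %s WHERE %s LIMIT 1" % (table, where)
        row = conn.execute(sql, params).fetchone()
        if row is not None:
            return name, row
    return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE de_ (postal_code TEXT, place_name TEXT)")
conn.execute("INSERT INTO de_ VALUES ('10115', 'Berlin')")

print(first_match(conn, "de_", "10115", "berlin"))        # exact match, ignoring case
print(first_match(conn, "de_", "10115", "Berlin Mitte"))  # falls through to the word match
print(first_match(conn, "de_", "10115", "Potsdam"))       # only the postal code matches
```

Returning the strategy label alongside the row makes it possible to record how loosely each address was matched.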
Python script
Here is the relevant part of the Python script. As you can see, it seems quite inefficient:
if has_country_code == True:
    not_in_list = False
    country = country_code.lower()+"_"
    print "HAS COUNTRY"
    if has_zipcode == True and has_city_name == True:
        print "HAS COUNTRY2"
        success = False
        try:
            curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name = ? COLLATE NOCASE", (zipcode, city,))
            for row in curs:
                success = True
                break
        except:
            not_in_list = True
            success = True
        if success != True:
            curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name LIKE ? COLLATE NOCASE", (zipcode,"%"+city+"%",))
            for row in curs:
                success = True
                break
        if success != True:
            newCity = ""
            newCity = filter(None,re.split('[; / ( ) - ,]',city))
            questionMarks = ",".join(["?" for w in newCity])
            curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name IN ("+questionMarks+") COLLATE NOCASE", ([zipcode]+newCity))
            for row in curs:
                success = True
                break
        if success != True:
            curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? COLLATE NOCASE", (zipcode,))
            for row in curs:
                success = True
                break
        if success != True:
            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name = ? COLLATE NOCASE", (city,))
            for row in curs:
                success = True
                break
        if success != True:
            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE ? COLLATE NOCASE", ("%"+city+"%",))
            for row in curs:
                success = True
                break
        if success != True:
            newCity = ""
            newCity = filter(None,re.split('[; / ( ) - ,]',city))
            questionMarks = ",".join(["?" for w in newCity])
            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name IN ("+questionMarks+") COLLATE NOCASE", (newCity))
            for row in curs:
                success = True
                break
        if success != True:
            newCity = ""
            newCity = filter(None,re.split('[; / ( ) - ,]',city))
            newCity.sort(key=len, reverse=True)
            newCity = (["%"+w+"%" for w in newCity])
            for item in newCity:
                curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE (?) COLLATE NOCASE", (item,))
                for row in curs:
                    success = True
                    break
                break
    if has_city_name == True and has_zipcode == False:
        try:
            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name = ? COLLATE NOCASE", (city,))
            for row in curs:
                success = True
                break
        except:
            not_in_list = True
            success = True
        if success != True:
            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE ? COLLATE NOCASE", ("%"+city+"%",))
            for row in curs:
                success = True
                break
        if success != True:
            newCity = ""
            newCity = filter(None,re.split('[; / ( ) - ,]',city))
            questionMarks = ",".join(["?" for w in newCity])
            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name IN ("+questionMarks+") COLLATE NOCASE", (newCity))
            for row in curs:
                success = True
                break
        if success != True:
            newCity = ""
            newCity = filter(None,re.split('[; / ( ) - ,]',city))
            newCity.sort(key=len, reverse=True)
            newCity = (["%"+w+"%" for w in newCity])
            for item in newCity:
                curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE (?) COLLATE NOCASE", (item,))
                for row in curs:
                    success = True
                    break
                break
    if has_city_name == False and has_zipcode == True:
        try:
            curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ?", (zipcode,))
            for row in curs:
                success = True
                break
        except:
            not_in_list = True
            success = True
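A side note on the word-splitting step that appears several times above: inside a regular-expression character class, a hyphen between two other characters denotes a range, so in `'[; / ( ) - ,]'` the ` - ` is read as "space through space" and the pattern does not split on hyphens. The sketch below compares it with a variant that puts the hyphen last so it is taken literally. It is written for Python 3 (the script above uses Python 2 `print` statements and relies on `filter` returning a list); `split_words` is a hypothetical helper name and the sample string is invented.

```python
import re

ORIGINAL = re.compile('[; / ( ) - ,]')  # pattern exactly as written in the script
EXPLICIT = re.compile(r'[;/() ,-]')     # hyphen placed last, so it is a literal


def split_words(pattern, city):
    """Split a city string, drop empty pieces, and sort longest word first."""
    # A list comprehension replaces filter(None, ...), which is lazy in Python 3.
    words = [w for w in pattern.split(city) if w]
    words.sort(key=len, reverse=True)
    return words


sample = "Frankfurt-Hoechst (Main)"
print(split_words(ORIGINAL, sample))  # hyphenated name stays in one piece
print(split_words(EXPLICIT, sample))  # hyphen now acts as a separator
```

Whether hyphens should separate words depends on the data, since some place names are legitimately hyphenated.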