python - Grouping documents that share a phone number

Question

My database consists of one large collection of hotels (about 121,000 of them).

This is what my collection looks like:

{
    "_id" : ObjectId("57bd5108f4733211b61217fa"),
    "autoid" : 1,
    "parentid" : "P01982.01982.110601173548.N2C5",
    "companyname" : "Sheldan Holiday Home",
    "latitude" : 34.169552,
    "longitude" : 77.579315,
    "state" : "JAMMU AND KASHMIR",
    "city" : "LEH Ladakh",
    "pincode" : 194101,
    "phone_search" : "9419179870|253013",
    "address" : "Sheldan Holiday Home|Changspa|Leh Ladakh-194101|LEH Ladakh|JAMMU AND KASHMIR",
    "email" : "",
    "website" : "",
    "national_catidlineage_search" : "/10255012/|/10255031/|/10255037/|/10238369/|/10238380/|/10238373/",
    "area" : "Leh Ladakh",
    "data_city" : "Leh Ladakh"
}

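Each document can have one or more phone numbers, separated by the "|" delimiter.

I have to group together the documents that share a phone number.

This has to work in real time: when a user opens a particular hotel to view its details in the web interface, I should be able to display all the hotels linked to it, grouped by their common phone numbers.

The grouping is transitive: if one hotel is linked to another, and that hotel is linked to a third, all 3 hotels should be grouped together.

Example: hotel A has phone numbers 1|2, B has phone numbers 3|4, and C has phone numbers 2|3. A, B and C should then be grouped together.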

from pymongo import MongoClient
from pprint import pprint #Pretty print 
import re #for regex
#import unicodedata

client = MongoClient()

cLen = 0
cLenAll = 0
flag = 0
countA = 0
countB = 0
list = []
allHotels = []
conContact = []
conId = []
hotelTotal = []
splitListAll = []
contactChk = []

#We'll be passing the value later as parameter via a function call 
#hId = 37443; 

regx = re.compile("^Vivanta", re.IGNORECASE)

#Connection
db = client.hotel
collection = db.hotelData

#Finding hotels wrt search input
for post in collection.find({"companyname":regx}):
    list.append(post)

#Copying all hotels in a list
for post1 in collection.find():
    allHotels.append(post1)

hotelIndex = 11 #Index of hotel selected from search result
conIndex = hotelIndex
x = list[hotelIndex]["companyname"] #Name of selected hotel
y = list[hotelIndex]["phone_search"] #Phone numbers of selected hotel

try:
    splitList = y.split("|") #Splitting of phone numbers and storing in a list 'splitList'
except:
    splitList = y


print "Contact details of",x,":"

#Printing all contacts...
for contact in splitList:   
    print contact 
    conContact.extend(contact)
    cLen = cLen+1

print "No. of contacts in",x,"=",cLen


for i in allHotels:
    yAll = allHotels[countA]["phone_search"]
    try:
        splitListAll.append(yAll.split("|"))
        countA = countA+1
    except:
        splitListAll.append(yAll)
        countA = countA + 1
#   print splitListAll

#count = 0 

#This block has errors
#Add code to stop when no new links occur and optimize the outer for loop
#for j in allHotels:
for contactAll in splitListAll: 
    if contactAll in conContact:
        conContact.extend(contactAll)
#       contactChk = contactAll
#       if (set(conContact) & set(contactChk)):
#           conContact = contactChk
#           contactChk[:] = [] #drop contactChk list
        conId = allHotels[countB]["autoid"]
    countB = countB+1

print "Printing the list of connected hotels..."
for final in collection.find({"autoid":conId}):
    print final

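This is the code I have written in Python. In it, I try to perform a linear search inside a for loop. It has some errors at the moment, but it should work once they are corrected.

I need an optimized version, because linear search has poor time complexity.

I'm quite new to this, so any other suggestions for improving the code are welcome.

Thanks.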

score 0 · Accepted Answer

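The simplest answer to any Python in-memory search problem is "use a dict". A dict gives O(1) average-case key lookup, whereas searching a list is O(N).

Also remember that you can put a Python object into as many dicts (or lists) as you like, and as many times into a single dict or list as you need. The objects are not copied; only a reference is stored.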
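To illustrate the point about references, here is a minimal sketch. The record and the `by_phone` / `by_name` containers are made up for the example; only the field names come from the question.

```python
# Hypothetical hotel record; the field names mirror the question's documents.
hotel = {"companyname": "Hotel A", "phone_search": "1|2"}

by_phone = {}
by_name = {}

# Store the same object under several keys and in several containers.
by_phone.setdefault("1", []).append(hotel)
by_phone.setdefault("2", []).append(hotel)
by_name[hotel["companyname"]] = hotel

# All entries refer to one object, so nothing was copied.
same_object = by_phone["1"][0] is by_phone["2"][0] is by_name["Hotel A"]

# A change made through one reference is visible through the others.
by_name["Hotel A"]["city"] = "Leh"
city_seen_via_index = by_phone["2"][0]["city"]
```

Because every container holds the same object, indexing 121,000 hotels by phone number costs one list entry per phone number, not a copy of each document.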

So the essentials will look like

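hotelsbyphone = {}  # phone number -> list of hotel documents
for hotel in hotels:  # hotels: the documents, e.g. list(collection.find())
    phones = hotel["phone_search"].split("|")
    for phone in phones:
        hotelsbyphone.setdefault(phone, []).append(hotel)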

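At the end of this loop, hotelsbyphone["123456"] will be a list of the hotel objects that have "123456" as one of their phone_search numbers. The key coding feature is the .setdefault(key, []) method, which initializes an empty list if the key is not yet in the dict, so that you can always append to it.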
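As a runnable sketch of that loop, here it is applied to the A/B/C example from the question. The helper name `build_phone_index` is made up for illustration, and it assumes `phone_search` is always a string; empty pieces (from an empty field or a trailing "|") are skipped so they do not end up as a key.

```python
# Sample records based on the question's A/B/C example (not real data).
hotels = [
    {"companyname": "A", "phone_search": "1|2"},
    {"companyname": "B", "phone_search": "3|4"},
    {"companyname": "C", "phone_search": "2|3"},
]


def build_phone_index(hotels):
    """Map each phone number to the list of hotels that list it."""
    index = {}
    for hotel in hotels:
        for phone in hotel.get("phone_search", "").split("|"):
            if phone:  # ignore empty strings left over from splitting
                index.setdefault(phone, []).append(hotel)
    return index


hotelsbyphone = build_phone_index(hotels)

# Hotels that list phone number "2".
names_for_2 = [h["companyname"] for h in hotelsbyphone["2"]]
```

Here `names_for_2` contains A and C, the two hotels that share number 2.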

Once you have built this index, this will be fast

try:
    hotels = hotelsbyphone[x]
    # and process a list of one or more hotels
except KeyError:
    # no hotels exist with that number
    pass

As an alternative to try ... except, test if x in hotelsbyphone:
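A single lookup only returns the hotels that share one particular number. For the chained case in the question (A links to C, C links to B), one possible approach is a breadth-first walk over the same index. This is a sketch, not a tested solution for the full data set: the function name `linked_hotels` and the sample `autoid` values are assumptions made for the example.

```python
from collections import deque


def linked_hotels(start, hotelsbyphone):
    """Return every hotel reachable from `start` through shared phone numbers."""
    seen_ids = {start["autoid"]}
    group = [start]
    queue = deque([start])
    while queue:
        hotel = queue.popleft()
        for phone in hotel["phone_search"].split("|"):
            # Every hotel sharing this number belongs to the same group.
            for other in hotelsbyphone.get(phone, []):
                if other["autoid"] not in seen_ids:
                    seen_ids.add(other["autoid"])
                    group.append(other)
                    queue.append(other)
    return group


# The A/B/C example from the question, plus an unrelated hotel D.
hotels = [
    {"autoid": 1, "companyname": "A", "phone_search": "1|2"},
    {"autoid": 2, "companyname": "B", "phone_search": "3|4"},
    {"autoid": 3, "companyname": "C", "phone_search": "2|3"},
    {"autoid": 4, "companyname": "D", "phone_search": "9"},
]

# Build the phone index, skipping empty strings so they never link hotels.
hotelsbyphone = {}
for hotel in hotels:
    for phone in hotel["phone_search"].split("|"):
        if phone:
            hotelsbyphone.setdefault(phone, []).append(hotel)

group = linked_hotels(hotels[0], hotelsbyphone)
group_names = sorted(h["companyname"] for h in group)
```

Starting from A, the walk reaches C through number 2 and then B through number 3, while D is left out. Each hotel is queued at most once, so the cost is proportional to the size of the group and its phone lists rather than to the whole collection.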
