这是我第一次使用 Python 这么大,所以我需要一些帮助。
我有一个具有以下结构的mongodb(或python dict):
{
"_id": { "$oid" : "521b1fabc36b440cbe3a6009" },
"country": "Brazil",
"id": "96371952",
"latitude": -23.815124482000001649,
"longitude": -45.532670811999999216,
"name": "coffee",
"users": [
{
"id": 277659258,
"photos": [
{
"created_time": 1376857433,
"photo_id": "525440696606428630_277659258",
},
{
"created_time": 1377483144,
"photo_id": "530689541585769912_10733844",
}
],
"username": "foo"
},
{
"id": 232745390,
"photos": [
{
"created_time": 1369422344,
"photo_id": "463070647967686017_232745390",
}
],
"username": "bar"
}
]
}
现在,我想创建两个文件,一个包含摘要,另一个包含每个连接的权重。我适用于小型数据集的循环如下:
#a is the dataset
data = db.collection.find()
a =[i for i in data]
#here go the connections between the locations
edges = csv.writer(open("edges.csv", "wb"))
#and here the location data
nodes = csv.writer(open("nodes.csv", "wb"))
for i in a:
#find the users that match
for q in a:
if i['_id'] <> q['_id'] and q.get('users') :
weight = 0
for user_i in i['users']:
for user_q in q['users']:
if user_i['id'] == user_q['id']:
weight +=1
if weight>0:
edges.writerow([ i['id'], q['id'], weight])
#find the number of photos
photos_number =0
for p in i['users']:
photos_number += len(p['photos'])
nodes.writerow([ i['id'],
i['name'],
i['latitude'],
i['longitude'],
len(i['users']),
photos_number
])
缩放问题:我有 20000 个位置,每个位置可能有多达 2000 个用户,每个用户可能有大约 10 张照片。
有没有更有效的方法来创建上述循环?也许是多线程、JIT、更多索引?因为如果我在单个线程中运行上述内容最多可以得到 20000^2 *2000 *10 结果...
那么如何才能更有效地处理上述问题呢?谢谢