python - pymongo: more efficient update

Question

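I'm trying to push some large files (around 4 million records) into a mongo instance. What I basically want to achieve is to update the existing data with the data from the file. The algorithm looks something like this: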

import time

rowHeaders = ('orderId', 'manufacturer', 'itemWeight')
for row in dataFile:
    row = row.strip('\n').split('\t')
    row = dict(zip(rowHeaders, row))

    # find_one() returns a single document or None; find() returns a cursor
    mongoRow = mongoCollection.find_one({'orderId': row['orderId']})
    if mongoRow is not None:
        if mongoRow['itemWeight'] != row['itemWeight']:
            row['tsUpdated'] = time.time()
    else:
        row['tsUpdated'] = time.time()

    mongoCollection.update({'orderId': row['orderId']}, row, upsert=True)

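So: if the weights are the same, update the whole row except 'tsUpdated'; if the row is not in mongo yet, add it as a new row; otherwise update the whole row including 'tsUpdated'. That is the algorithm.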

The question is: can this be done faster, more simply, and more efficiently from mongo's point of view? (Eventually with some kind of bulk insert.)

score 6 · Accepted Answer

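Combine a unique index on orderId with an update query that also checks for a change in itemWeight. If the orderId is already present and itemWeight is the same, the query matches nothing and the upsert falls back to an insert; the unique index then rejects that insert, so a row is never written when only its timestamp would change.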

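# Legacy PyMongo API: ensure_index() and update() were removed in PyMongo 4
# (create_index() and replace_one() are the current equivalents).
mongoCollection.ensure_index('orderId', unique=True)
row['tsUpdated'] = time.time()  # every candidate row carries a fresh timestamp
mongoCollection.update({'orderId': row['orderId'],
    'itemWeight': {'$ne': row['itemWeight']}}, row, upsert=True)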

My benchmark shows a 5-10x performance improvement over your algorithm (depending on the ratio of inserts to updates).
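
Since the question also asks about bulk operations, here is a rough sketch of how the same unique-index trick might be batched with a current PyMongo release. It is not part of the benchmark above. The helper names (parse_row, build_upsert, load_file) and the batch_size default are made up for illustration, and the sketch assumes that duplicate-key errors (MongoDB error code 11000) can be safely ignored because they only mean "row unchanged".

```python
import time

ROW_HEADERS = ('orderId', 'manufacturer', 'itemWeight')
DUPLICATE_KEY = 11000  # MongoDB error code for a unique-index violation


def parse_row(line):
    """Turn one tab-separated line into a dict keyed by ROW_HEADERS."""
    return dict(zip(ROW_HEADERS, line.rstrip('\n').split('\t')))


def build_upsert(row, now=None):
    """Return (query, replacement) for one row (hypothetical helper).

    The query matches only when the stored itemWeight differs, so an
    unchanged row falls through to an insert that the unique index rejects.
    """
    replacement = dict(row, tsUpdated=time.time() if now is None else now)
    query = {'orderId': row['orderId'],
             'itemWeight': {'$ne': row['itemWeight']}}
    return query, replacement


def load_file(collection, lines, batch_size=1000):
    """Upsert all lines in unordered batches; batch_size is an arbitrary guess."""
    # Imported lazily so the helpers above can be used without pymongo installed.
    from pymongo import ReplaceOne
    from pymongo.errors import BulkWriteError

    collection.create_index('orderId', unique=True)

    def flush(batch):
        try:
            # ordered=False lets the server continue past individual failures.
            collection.bulk_write(batch, ordered=False)
        except BulkWriteError as exc:
            details = exc.details
            unexpected = [err for err in details.get('writeErrors', [])
                          if err.get('code') != DUPLICATE_KEY]
            # Duplicate keys are expected for unchanged rows; re-raise the rest.
            if unexpected or details.get('writeConcernErrors'):
                raise

    batch = []
    for line in lines:
        query, replacement = build_upsert(parse_row(line))
        batch.append(ReplaceOne(query, replacement, upsert=True))
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)
```

It would be called as load_file(mongoCollection, dataFile). Note that, like the snippet above, this skips unchanged rows entirely instead of rewriting them without 'tsUpdated', and whether batching actually helps depends on how many rows end up as rejected inserts, so it is worth benchmarking on your own data.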
