
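I have two collections (A and B), each with about 70,000 documents. If I compare A and B, 95% of the documents will be identical and only 5% will differ. Every document in A and B has exactly the same structure. A is a permanent collection and B is a temporary one. I want to merge B into A. If a document from B exists in A --> only update the "dateLastSeen" field. If a document from B does not exist in A --> insert that document into A.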

...I am using the Python driver (if that matters).

What is the most efficient way to do this? Thank you.


1 Answer


In terms of query count, the most efficient approach is to bulk update all documents that share a dateLastSeen value in one go (one update per distinct date) and to bulk insert all the documents that need inserting.

Given that 95% of the documents are the same and you want to set A.dateLastSeen to B.dateLastSeen, single updates would mean ~66,500 updates (95% of 70,000), leaving ~3,500 inserts.

Loading all of A and B into memory and then processing them is one possibility.

You could build a bulk insert list and append to it whenever a doc from B is missing from A, along with a bulk update dictionary keyed by dateLastSeen, where each entry holds the list of documents to update. Whether this is really worth it depends on how likely it is that documents share the same dateLastSeen value.
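
A minimal sketch of that in-memory split, in plain Python so it runs without a database. It assumes each document is a dict and that a unique `_id` field identifies the same document in A and B; the question does not say how documents are matched, so `_id` and the function name `plan_merge` are only illustrative.

```python
from collections import defaultdict


def plan_merge(a_docs, b_docs, key="_id"):
    """Split B into a bulk insert list and a bulk update dict keyed by date.

    Assumption: `key` uniquely identifies a document in both collections.
    """
    a_keys = {doc[key] for doc in a_docs}  # set gives O(1) membership checks
    to_insert = []                         # docs from B that are missing in A
    updates_by_date = defaultdict(list)    # dateLastSeen -> keys to update

    for doc in b_docs:
        if doc[key] in a_keys:
            updates_by_date[doc["dateLastSeen"]].append(doc[key])
        else:
            to_insert.append(doc)

    return to_insert, dict(updates_by_date)
```

If the store is MongoDB accessed through a current PyMongo (an assumption, since the question only mentions "the Python driver"), each date group could then become a single `update_many({"_id": {"$in": ids}}, {"$set": {"dateLastSeen": date}})` and the insert list a single `insert_many(to_insert)`.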

Alternatively, simplify it and accept the higher query cost: process B in batches of 1000, load the equivalent 1000 from A, then compare and update / bulk insert. This keeps the memory footprint down and adds only ~210 extra queries in total for fetching the batches of data (~70 batch fetches from B, ~70 from A and ~70 bulk inserts).
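
The batched variant can be sketched the same way. Here a plain dict (`a_store`, keyed by a hypothetical `_id` field) stands in for collection A, and the function only counts the round trips tallied above (batch fetches and bulk inserts), not the individual updates. All names are illustrative assumptions rather than driver API.

```python
def iter_batches(docs, size=1000):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def merge_batched(b_docs, a_store, size=1000, key="_id"):
    """Merge b_docs into a_store in batches; return simulated round trips.

    a_store is a dict keyed by `key`, standing in for collection A.
    """
    round_trips = 0
    for batch in iter_batches(b_docs, size):
        round_trips += 1  # one fetch of this batch from B
        existing = {doc[key] for doc in batch if doc[key] in a_store}
        round_trips += 1  # one fetch of the matching documents from A

        new_docs = []
        for doc in batch:
            if doc[key] in existing:
                # Document already in A: only refresh dateLastSeen.
                a_store[doc[key]]["dateLastSeen"] = doc["dateLastSeen"]
            else:
                new_docs.append(doc)

        if new_docs:
            for doc in new_docs:
                a_store[doc[key]] = dict(doc)
            round_trips += 1  # one bulk insert for this batch
    return round_trips
```

With 70,000 documents in B and new documents spread across every batch, this gives 70 batches at 3 round trips each, which matches the ~210 figure.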

answered 2013-06-13T10:08:10.060