
I'm using multiprocessing with a large (~5 GB) read-only dict used by the worker processes. I started by passing the whole dict to each process, but ran into memory constraints, so I changed to a multiprocessing Manager dict (after reading this question: How to share a dictionary between multiple processes in python without locking).

Since the change, performance has dived. What alternatives are there for a faster shared data store? Each key is a 40-character string, and each value is a tuple of two small strings.


1 Answer


Use a memory-mapped file. While that sounds crazy performance-wise, it may not be if you use a couple of clever tricks:

  1. Sort the records by key so that you can binary-search the file to locate one
  2. Try to make every line of the file the same length ("fixed-width records"); see the sketch below
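With fixed-width records, the lookup is a direct binary search over the mapped file. A minimal sketch, assuming a hypothetical sorted file records.dat whose records are a 40-byte ASCII key, two 20-byte space-padded value fields, and a trailing newline (the widths and file name are assumptions; adjust them to your data):

import mmap

KEY_LEN = 40
VAL_LEN = 20
REC_LEN = KEY_LEN + 2 * VAL_LEN + 1  # +1 for the trailing newline

def lookup(mm, key):
    # Binary search over fixed-width records in a sorted, memory-mapped file
    key = key.encode("ascii")
    lo, hi = 0, len(mm) // REC_LEN
    while lo < hi:
        mid = (lo + hi) // 2
        rec = mm[mid * REC_LEN:(mid + 1) * REC_LEN]
        k = rec[:KEY_LEN]
        if k == key:
            # Split the remainder into the two value fields and strip padding
            v1 = rec[KEY_LEN:KEY_LEN + VAL_LEN].rstrip().decode()
            v2 = rec[KEY_LEN + VAL_LEN:KEY_LEN + 2 * VAL_LEN].rstrip().decode()
            return (v1, v2)
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None

with open("records.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(lookup(mm, "0" * 40))

Because the file is memory-mapped, the ~5 GB live in the page cache and are shared between all worker processes for free; each process only pays for the pages it actually touches.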

If you can't use fixed-width records, use this pseudocode:

Read 1 KB in the middle of the search range (or enough to be sure the longest line fits *twice*)
Find the first newline character
Find the next newline character
Get the line as the substring between the two positions
Check the key (the first 40 bytes)
If the key read is greater than the one sought, repeat with a 1 KB block in the lower half of the search range; otherwise repeat in the upper half
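Here is the same search as runnable Python, a minimal sketch assuming a hypothetical sorted file records.txt with one newline-terminated record per line: the 40-character key followed by the two values, tab-separated (the field layout and file name are assumptions; adapt the parsing to your format). Instead of reading a fixed 1 KB block, it aligns each probe to a full line with find/rfind, which amounts to the same idea:

import mmap

KEY_LEN = 40

def lookup(mm, key):
    # Binary search over sorted variable-length lines, by byte offset
    key = key.encode("ascii")
    lo, hi = 0, len(mm)  # lo is always the start of a line
    while lo < hi:
        mid = (lo + hi) // 2
        start = mm.rfind(b"\n", 0, mid) + 1  # start of the line containing mid
        end = mm.find(b"\n", start)
        if end == -1:
            end = len(mm)
        line = mm[start:end]
        k = line[:KEY_LEN]
        if k == key:
            # Assumed layout: key, value 1, value 2, separated by tabs
            return tuple(line[KEY_LEN + 1:].decode().split("\t"))
        if k < key:
            lo = end + 1  # key read is too small: search the upper half
        else:
            hi = start    # key read is too big: search the lower half
    return None

with open("records.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(lookup(mm, "0" * 40))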

If that still isn't fast enough, consider writing an extension in C.

answered 2013-10-02T08:44:43.663