python - Writing 3,795,790,711 unique key-value pairs to redis

Question

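I want to work with a fairly large corpus. It goes by the name Web 1T-gram and has around 3 trillion tokens. This is my first time using redis, and I am trying to write all the key:value pairs, but it is taking too long. My end goal is to use several redis instances to store the corpus, but for now I am sticking to writing it all to a single instance.

I'm not sure, but is there any way to speed up the writing process? As of now I am only writing to a single redis instance on a machine with 64 GB of RAM. I was wondering whether there is some cache size setting that could be maxed out for redis, or something along those lines?

Thanks.

For reference, I have written the following code: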

import gzip
import os
import sys
import time

import redis

r = redis.StrictRedis(host='localhost',port=6379,db=0)
startTime = time.time()
for l in os.listdir(sys.argv[1]):
    infile = gzip.open(os.path.join(sys.argv[1],l),'rb')
    print l
    # each line is "<n-gram>\t<count>"; every SET is its own round trip
    for line in infile:
        parts = line.split('\t')
        r.set(parts[0],int(parts[1].rstrip('\n')))
    infile.close()
r.bgsave()
print time.time() - startTime, ' seconds '

Update:

I read about mass insertion and have been trying to do that, but I keep failing at that as well. Here is the change to the script:

def gen_redis_proto(*args):
    proto = ''
    proto += '*' + str(len(args)) + '\r\n'
    for arg in args:
        proto += '$' + str(len(arg)) + '\r\n'
        proto += str(arg) + '\r\n'
    return proto
import sys
import os
import gzip
outputFile = open(sys.argv[2],'w')



for l in os.listdir(sys.argv[1]):
        infile = gzip.open(os.path.join(sys.argv[1],l),'rb')
        for line in infile:
                parts = line.split('\t')
                key = parts[0]
                value = parts[1].rstrip('\n')
                #outputFile.write(gen_redis_proto('SET',key,value))
                print gen_redis_proto('SET',key,value)

        infile.close()
        print 'done with file ',l

Credit for the generator method goes to a GitHub user. I did not write it.

If I run this, I get:

ERR wrong number of arguments for 'set' command
ERR unknown command '$18'
ERR unknown command 'ESSPrivacyMark'
ERR unknown command '$3'
ERR unknown command '225'
ERR unknown command ' *3'
ERR unknown command '$3'
ERR wrong number of arguments for 'set' command
ERR unknown command '$25'
ERR unknown command 'ESSPrivacyMark'
ERR unknown command '$3'
ERR unknown command '157'
ERR unknown command ' *3'
ERR unknown command '$3'

This goes on and on. The input has the form

"string" \t count.

Thanks.

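Second update:

I used pipelining, and that really gave me a boost. But it soon ran out of memory. For reference, I have a system with 64 GB of RAM, and I did not think it would run out of memory. The code is below: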

import redis
import gzip
import os
import sys
r = redis.Redis(host='localhost',port=6379,db=0)
pipe = r.pipeline(transaction=False)
i = 0
MAX = 10000  # commands buffered before each execute()
ignore = ['3gm-0030.gz','3gm-0063.gz','2gm-0008.gz','3gm-0004.gz','3gm-0022.gz','2gm-0019.gz']
for l in os.listdir(sys.argv[1]):
    if l in ignore:
        continue
    infile = gzip.open(os.path.join(sys.argv[1],l),'rb')
    print 'doing it for file ',l
    for line in infile:
        parts = line.split('\t')
        key = parts[0]
        value = parts[1].rstrip('\n')
        pipe.set(key,value)
        i = i+1
        if i >= MAX:
            # send the buffered batch in one round trip
            pipe.execute()
            i = 0
    infile.close()
# flush whatever is left in the final partial batch
pipe.execute()

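Is hashing the way to go? I thought 64 GB would be enough. I have only given it a small subset of about 2 billion key-value pairs, not the whole thing.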

score 2 · Accepted Answer

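In your case, what you want may not be possible.

According to this page, your dataset is 24 GB of gzip-compressed text. These files probably contain a lot of similar text, much like a dictionary.

A quick test on the words file from the dict program gives a compression ratio of 3.12: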

> gzip -k -c /usr/share/dict/web2 > words.gz
> du /usr/share/dict/web2  words.gz
2496    /usr/share/dict/web2
800 words.gz
> calc '2496/800'
3.12 /* 3.12 */
> calc '3.12*24'
74.88 /* 7.488e1 */

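So the uncompressed size of your data could easily exceed 64 GB. Even if Redis had no overhead at all, and even if you stored the counts as 16-bit unsigned integers, it would not fit in your RAM.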
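The estimate above can be reproduced as a short calculation. This is only a sketch: it assumes, as the answer does, that the ratio measured on a word list carries over to the n-gram files, and it takes the key count from the question title. The constant and function names are illustrative.

```python
# Back-of-the-envelope estimate using only figures quoted in this thread.
COMPRESSED_GB = 24.0   # gzip size of the dataset, as stated above
RATIO = 2496 / 800     # ratio measured on /usr/share/dict/web2 (du blocks)
RAM_GB = 64.0          # RAM of the asker's machine
N_KEYS = 3795790711    # number of unique pairs, from the question title


def estimated_uncompressed_gb(compressed_gb, ratio):
    """Scale the compressed size by an assumed compression ratio."""
    return compressed_gb * ratio


estimate = estimated_uncompressed_gb(COMPRESSED_GB, RATIO)
# 2 bytes per count if every count were stored as a 16-bit integer
counts_gb = N_KEYS * 2 / 1e9

print("estimated raw size: %.2f GB" % estimate)
print("16-bit counts alone: %.2f GB" % counts_gb)
print("fits in RAM:", estimate <= RAM_GB)
```

The real ratio for the corpus may differ from the word-list ratio, so the figure is a rough lower bound on the problem rather than a measurement.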

Looking at the sample, most of the keys are relatively short:

serve as the incoming   92
serve as the incubator  99
serve as the independent    794
serve as the index  223
serve as the indication 72
serve as the indicator  120
serve as the indicators 45
serve as the indispensable  111
serve as the indispensible  40
serve as the individual 234
serve as the industrial 52

You could hash the keys, but on average it probably won't save you much:

In [1]: from hashlib import md5

In [2]: data = '''serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52'''

In [3]: lines = data.splitlines()

In [4]: kv = [s.rsplit(None, 1) for s in lines]

In [5]: kv[0:2]
Out[5]: [['serve as the incoming', '92'], ['serve as the incubator', '99']]

In [6]: [len(s[0]) for s in kv]
Out[6]: [21, 22, 24, 18, 23, 22, 23, 26, 26, 23, 23]

In [7]: [len(md5(s[0]).digest()) for s in kv]
Out[7]: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]

For any key shorter than 16 bytes, hashing it would actually cost you more space.

Compressing the strings generally doesn't save space either, even if you ignore the header:

In [1]: import zlib

In [2]: zlib.compress('foo')[:3]
Out[2]: 'x\x9cK'

In [3]: zlib.compress('bar')[:3]
Out[3]: 'x\x9cK'

In [4]: s = 'serve as the indispensable'

In [5]: len(s)
Out[5]: 26

In [6]: len(zlib.compress(s))-3
Out[6]: 31
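The two sessions above are Python 2 transcripts. A minimal Python 3 sketch of the same comparison, using the sample keys from this answer, might look like the following. The helper name `sizes` is illustrative, and keys are encoded as UTF-8 because Python 3 hashes and compresses bytes rather than str.

```python
import zlib
from hashlib import md5

# Sample keys copied from the excerpt shown earlier in this answer.
keys = [
    "serve as the incoming",
    "serve as the incubator",
    "serve as the independent",
    "serve as the index",
    "serve as the indication",
    "serve as the indicator",
    "serve as the indicators",
    "serve as the indispensable",
    "serve as the indispensible",
    "serve as the individual",
    "serve as the industrial",
]


def sizes(key):
    """Return (raw, md5 digest, zlib) sizes in bytes for one key."""
    raw = key.encode("utf-8")
    return len(raw), len(md5(raw).digest()), len(zlib.compress(raw))


rows = [sizes(k) for k in keys]
avg_raw = sum(r[0] for r in rows) / len(rows)
larger = sum(1 for r in rows if r[2] > r[0])

print("average raw key length: %.1f bytes" % avg_raw)
print("md5 digest length: %d bytes" % rows[0][1])
print("keys that zlib makes larger: %d of %d" % (larger, len(rows)))
```

Unlike the session above, this sketch counts the full zlib output, header included.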

score 0 · Accepted Answer

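Instead of writing a command file, consider pipelining and multiprocessing. Pipelining in redis-py is quite simple. You will need to run tests to find the ideal chunk size.

For an example of redis-py, multiprocessing and pipelining, take a look at this example gist.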
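As a rough illustration of the pipelining half of this advice (it is not the gist referred to above, and it leaves multiprocessing out), a batched loader could be sketched as follows. The function name `load_pairs` and the default `batch_size` are assumptions; `client` is assumed to be a redis-py client, whose `pipeline(transaction=False)`, `set` and `execute` calls are the only ones used.

```python
def load_pairs(client, pairs, batch_size=10000):
    """Send SET commands through a non-transactional pipeline in batches.

    `client` is assumed to be a redis-py client, e.g. redis.Redis(...).
    `pairs` is any iterable of (key, value) tuples.
    Returns the number of pairs written.
    """
    pipe = client.pipeline(transaction=False)
    pending = 0
    total = 0
    for key, value in pairs:
        pipe.set(key, value)
        pending += 1
        if pending >= batch_size:
            pipe.execute()  # one round trip for the whole batch
            total += pending
            pending = 0
    if pending:
        pipe.execute()      # flush the final partial batch
        total += pending
    return total


# Usage sketch (needs the redis package and a running server):
#   import redis
#   client = redis.Redis(host="localhost", port=6379, db=0)
#   load_pairs(client, [("serve as the index", 223)], batch_size=10000)
```

The batch size trades client-side buffering against round trips, which is why the answer suggests measuring a few values rather than picking one up front.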

score 0 · Accepted Answer

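I would definitely use hashes, because top-level keys carry overhead: they store extra data that you may not need (e.g. TTL...).

The redis.io site also has some performance tips, and a while back Jeremy Zawodny stored 1.2 billion key/value pairs.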
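One possible reading of this suggestion is to group many pairs into a smaller number of Redis hashes instead of creating one top-level key per n-gram. The sketch below is an assumption about how that could be done, not something spelled out in this answer: `bucket_for`, `store`, `fetch`, the `ngram:` prefix and the bucket count are all hypothetical, and `client` is assumed to be a redis-py client exposing `hset` and `hget`.

```python
import zlib


def bucket_for(key, n_buckets=1000000, prefix="ngram:"):
    """Map a key to (hash_name, field) so that many pairs share one hash.

    `n_buckets` and `prefix` are illustrative choices, not values
    taken from this thread.
    """
    raw = key.encode("utf-8")
    bucket = zlib.crc32(raw) % n_buckets  # stable across runs and processes
    return "%s%d" % (prefix, bucket), key


def store(client, key, value):
    """Write one pair as a field of its bucket hash."""
    name, field = bucket_for(key)
    client.hset(name, field, value)


def fetch(client, key):
    """Read one pair back from its bucket hash."""
    name, field = bucket_for(key)
    return client.hget(name, field)
```

Whether this actually saves memory depends on how the server encodes small hashes, so the bucket count would need to be tuned and measured on a sample of the data first.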
