python - What is the best way to compress JSON for storage in a memory-based store like redis or memcache?

Question

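Requirement: Python objects with 2-3 levels of nesting, containing basic data types such as integers, strings, lists, and dicts (no dates etc.), need to be stored as JSON in redis against a key. What are the best methods for compressing the JSON string to keep the memory footprint low? The target objects are not very large: about 1000 small elements on average, or roughly 15000 characters when converted to JSON.

e.g.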

>>> my_dict
{'details': {'1': {'age': 13, 'name': 'dhruv'}, '2': {'age': 15, 'name': 'Matt'}}, 'members': ['1', '2']}
>>> json.dumps(my_dict)
'{"details": {"1": {"age": 13, "name": "dhruv"}, "2": {"age": 15, "name": "Matt"}}, "members": ["1", "2"]}'
### SOME BASIC COMPACTION ###
>>> json.dumps(my_dict, separators=(',',':'))
'{"details":{"1":{"age":13,"name":"dhruv"},"2":{"age":15,"name":"Matt"}},"members":["1","2"]}'

1/ Are there any other, better ways to compress JSON to save memory in redis (while also keeping decoding lightweight afterwards)?

2/ How good a candidate would msgpack [http://msgpack.org/] be?

3/ Should I consider options like pickle as well?

score 19 · Accepted Answer

We just use gzip as a compressor.

import gzip
import cStringIO  # Python 2 only; on Python 3 use io.BytesIO instead

def decompressStringToFile(value, outputFile):
  """
  decompress the given string value (which must be valid compressed gzip
  data) and write the result in the given open file.
  """
  stream = cStringIO.StringIO(value)
  decompressor = gzip.GzipFile(fileobj=stream, mode='r')
  while True:  # until EOF
    chunk = decompressor.read(8192)
    if not chunk:
      decompressor.close()
      outputFile.close()
      return 
    outputFile.write(chunk)

def compressFileToString(inputFile):
  """
  read the given open file, compress the data and return it as string.
  """
  stream = cStringIO.StringIO()
  compressor = gzip.GzipFile(fileobj=stream, mode='w')
  while True:  # until EOF
    chunk = inputFile.read(8192)
    if not chunk:  # EOF?
      compressor.close()
      return stream.getvalue()
    compressor.write(chunk)

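As you might imagine, in our use case we store the result as files. To work with in-memory strings only, you can use a cStringIO.StringIO() object in place of the file as well.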
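For the in-memory case just mentioned, a minimal Python 3 sketch might look like the following. It uses zlib (the same DEFLATE algorithm that gzip wraps) on the compact JSON encoding. The helper names pack and unpack are hypothetical, not part of the answer above, and the resulting bytes are what would be handed to whichever redis or memcache client is in use.

```python
import json
import zlib

def pack(obj):
    """Serialize obj to compact JSON and compress it; returns bytes."""
    text = json.dumps(obj, separators=(',', ':'))
    return zlib.compress(text.encode('utf-8'))

def unpack(blob):
    """Inverse of pack(): decompress the bytes and parse the JSON."""
    return json.loads(zlib.decompress(blob).decode('utf-8'))

# The example object from the question.
my_dict = {'details': {'1': {'age': 13, 'name': 'dhruv'},
                       '2': {'age': 15, 'name': 'Matt'}},
           'members': ['1', '2']}

blob = pack(my_dict)      # bytes, suitable as a value in a key-value store
restored = unpack(blob)   # back to the original Python object
```

For very small values the compressed form can be larger than the input, so it is worth checking sizes on real data before adopting this.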

score 10 · Answer

Based on @Alfe's answer above, here is a version that keeps the contents in memory (for network I/O tasks). I also made a few changes to support Python 3.

import gzip
from io import BytesIO

def decompressBytesToString(inputBytes):
  """
  decompress the given byte array (which must be valid 
  compressed gzip data) and return the decoded text (utf-8).
  """
  bio = BytesIO()
  stream = BytesIO(inputBytes)
  decompressor = gzip.GzipFile(fileobj=stream, mode='r')
  while True:  # until EOF
    chunk = decompressor.read(8192)
    if not chunk:
      decompressor.close()
      bio.seek(0)
      return bio.read().decode("utf-8")
    bio.write(chunk)

def compressStringToBytes(inputString):
  """
  read the given string, encode it in utf-8,
  compress the data and return it as a byte array.
  """
  bio = BytesIO()
  bio.write(inputString.encode("utf-8"))
  bio.seek(0)
  stream = BytesIO()
  compressor = gzip.GzipFile(fileobj=stream, mode='w')
  while True:  # until EOF
    chunk = bio.read(8192)
    if not chunk:  # EOF?
      compressor.close()
      return stream.getvalue()
    compressor.write(chunk)

To test the compression, try:

inputString="asdf" * 1000
len(inputString)
len(compressStringToBytes(inputString))
decompressBytesToString(compressStringToBytes(inputString))
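When the whole payload already fits in memory, the chunked loops are not strictly needed. As a shorter sketch for Python 3.2 and later, the standard library's gzip.compress and gzip.decompress do the same job in one call. The snake_case function names here are illustrative variants of the ones above.

```python
import gzip

def compress_string_to_bytes(input_string):
    """Encode the text as UTF-8 and gzip it in a single call."""
    return gzip.compress(input_string.encode("utf-8"))

def decompress_bytes_to_string(input_bytes):
    """Gunzip the bytes and decode them back to text (UTF-8)."""
    return gzip.decompress(input_bytes).decode("utf-8")

input_string = "asdf" * 1000
compressed = compress_string_to_bytes(input_string)
round_tripped = decompress_bytes_to_string(compressed)
print(len(input_string), len(compressed))
```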

score 4 · Answer

I did some extensive comparisons between different binary formats (MessagePack, BSON, Ion, Smile, CBOR) and compression algorithms (Brotli, Gzip, XZ, Zstandard, bzip2).

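For the JSON data I used for testing, keeping the data as JSON and compressing it with Brotli was the best solution. Brotli has different compression levels, so if you are going to keep the data for a long time, a high compression level can be worth the cost. If you are not keeping it for long, a lower compression level, or Zstandard, may be the most effective choice.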

Gzip is easy, but there will almost certainly be alternatives that are faster, or offer better compression, or both.
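A comparison of this kind is easy to repeat on your own data. The sketch below is limited to compressors that ship with Python's standard library (zlib for DEFLATE/gzip, bz2 for bzip2, lzma for XZ). Brotli and Zstandard need third-party packages and are left out here. The sample data is synthetic and only loosely shaped like the question's example, so the numbers it prints say nothing about the results described above.

```python
import bz2
import json
import lzma
import zlib

# Synthetic stand-in for real data (an assumption, not the answer's test set).
sample = {'details': {str(i): {'age': 10 + i % 50, 'name': 'user%d' % i}
                      for i in range(300)},
          'members': [str(i) for i in range(300)]}
raw = json.dumps(sample, separators=(',', ':')).encode('utf-8')

# Each entry maps a label to a (compress, decompress) pair.
codecs = {
    'zlib': (zlib.compress, zlib.decompress),
    'bz2': (bz2.compress, bz2.decompress),
    'lzma': (lzma.compress, lzma.decompress),
}

sizes = {}
for name, (compress, decompress) in codecs.items():
    blob = compress(raw)
    assert decompress(blob) == raw  # sanity check: lossless round trip
    sizes[name] = len(blob)

print('raw', len(raw))
for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(name, size)
```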

You can read the full details of our investigation here: blog post

score 3 · Answer

If you want it to be fast, try lz4. If you want it to compress better, go for lzma.

Are there any other, better ways to compress JSON to save memory in redis (while also keeping decoding lightweight afterwards)?

How good a candidate would msgpack [http://msgpack.org/] be?

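Msgpack is relatively fast and has a smaller memory footprint. But ujson is generally faster for me. You should compare them on your own data, measuring the compression and decompression speed as well as the compression ratio.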
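As a rough sketch of how such a measurement could be set up, the harness below times any pair of dumps/loads callables and records the encoded size. The function name measure, the repeat count, and the synthetic data are all assumptions made for illustration. Only standard-library serializers are wired in. Third-party ones such as msgpack or ujson could be added as further (name, dumps, loads) entries.

```python
import json
import pickle
import timeit

def measure(name, dumps, loads, obj, repeat=200):
    """Return (name, encoded size in bytes, dump seconds, load seconds)."""
    blob = dumps(obj)
    assert loads(blob) == obj  # make sure the round trip is lossless
    dump_time = timeit.timeit(lambda: dumps(obj), number=repeat)
    load_time = timeit.timeit(lambda: loads(blob), number=repeat)
    return name, len(blob), dump_time, load_time

# Synthetic data; replace with a representative object of your own.
data = {'details': {str(i): {'age': 10 + i % 50, 'name': 'user%d' % i}
                    for i in range(300)},
        'members': [str(i) for i in range(300)]}

candidates = [
    ('json',
     lambda o: json.dumps(o, separators=(',', ':')).encode('utf-8'),
     lambda b: json.loads(b.decode('utf-8'))),
    ('pickle',
     lambda o: pickle.dumps(o, protocol=pickle.HIGHEST_PROTOCOL),
     pickle.loads),
]

results = [measure(name, dumps, loads, data)
           for name, dumps, loads in candidates]
for name, size, dump_time, load_time in results:
    print('%-8s %6d bytes  dump %.4fs  load %.4fs'
          % (name, size, dump_time, load_time))
```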

Should I consider options like pickle as well?

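Consider both pickle (cPickle in particular) and marshal. They are fast. But remember that they are neither secure nor scalable, and you pay for the speed with added responsibility.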

score 2 · Answer

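One easy "post-processing" approach is to build a "short key name" map and run the generated JSON through it before storage, and again (in reverse) before deserializing back to an object. For example: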

Before: {"details":{"1":{"age":13,"name":"dhruv"},"2":{"age":15,"name":"Matt"}},"members":["1","2"]}
Map: details:d, age:a, name:n, members:m
Result: {"d":{"1":{"a":13,"n":"dhruv"},"2":{"a":15,"n":"Matt"}},"m":["1","2"]}

Just go through the JSON and replace key->value on the way to the database, and value->key on the way back to the application.

You can also gzip the result for extra savings (it won't be a string after that, though).
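The key-shortening step could be sketched as below. This variant renames keys on the parsed structure rather than doing a textual replace on the JSON string, so string values that happen to match a key name are left alone. The names KEY_MAP, REVERSE_MAP, and rename_keys are hypothetical. It also assumes the short names never collide with keys that occur in the data.

```python
import json

# Long key -> short key, taken from the example map above.
KEY_MAP = {'details': 'd', 'age': 'a', 'name': 'n', 'members': 'm'}
REVERSE_MAP = {short: long for long, short in KEY_MAP.items()}

def rename_keys(node, mapping):
    """Recursively rename dict keys found in mapping; other keys are kept."""
    if isinstance(node, dict):
        return {mapping.get(key, key): rename_keys(value, mapping)
                for key, value in node.items()}
    if isinstance(node, list):
        return [rename_keys(item, mapping) for item in node]
    return node  # scalars (int, str, ...) pass through unchanged

my_dict = {'details': {'1': {'age': 13, 'name': 'dhruv'},
                       '2': {'age': 15, 'name': 'Matt'}},
           'members': ['1', '2']}

# On the way to the database: shorten keys, then serialize compactly.
stored = json.dumps(rename_keys(my_dict, KEY_MAP), separators=(',', ':'))

# On the way back to the application: parse, then restore the long keys.
restored = rename_keys(json.loads(stored), REVERSE_MAP)
```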

score 0 · Answer

Another possibility would be to use MongoDB's storage format, BSON.

You can find two Python implementations on the implementations page of that site.

Edit: why not just save the dictionary, and convert it to JSON on retrieval?
