python - Safe enough 8-character short unique random string

Question

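I am trying to compute 8-character short unique random filenames for, let's say, thousands of files without probable name collision. Is this method safe enough?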

base64.urlsafe_b64encode(hashlib.md5(os.urandom(128)).digest())[:8]

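Edit

To be clearer, I am trying to achieve the simplest possible obfuscation of filenames being uploaded to a storage.

I figured that an 8-character string, random enough, would be a very efficient and simple way to store tens of thousands of files without probable collision, when implemented right. I don't need guaranteed uniqueness, only a high enough improbability of name collision (talking about only thousands of names).

Files are being stored in a concurrent environment, so incrementing a shared counter is achievable, but complicated. Storing the counter in a database would be inefficient.

I am also facing the fact that random() under some circumstances returns the same pseudorandom sequences in different processes.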

score 59 · Accepted Answer

Your current method should be safe enough, but you could also take a look at the uuid module, e.g.

import uuid

print(str(uuid.uuid4())[:8])

Output:

ef21b9ad
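
For a rough sense of what "safe enough" means here, the birthday approximation estimates the chance of at least one collision among n random names. This is only a sketch: collision_probability is a hypothetical helper, and it assumes every name is drawn uniformly and independently. Note that 8 hex characters cut from a UUID carry only 32 bits.

```python
import math


def collision_probability(n_names, alphabet_size, length=8):
    """Approximate chance that at least two of n_names random names coincide."""
    space = alphabet_size ** length
    # Birthday approximation: p ~= 1 - exp(-n * (n - 1) / (2 * space))
    return -math.expm1(-n_names * (n_names - 1) / (2 * space))


for label, size in [("hex (truncated uuid4)", 16),
                    ("a-z0-9", 36),
                    ("URL-safe base64", 64)]:
    print(f"{label}: {collision_probability(10_000, size):.2e}")
```

Under these assumptions, 10,000 names from the hex alphabet collide with a probability of roughly 1%, while larger alphabets are several orders of magnitude safer.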

score 42 · Accepted Answer

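Which method has fewer collisions, is faster and easier to read?

TLDR

random_choice is the fastest and has few collisions, but is IMO slightly harder to read.

The most readable is shortuuid_random, but it is an external dependency, is slightly slower and has 6x the collisions.

The methods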


alphabet = string.ascii_lowercase + string.digits
su = shortuuid.ShortUUID(alphabet=alphabet)

def random_choice():
    return ''.join(random.choices(alphabet, k=8))

def truncated_uuid4():
    return str(uuid.uuid4())[:8]

def shortuuid_random():
    return su.random(length=8)

def secrets_random_choice():
    return ''.join(secrets.choice(alphabet) for _ in range(8))

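Results

All methods generate 8-character IDs from the abcdefghijklmnopqrstuvwxyz0123456789 alphabet. Collisions are counted in a single run of 10 million draws. Time is reported in seconds as average function execution ± standard deviation, both calculated over 100 runs of 1,000 draws. Total time is the total execution time of the collision test.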

random_choice: collisions 22 - time (s) 0.00229 ± 0.00016 - total (s) 29.70518
truncated_uuid4: collisions 11711 - time (s) 0.00439 ± 0.00021 - total (s) 54.03649
shortuuid_random: collisions 124 - time (s) 0.00482 ± 0.00029 - total (s) 51.19624
secrets_random_choice: collisions 15 - time (s) 0.02113 ± 0.00072 - total (s) 228.23106
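
As a sanity check on these counts, here is a minimal sketch of the number of repeats one would expect from uniform sampling. expected_collisions is a hypothetical helper, not part of the benchmark, and it assumes each of the 8 characters is drawn uniformly and independently.

```python
import math


def expected_collisions(n_draws, alphabet_size, length=8):
    """Expected number of draws that repeat an earlier value (uniform sampling)."""
    space = alphabet_size ** length
    # Expected distinct values after n draws: space * (1 - (1 - 1/space) ** n),
    # written with log1p/expm1 to stay accurate when 1/space is tiny.
    distinct = -space * math.expm1(n_draws * math.log1p(-1.0 / space))
    return n_draws - distinct


print(round(expected_collisions(10_000_000, 36), 1))  # a-z0-9 alphabet
print(round(expected_collisions(10_000_000, 16), 1))  # hex, as in truncated_uuid4
```

This gives about 18 repeats for the 36-character alphabet and about 11,600 for hex, which is in line with the random_choice, secrets_random_choice and truncated_uuid4 rows. It does not explain the higher shortuuid_random count, since the estimate only covers uniform sampling.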

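Notes

The default shortuuid alphabet has uppercase characters, hence creating fewer collisions. To make it a fair comparison, we need to select the same alphabet as the other methods.
The secrets methods token_hex and token_urlsafe, while possibly faster, have different alphabets and hence are not eligible for the comparison.
The alphabet and the class-based shortuuid instance are factored out as module variables, which speeds up the method execution. This should not affect the TLDR.

Full testing details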

import random
import secrets
from statistics import mean
from statistics import stdev
import string
import time
import timeit
import uuid

import shortuuid


alphabet = string.ascii_lowercase + string.digits
su = shortuuid.ShortUUID(alphabet=alphabet)


def random_choice():
    return ''.join(random.choices(alphabet, k=8))


def truncated_uuid4():
    return str(uuid.uuid4())[:8]


def shortuuid_random():
    return su.random(length=8)


def secrets_random_choice():
    return ''.join(secrets.choice(alphabet) for _ in range(8))


def test_collisions(fun):
    out = set()
    count = 0
    for _ in range(10_000_000):
        new = fun()
        if new in out:
            count += 1
        else:
            out.add(new)
    return count


def run_and_print_results(fun):
    round_digits = 5
    now = time.time()
    collisions = test_collisions(fun)
    total_time = round(time.time() - now, round_digits)

    trials = 1_000
    runs = 100
    func_time = timeit.repeat(fun, repeat=runs, number=trials)
    avg = round(mean(func_time), round_digits)
    std = round(stdev(func_time), round_digits)

    print(f'{fun.__name__}: collisions {collisions} - '
          f'time (s) {avg} ± {std} - '
          f'total (s) {total_time}')


if __name__ == '__main__':
    run_and_print_results(random_choice)
    run_and_print_results(truncated_uuid4)
    run_and_print_results(shortuuid_random)
    run_and_print_results(secrets_random_choice)

score 27 · Accepted Answer

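Is there a reason you can't use tempfile to generate the names?

Functions like mkstemp and NamedTemporaryFile are absolutely guaranteed to give you unique names; nothing based on random bytes is going to give you that.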
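
A minimal sketch of that approach, assuming the files live in a local directory you control. reserve_unique_path and the ".dat" suffix are made up for illustration.

```python
import os
import tempfile


def reserve_unique_path(directory, suffix=""):
    """Atomically create an empty file with a unique name and return its path."""
    # mkstemp creates the file exclusively, so an existing name is never reused.
    fd, path = tempfile.mkstemp(suffix=suffix, dir=directory)
    os.close(fd)  # only the name is needed here; reopen the path to write data
    return path


with tempfile.TemporaryDirectory() as upload_dir:
    print(os.path.basename(reserve_unique_path(upload_dir, suffix=".dat")))
```

Unlike NamedTemporaryFile, a file made by mkstemp is not deleted automatically, which is what you want for stored uploads.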

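If for some reason you don't actually want the file created yet (e.g., you're generating filenames to be used on some remote server or something), you can't be perfectly safe, but mktemp is still safer than random names.

Or just keep a 48-bit counter stored in some "global enough" location, so you are guaranteed to go through the full cycle of names before a collision, and you are also guaranteed to know when a collision is going to happen.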
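
A 48-bit counter happens to fit exactly into 8 base64 characters (6 bytes). The sketch below shows one possible encoding; counter_to_name and name_to_counter are hypothetical helpers, and how the counter itself is stored and incremented is left out.

```python
import base64


def counter_to_name(counter):
    """Encode a 48-bit counter value as an 8-character URL-safe name."""
    if not 0 <= counter < 2 ** 48:
        raise ValueError("counter must fit in 48 bits")
    # 6 bytes encode to exactly 8 base64 characters, with no '=' padding.
    return base64.urlsafe_b64encode(counter.to_bytes(6, "big")).decode("ascii")


def name_to_counter(name):
    """Recover the counter value from a name made by counter_to_name."""
    return int.from_bytes(base64.urlsafe_b64decode(name), "big")


print(counter_to_name(0), counter_to_name(1), counter_to_name(2 ** 48 - 1))
```

Names produced this way are sequential and therefore predictable, so this only fits if obfuscation is not a requirement.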

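They're all safer, simpler, and much more efficient than reading urandom and doing an md5.

If you really do want to generate random names, ''.join(random.choice(my_charset) for _ in range(8)) is also going to be simpler than what you're doing, and more efficient. Even urlsafe_b64encode(os.urandom(6)) is just as random as the MD5 hash, and simpler and more efficient.

The only benefit of cryptographic randomness and/or a cryptographic hash function is in avoiding predictability. If that's not an issue for you, why pay for it? And if you do need to avoid predictability, you almost certainly need to avoid races and other much simpler attacks, so avoiding mkstemp or NamedTemporaryFile is a very bad idea.

Not to mention that, as Root points out in a comment, if you need security, MD5 doesn't actually provide it.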

score 27 · Accepted Answer

You can try the shortuuid library.

Install with: pip install shortuuid

Then it is as simple as:

> import shortuuid
> shortuuid.uuid()
'vytxeTZskVKR7C7WgdSP3d'

score 6 · Accepted Answer

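From Python 3.6 you should probably use the secrets module. secrets.token_urlsafe() seems to work well for your case, and it is guaranteed to use a cryptographically secure random source.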
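
A minimal sketch of how this could look, assuming the files end up on a local filesystem. create_with_random_name and its attempts parameter are hypothetical. Opening with mode "x" makes the rare collision visible instead of silently overwriting a file.

```python
import os
import secrets
import tempfile


def create_with_random_name(directory, attempts=5):
    """Create an empty file with a random 8-character name and return its path."""
    for _ in range(attempts):
        name = secrets.token_urlsafe(6)  # 6 random bytes -> 8 URL-safe characters
        path = os.path.join(directory, name)
        try:
            # Mode "x" raises FileExistsError if the name is already taken.
            with open(path, "x"):
                pass
        except FileExistsError:
            continue  # extremely unlikely; just draw another name
        return path
    raise RuntimeError("could not find a free name")


with tempfile.TemporaryDirectory() as upload_dir:
    print(os.path.basename(create_with_random_name(upload_dir)))
```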

score 2 · Accepted Answer

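I am using hashids to convert a timestamp into a unique id. (You can even convert it back to a timestamp if you want.)

The drawback is that if you create ids too fast, you will get a duplicate. But if you are generating them with some time in between, then this is an option.

Here is an example: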

from hashids import Hashids
from datetime import datetime
hashids = Hashids(salt = "lorem ipsum dolor sit amet", alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890")
print(hashids.encode(int(datetime.today().timestamp()))) #'QJW60PJ1' when I ran it

score 1 · Accepted Answer

You can try this

import random
uid_chars = ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u',
             'v', 'w', 'x', 'y', 'z','1','2','3','4','5','6','7','8','9','0')
uid_length=8
def short_uid():
    count=len(uid_chars)-1
    c=''
    for i in range(0,uid_length):
        c+=uid_chars[random.randint(0,count)]
    return c

For example:

print(short_uid())
nogbomcv

score 1 · Accepted Answer

Fastest deterministic method

import random
import binascii
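seed = 12345  # illustrative value; any fixed seed gives a repeatable sequence
e = random.Random(seed)
# Use the seeded instance; the module-level random.getrandbits would ignore the seed.
binascii.b2a_base64(e.getrandbits(48).to_bytes(6, 'little'), newline=False)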

Fastest system random method

import os
import binascii
binascii.b2a_base64(os.urandom(6), newline=False)

URL-safe methods

Using os.urandom

import os
import base64
base64.urlsafe_b64encode(os.urandom(6)).decode()

Using random.Random.choices (slow, but flexible)

import random
import string
alphabet = string.ascii_letters + string.digits + '-_'
''.join(random.choices(alphabet, k=8))

Using random.Random.getrandbits (faster than random.Random.randbytes)

import random
import base64
base64.urlsafe_b64encode(random.getrandbits(48).to_bytes(6, 'little')).decode()

Using random.Random.randbytes (python >= 3.9)

import random
import base64
base64.urlsafe_b64encode(random.randbytes(6)).decode()

Using random.SystemRandom.randbytes (python >= 3.9)

import random
import base64
e = random.SystemRandom()
base64.urlsafe_b64encode(e.randbytes(6)).decode()

random.SystemRandom.getrandbits is not recommended if python >= 3.9, because it takes 2.5x the time of random.SystemRandom.randbytes and is more complicated.

Using secrets.token_bytes (python >= 3.6)

import secrets
import base64
base64.urlsafe_b64encode(secrets.token_bytes(6)).decode()

Using secrets.token_urlsafe (python >= 3.6)

import secrets
secrets.token_urlsafe(6) # 6 byte base64 has 8 char

Further discussion

secrets.token_urlsafe implementation in python3.9

tok = token_bytes(nbytes)
base64.urlsafe_b64encode(tok).rstrip(b'=').decode('ascii')

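Since .decode() on ASCII bytes is faster than .decode('ascii'), and .rstrip(b'=') is useless when the output has no padding (any nbytes that is a multiple of 3, which includes nbytes % 6 == 0), base64.urlsafe_b64encode(secrets.token_bytes(nbytes)).decode() is faster (~20%).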
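
A small sketch to try this comparison on your own machine. token_urlsafe_fast is a hypothetical name, and the timings will differ by platform and Python version, so no expected numbers are given.

```python
import base64
import secrets
import timeit


def token_urlsafe_fast(nbytes=6):
    """Like secrets.token_urlsafe, but without stripping '=' padding."""
    # Only equivalent when nbytes is a multiple of 3 (this covers the
    # nbytes % 6 == 0 case); otherwise the result ends with '=' padding.
    return base64.urlsafe_b64encode(secrets.token_bytes(nbytes)).decode()


if __name__ == "__main__":
    for label, fn in [("secrets.token_urlsafe", lambda: secrets.token_urlsafe(6)),
                      ("token_urlsafe_fast", token_urlsafe_fast)]:
        best = min(timeit.repeat(fn, repeat=5, number=20_000))
        print(f"{label}: {best:.4f} s per 20,000 calls")
```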

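On Windows 10, the bytes-based methods are 2x faster when nbytes=6 (8 char), and 5x faster when nbytes=24 (32 char).

On Windows 10 (my laptop), secrets.token_bytes takes similar time to random.Random.randbytes, and base64.urlsafe_b64encode takes more time than the random bytes generation.

On Ubuntu 20.04 (my cloud server, which may lack entropy), secrets.token_bytes takes 15x more time than random.Random.randbytes, but similar time to random.SystemRandom.randbytes.

Since secrets.token_bytes uses random.SystemRandom.randbytes, which in turn uses os.urandom (so they are exactly the same), you may replace secrets.token_bytes with os.urandom if performance is crucial.

In Python 3.9, base64.urlsafe_b64encode is a combination of base64.b64encode and bytes.translate, and thus takes ~30% more time.

random.Random.randbytes(n) is implemented by random.Random.getrandbits(n * 8).to_bytes(n, 'little'), and is thus 3x slower. (However, random.SystemRandom.getrandbits is implemented with random.SystemRandom.randbytes.)

base64.b32encode is much slower (5x for 6 bytes, 17x for 480 bytes) than base64.b64encode, because there is a lot of python code in base64.b32encode, while base64.b64encode just calls binascii.b2a_base64 (implemented in C).

However, there is a python branch statement, if altchars is not None:, in base64.b64encode, which introduces non-negligible overhead when processing small data, so binascii.b2a_base64(data, newline=False) may be better.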
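
Putting the last two points together, a URL-safe name can be built directly from binascii.b2a_base64 plus bytes.translate. This is a sketch rather than a measured recommendation; encode_urlsafe, urlsafe_name and _URLSAFE are hypothetical names.

```python
import base64
import binascii
import os

# Maps the two non-URL-safe base64 characters to their URL-safe counterparts.
_URLSAFE = bytes.maketrans(b"+/", b"-_")


def encode_urlsafe(raw):
    """URL-safe base64 of raw bytes, without going through the base64 module."""
    return binascii.b2a_base64(raw, newline=False).translate(_URLSAFE).decode()


def urlsafe_name(nbytes=6):
    """Random name from the OS entropy source; 6 bytes give 8 characters."""
    return encode_urlsafe(os.urandom(nbytes))


sample = os.urandom(6)
# Same result as the standard library helper, just with fewer Python-level calls.
print(encode_urlsafe(sample) == base64.urlsafe_b64encode(sample).decode())
print(urlsafe_name())
```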