只是为了在上面的问题中添加一些硬数字,我已经实现了一个小程序来按照问题中提到的方式生成 id,这些是其中一次试运行的结果:
Length Count MD5 Base 62
4 400 3d0e 446
5 925 4fc04 1Myi
6 2434 0e9368 40V6
7 29155 e406d96 GBFiA
8 130615 7ba787c8 2GOiCm
9 403040 75525d163 YNKjL9
10 1302992 e1b3581f52 H47JAIs
Total: 1869561
每次哈希的长度增加时,这是由于碰撞,因此在这种情况下发生了六次碰撞。计数是在冲突之前生成的给定长度的键的数量。MD5 列显示在出现重复键错误之前成功添加的最后一个截断键。最右边的列显示以 62 为基数的键(如果我已正确完成转换)。
看起来每个额外的字符都会生成越来越多的键,这是您可以想象的。我希望这种方法能够适用于用户生成的内容。
对于任何感兴趣的人,以下是我如何获得最后一列的 base 62 表示:
def base_62_encode(input):
"Inspired by code at http://en.wikipedia.org/wiki/Base_36."
CLIST="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
rv = ""
while input != 0:
rv = CLIST[input % 62] + str(rv)
input /= 62
return rv
base62_id = base_62_encode(long(truncated_hash, 16))
(后来补充:)
这是一个在 Google App Engine 上下文中处理与这些 ID 生成相关的几件事的类。默认情况下,此类生成的密钥不是纯 62 进制的,因为 GAE 密钥名称的第一个字符必须是字母。该要求已在此处通过使用基数 52 作为第一个字符来解决。
通过更改已传递到(或从)构造函数中省略的“clist”值,可以将主基数更改为 62 以外的值。可能想要删除容易混淆的字符,例如“1”、“l”、“i”等。
用法:
keygen = LengthBackoffIdGenerator(SomeClass, initial_length=5)
small_id, modified = keygen.small_id(seed_value_from_sharded_counter)
这是课程:
class LengthBackoffIdGenerator(object):
"""Class to generate ids along the lines of tinyurl.
By default, ids begin with a alphabetic character. Each id that is created is checked against previous ones
to ensure uniqueness. When a duplicate is generated, the length of the seed value used is increased by one
character.
Ids become gradually longer over time while remaining relatively short, and there are very few retries in
relation to the number of keys generated.
"""
ids = set()
def __init__(self, cls, clist='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', alpha_first=False,
initial_length=5, check_db=False):
self.clist = clist
self.alpha_first = alpha_first
if self.alpha_first:
import re
alpha_list = re.sub(r'\d', '', clist)
if len(alpha_list) < 1:
raise Exception, 'At least one non-numeric character is required.'
self.alpha_list = alpha_list
self.cls = cls
self.length = initial_length
self.check_db = check_db
def divide_long_and_convert(self, radix, clist, input, rv):
"Inspired by code at http://en.wikipedia.org/wiki/Base_36."
rv += clist[input % radix]
input /= radix
return (input, rv)
def convert_long(self, input):
rv = ""
if self.alpha_first:
input, rv = self.divide_long_and_convert(len(self.alpha_list), self.alpha_list, input, rv)
while input != 0:
input, rv = self.divide_long_and_convert(len(self.clist), self.clist, input, rv)
return rv
def hash_truncate_and_binify(self, input, length):
"""Compute the MD5 hash of an input string, truncate it and convert it to a long.
input - A seed value.
length - The length to truncate the MD5 hash to.
"""
from assessment.core.functions import md5_hash
input = md5_hash(input)[0:length]
return long(input, 16)
def _small_id(self, input):
"""Create an ID that looks similar to a tinyurl ID.
The first letter of the ID will be non-numeric by default.
"""
return self.convert_long(self.hash_truncate_and_binify(input, self.length))
def small_id(self, seed):
key_name = self._small_id(seed)
increased_length = False
if self.check_db and not self.cls.get_by_key_name(key_name) is None:
self.ids.add(key_name)
if key_name in self.ids:
self.length += 1
increased_length = True
key_name = self._small_id(seed)
self.ids.add(key_name)
return (key_name, increased_length)