java - Java algorithm for evenly distributing ranges of strings into buckets

Question

Short version - I'm looking for a Java algorithm that given a String and an integer representing a number of buckets returns which bucket to place the String into.

Long version - I need to distribute a large number of objects into bins, evenly (or approximately evenly). The number of bins/buckets will vary, so the algorithm can't assume a particular number of bins. It may be 1, 30, or 200. The key for these objects will be a String.

The String has some predictable qualities that are important. The first 2 characters of the string actually appear to be a hex representation of a byte. i.e. 00-ff , and the strings themselves are quite evenly distributed within that range. There are a couple of outliers that start differently though, so this can't be relied on 100% (though easily 99.999%). This just means that edge cases do need to be handled.

It's critical that once all the strings have been distributed that there is zero overlap in range between what values appear in any 2 bins. So, that if I know what range of values appear in a bin, I don't have to look in any other bins to find the object. So for example, if I had 2 bins, it could be that bin 0 has Strings starting with letters a-m and bin 1 starting with n-z. However, that wouldn't satisfy the need for even distribution given what we know about the Strings.

Lastly, the implementation can have no knowledge of the current state of the bins. The method signature should literally be:

public int determineBucketIndex(String key, int numBuckets);

I believe that the foreknowledge about the distribution of the Strings should be sufficient.

EDIT: Clarifying for some questions Number of buckets can exceed 256. The strings do contain additional characters after the first 2, so this can be leveraged.

The buckets should hold a range of Strings to enable fast lookup later. In fact, that's why they're being binned to begin with. With only the knowledge of ranges, I should be able to look in exactly 1 bucket to see if the value is there or not. I shouldn't have to look in others.

Hashcodes won't work. I need the buckets to contain only String within a certain range of the String value (not the hash). Hashing would lose that.

EDIT 2: Apparently not communicating well. After bins have been chosen, these values are written out to files. 1 file per bin. The system that uses these files after binning is NOT Java. It's already implemented, and it needs values in the bins that fit within a range. I repeat, hashcode will not work. I explicitly said the ranges for strings cannot overlap between two bins, using hashcode cannot work.

score 1 · Accepted Answer

我已经阅读了你的问题两次，但我仍然不明白这些限制。因此，我在这里提出一个建议，您可以对此提供反馈。如果这不起作用，请解释原因。

首先，对 bin 的数量进行一些数学运算，以确定唯一 bin 编号需要多少位。取二进制数的以 2 为底的对数，然后取位数除以 8 的上限。这是您需要的数据字节数numBytes。

取前两个字母并将它们转换为一个字节。然后抓取numBytes - 1字符并将它们转换为字节。取字符的序数值（'A'变为 65，依此类推）。如果下一个字符可能是 Unicode，请选择一些规则将它们转换为字节......可能会获取最低有效字节（模数为 256）。获取numBytes字节总数，包括由前两个字母组成的字节，并转换为整数。使前两个字母的字节为整数的最低 8 位，下一个字节为接下来的 8 位，依此类推。现在只需将该值的模数乘以箱数，您就有一个整数箱号。

如果字符串太短并且没有更多字符可以转换为字节值，请使用0每个缺失的字符。

如果有任何可预测的字符（例如，第三个字符始终是空格），则不要使用这些字符；跳过他们。

现在，如果这对您不起作用，请解释原因，然后也许我们会很好地理解这个问题来回答它。

score 0 · Accepted Answer

对原始帖子进行 2 次更新后编辑的答案

从一开始就在您的问题中包含所有信息将是一个好主意 - 通过您的新编辑，您的描述已经为您提供了答案：将您的对象放入平衡树中（为您提供所需的同质分布）基于您的字符串的 hashCodesubstring(0,2)或类似的基于头部的东西。然后将 BTree 中的每个叶子（作为一组字符串）写入文件。

score 0 · Accepted Answer

我严重怀疑所描述的问题是否可以完美解决。这个怎么样：

创建 257 个垃圾箱。
将所有普通字符串放入 bin 0-255。
将所有异常值放入 bin 256。

除了“平均分配”之外，这不符合您的所有要求吗？

此时，如果你真的想要更均匀的分布，你可以将 0-255 的 bin 重新组织成更少的更均匀分布的 bin。但我认为你可能只需要减少那里的要求。

java - Java algorithm for evenly distributing ranges of strings into buckets

3 回答 3

Related

Reference