algorithm - 根据字符集对单词进行聚类

Question

假设有一个词集，我想根据它们的字符包（多集）对它们进行聚类。例如

{茶，吃，阿爸，阿阿布，你好}

将被聚集成

{{茶，吃}，{abba，aabb}，{你好}}。

abba并且aabb聚集在一起，因为它们具有相同的字符包，即 twoa和 two b。

为了提高效率，我能想到的一种天真的方法是将每个单词转换为 char-cnt 系列，例如，abba将aabb两者转换为a2b2，tea/eat 将转换为a1e1t1. 这样我就可以构建字典并使用相同的键对单词进行分组。

这里有两个问题：首先我必须对字符进行排序以构建密钥；其次，字符串键看起来很别扭，性能不如 char/int 键。

有没有更有效的方法来解决问题？

score 2 · Accepted Answer

为了检测字谜，您可以使用基于素数乘积的散列方案A->2, B->3, C->5等。将给出“abba”==“aabb”== 36（但与素数映射不同的字母会更好）参见我的答案here。

score 1 · Accepted Answer

使用 x 个字符的字母表和 y 的最大字长，您可以创建 (x + y) 位的散列，这样每个字谜都有一个唯一的散列。位的值为 1 表示当前有另一个字母，值为 0 表示移动到下一个字母。这是一个显示其工作原理的示例：

假设我们有一个 7 个字母的字母表 (abcdefg)，最大字长为 4。每个字哈希都是 11 位。让我们对“fade”这个词进行哈希处理：10001010100

第一位是 1，表示有一个礼物。第二位表示没有更多的a。第三位表示没有更多的 b，依此类推。考虑这一点的另一种方法是，一行中的 1 的数量表示该字母的数量，而该字符串之前的总零表示它是哪个字母。

这是“dada”的哈希值：11000110000

值得注意的是，由于可能的哈希和可能的字谜之间存在一一对应关系，因此这是保证为任何输入提供唯一哈希的最小可能哈希，这样就无需在完成后检查存储桶中的所有内容散列。

我很清楚使用大字母和长单词会导致散列大小变大。该解决方案旨在保证唯一的哈希值，以避免比较字符串。如果您可以设计一种算法来在恒定时间内计算此哈希（假设您知道 x 和 y 的值），那么您将能够在 O(n) 中解决整个分组问题。

score 1 · Accepted Answer

Since you are going to sort words, I assume all characters ascii values are in the range 0-255. Then you can do a Counting Sort over the words.

The counting sort is going to take the same amount of time as the size of the input word. Reconstruction of the string obtained from counting sort will take O(wordlen). You cannot make this step less than O(wordLen) because you will have to iterate the string at least once ie O(wordLen). There is no predefined order. You cannot make any assumptions about the word without iterating though all the characters in that word. Traditional sorting implementations(ie comparison based ones) will give you O(n * lg n). But non comparison ones give you O(n).

Iterate over all the words of the list and sort them using our counting sort. Keep a map of sorted words to the list of known words they map. Addition of elements to a list takes constant time. So overall the complexity of the algorithm is O(n * avgWordLength).

Here is a sample implementation

import java.util.ArrayList;


public class ClusterGen {

    static String sortWord(String w) {
        int freq[] = new int[256];

        for (char c : w.toCharArray()) {
            freq[c]++;
        }
        StringBuilder sortedWord = new StringBuilder();
        //It is at most O(n)
        for (int i = 0; i < freq.length; ++i) {
            for (int j = 0; j < freq[i]; ++j) {
                sortedWord.append((char)i);
            }
        }
        return sortedWord.toString();
    }

    static Map<String, List<String>> cluster(List<String> words) {
        Map<String, List<String>> allClusters = new HashMap<String, List<String>>();

        for (String word : words) {
            String sortedWord = sortWord(word);
            List<String> cluster = allClusters.get(sortedWord);
            if (cluster == null) {
                cluster = new ArrayList<String>();
            }
            cluster.add(word);
            allClusters.put(sortedWord, cluster);
        }

        return allClusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList("tea", "eat", "abba", "aabb", "hello")));
        System.out.println(cluster(Arrays.asList("moon", "bat", "meal", "tab", "male")));

    }
}

Returns

{aabb=[abba, aabb], ehllo=[hello], aet=[tea, eat]}
{abt=[bat, tab], aelm=[meal, male], mnoo=[moon]}

score 0 · Accepted Answer

Count the frequency of characters in each of the strings then build a hash table based on the frequency table. so for an example, for string aczda and aacdz we get 20110000000000000000000001. Using hash table we can partition all these strings in buckets in O(N).

score 0 · Accepted Answer

26 位整数作为散列函数

如果您的字母表不是太大，例如，只是小写英文字母，您可以为每个单词定义这个特定的哈希函数：一个 26 位整数，其中每个位表示该英文字母是否存在于单词中。请注意，具有相同字符集的两个单词将具有相同的哈希值。

然后只需将它们添加到哈希表中。它将自动通过哈希冲突进行聚类。

计算哈希需要花费O(max length of the word)时间，并且插入哈希表是常数时间。所以整体复杂度是O(max length of a word * number of words)

score 0 · Accepted Answer

我将分两步执行此操作，首先根据它们的长度对所有单词进行排序，然后分别处理每个子集（这是为了避免以后出现大量重叠。）

下一步更难，有很多方法可以做到。最简单的方法之一是为每个字母分配一个数字（例如 a = 1、b = 2 等）并将每个单词的所有值相加，从而将每个单词分配给一个整数。然后你可以根据这个整数值对单词进行排序，这大大减少了你必须比较的数字。

根据您的数据集，您可能仍然有很多重叠（“bad”和“cac”会生成相同的整数散列），因此您可能需要设置一个阈值，如果一个桶中有太多单词，您可以重复前一个使用另一个哈希（只是为字母分配不同的数字）除非有人查看了您的代码并设计了一个单词列表来搞乱您，否则这应该将重叠减少到几乎没有。

请记住，当您期望少量单词在同一个字符包中时，这种方法会很有效。如果您的数据是很多只进入几个字符包的长词，那么您在最后一步中进行的比较次数将是天文数字，在这种情况下，您最好使用您描述的方法- 没有可能的重叠。

score 0 · Accepted Answer

我做过的一件事与此类似，但允许发生冲突，即对字母进行排序，然后删除重复项。因此，在您的示例中，您将拥有“aet”、“ab”和“ehlo”的存储桶。

现在，正如我所说，这允许碰撞。所以“杆”和“门”最终都在同一个桶中，这可能不是你想要的。但是，碰撞将是一个易于快速搜索的小集合。

因此，一旦您有了存储桶的字符串，您就会注意到您可以将其转换为 32 位整数（至少对于 ASCII 而言）。字符串中的每个字母都变成 32 位整数中的一个位。所以“a”是第一位，“b”是第二位，依此类推。所有（英文）单词构成一个具有 26 位标识符的存储桶。然后，您可以进行非常快速的整数比较以找到新单词进入的存储桶，或者找到现有单词所在的存储桶。

algorithm - 根据字符集对单词进行聚类

7 回答 7

Related

Reference