java - 在一个大文件中找到两个具有相同哈希码的单词 - 奇怪的输出

Question

我试图/usr/share/dict/words在 Ubuntu 12.04 上找到两个具有相同哈希码的单词。

试图保持Map<Integer, HashSet<String>>。

读完单词后计算他的哈希码h并将单词放入键为的集合中h。

然后遍历所有键并打印大小> 1的集合。

但是我在运行后看到了非常奇怪的输出。

代码：

public static void main(String[] args) throws FileNotFoundException {
        HashSet<String> fileWords = new HashSet<>();
        Map<Integer, HashSet<String>> duplicats = new HashMap<>();
        Scanner scan = new Scanner(new File("/usr/share/dict/words"));

        while (scan.hasNext()) {
            String word = scan.nextLine();
            int h = word.hashCode();
            fileWords.add(word);
            duplicats.put(new Integer(h), fileWords);
        }

        Set<Integer> keySet = duplicats.keySet();
        for (Integer key : keySet) {
            HashSet<String> value = duplicats.get(key);
            if (value.size() > 1) {
                System.out.println(key + " : " + value.toString());
            }
        }
    }

输出：

21917608 : [repaying, Zubenelgenubi, treason, indignation, eyetooth, ....// a lot of words

它看起来很奇怪。我不知道出了什么问题？

更新：

我找到了解决方案：

public static void main(String[] args) throws FileNotFoundException {
    Map<Integer, HashSet<String>> duplicats = new HashMap<>();
    Scanner scan = new Scanner(new File("/usr/share/dict/words"));

    while (scan.hasNext()) {
        String word = scan.nextLine();
        int h = word.hashCode();

        if (!duplicats.containsKey(h)) 
        {
            HashSet<String> newSet = new HashSet<>();
            newSet.add(word);
            duplicats.put(new Integer(h), newSet);
        } 
        else 
        {
            duplicats.get(h).add(word);
        }
    } /// rest the same

如何解决这个麻烦？

score 1 · Accepted Answer

HashSet<String> fileWords = new HashSet<>();

您只实例化一个集合并将所有单词添加到其中。

您必须添加以下逻辑：

检查您当前的哈希码键下是否已经有一个集合；
如果有，只需添加单词即可；
如果没有，则创建一个新集合，添加单词并将其放入地图中。

按照现在的方式，您将相同的集合放在所有地图键下。

score 1 · Accepted Answer

我不太了解您的代码的目的，但是duplicats您将每个映射到文件（）中hashCode的所有 s 的集合。然后显示它。以下代码按预期工作。StringfileWords

public static void main(String[] args) throws FileNotFoundException {

    Map<Integer,HashSet<String>> duplicats= new HashMap<Integer, HashSet<String>>() ;
    Scanner scan = new Scanner(new File("C:\\Downloads\\Software\\sourceforge.net\\souptonuts\\dictionary\\linuxwords.1\\linux.words"));

    while( scan.hasNext() ) {
        String word= scan.nextLine() ;
        int hc= new Integer( word.hashCode() ) ;
        HashSet<String> count= duplicats.get( hc ) ;
        if( count == null ) {
            count= new HashSet<String>() ;
            duplicats.put(hc, count ) ;
        }
        count.add( word );
    }

    int nonCollisionHashCodes= 0 ;
    int singleCollisionHashCodes= 0 ;
    int doubleCollisionHashCodes= 0 ;
    for(Entry<Integer, HashSet<String>> e : duplicats.entrySet() ) {
        if( e.getValue().size() <= 1 ) {
            nonCollisionHashCodes++;
        } else if( e.getValue().size() <= 2 ) {
            singleCollisionHashCodes++;
        } else if( e.getValue().size() <= 3 ) {
            doubleCollisionHashCodes++;
        } else {
            System.out.println(e.getKey() + " : " + e.getValue().size());
        }
    }
    System.out.println("Number of non-collision hashCodes: "+ nonCollisionHashCodes );
    System.out.println("Number of single-collision hashCodes: "+ singleCollisionHashCodes );
    System.out.println("Number of double-collision hashCodes: "+ doubleCollisionHashCodes );
}

至少对于我的字典，输出是：

Number of non-collision hashCodes: 626167
Number of single-collision hashCodes: 885
Number of double-collision hashCodes: 6

请注意，没有超过双重冲突哈希码的输出。

根据我的口味，这些统计数据非常好。用你的字典试试并发布你的结果。

java - 在一个大文件中找到两个具有相同哈希码的单词 - 奇怪的输出

2 回答 2

Related

Reference