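I am trying to find the intersection of three hash sets of different sizes. Does the order in which the sets are intersected make any difference to how quickly the intersection is found? An example program follows: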

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class RetainTest {
    static Set<Integer> large =new HashSet<>();
    static Set<Integer> medium =new HashSet<>();
    static Set<Integer> small =new HashSet<>();

    static int largeSize=10000;
    static int midSize=5000;
    static int smallSize=1000;      

    public static void main(String[] args){
        preamble();
        large.retainAll(medium);
        large.retainAll(small);

        System.out.println(large.size());
    }


    public static void preamble(){
        large =new HashSet<>();
        medium =new HashSet<>();
        small =new HashSet<>();

        Random rnd=new Random(15);
        for(int i=0;i<largeSize;i++){
            large.add(rnd.nextInt(largeSize*10));
        }

        for(int i=0;i<midSize;i++){
            medium.add(rnd.nextInt(largeSize*10));
        }
        for(int i=0;i<smallSize;i++){
            small.add(rnd.nextInt(largeSize*10));
        }

    }

}

2 Answers

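Profiling suggests that the fastest way to intersect several sets is to call retainAll on the smaller set, passing the larger sets as arguments. In addition, those retainAll calls should run in order from smallest to largest argument. So: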

    small.retainAll(medium);
    small.retainAll(large);

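Profiling shows that the difference is substantial: for this data set, the slowest order took roughly 10 times as long as the fastest order.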
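
The smallest-to-largest rule extends to any number of sets. The following is a minimal sketch of that idea, not code from the original answer: the class name OrderedIntersection and the method intersect are hypothetical, and unlike the snippets above it works on a copy of the smallest set so that the inputs are left unmodified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the original answer.
class OrderedIntersection {

    /** Returns a new set holding the elements common to all given sets. */
    static <T> Set<T> intersect(List<Set<T>> sets) {
        if (sets.isEmpty()) {
            return new HashSet<>();
        }
        // Sort a copy by size so the caller's list is left untouched.
        List<Set<T>> bySize = new ArrayList<>(sets);
        bySize.sort(Comparator.comparingInt(Set::size));

        // Start from a copy of the smallest set, then retain against
        // the remaining sets in ascending size order.
        Set<T> result = new HashSet<>(bySize.get(0));
        for (int i = 1; i < bySize.size(); i++) {
            result.retainAll(bySize.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> large = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
        Set<Integer> medium = new HashSet<>(Arrays.asList(2, 4, 6, 8, 10));
        Set<Integer> small = new HashSet<>(Arrays.asList(4, 8, 12));

        // The argument order does not matter; intersect() sorts by size itself.
        System.out.println(intersect(Arrays.asList(large, medium, small)));
    }
}
```

Copying the smallest set costs one extra pass over it, which is presumably cheap next to the lookups; if mutating the smallest set is acceptable, the copy can be dropped.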

Test program

These results were produced with the following test program, which was run for 20 minutes:

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class RetainTest {
    
    static Set<Integer> large =new HashSet<>();
    static Set<Integer> medium =new HashSet<>();
    static Set<Integer> small =new HashSet<>();
    
    static int largeSize=10000;
    static int midSize=5000;
    static int smallSize=1000;      
    
    public static void main(String[] args){
        // Loops forever: meant to be run under a profiler and stopped manually.
        while(true){
            preamble();
            int size1=largeMediumSmall().size();
            preamble();
            int size2=largeSmallMedium().size();
            preamble();
            int size3=smallMediumLarge().size();
            preamble();
            int size4=smallLargeMedium().size();
            preamble();
            int size5=mediumSmallLarge().size();
            preamble();
            int size6=mediumLargeSmall().size();
            
            //sanity check + ensuring the JIT can't optimise out
            if (size1!=size2 || size1!=size3 || size1!=size4 || size1!=size5 || size1!=size6){
                System.out.println("bad");
            }
        }
        

    }
    
    public static Set<Integer> largeMediumSmall(){
        large.retainAll(medium);
        large.retainAll(small);
        
        return large;
    }
    
    public static Set<Integer> smallMediumLarge(){
        small.retainAll(medium);
        small.retainAll(large);
        
        return small;
    }
    public static Set<Integer> smallLargeMedium(){
        small.retainAll(large);
        small.retainAll(medium);
        
        return small;
    }
    public static Set<Integer> mediumSmallLarge(){
        medium.retainAll(small);
        medium.retainAll(large);
        
        return medium;
    }
    public static Set<Integer> mediumLargeSmall(){
        medium.retainAll(large);
        medium.retainAll(small);
        
        return medium;
    }
    public static Set<Integer> largeSmallMedium(){
        large.retainAll(small);
        large.retainAll(medium);
        
        return large;
    }
    
    
    public static void preamble(){
        large =new HashSet<>();
        medium =new HashSet<>();
        small =new HashSet<>();
        
        Random rnd=new Random(15);
        for(int i=0;i<largeSize;i++){
            large.add(rnd.nextInt(largeSize*10));
        }
        
        for(int i=0;i<midSize;i++){
            medium.add(rnd.nextInt(largeSize*10));
        }
        for(int i=0;i<smallSize;i++){
            small.add(rnd.nextInt(largeSize*10));
        }
        
    }
    
}
answered 2014-01-29T12:34:15.397

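The cost of a lookup in a hash set does not depend on the size of the set. setA.retainAll(setB) is carried out by iterating over setA and querying setB for each element (see the implementation of AbstractCollection.retainAll()). The total cost of this operation therefore depends linearly on the size of setA, so you should always iterate over the smallest set: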

small.retainAll(medium);
small.retainAll(large);

Richard Tingle's benchmark demonstrates this.
Edit: ah, Richard Tingle is also the author of the question :)
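
One way to check the claim that the receiver, not the argument, drives the cost is to count lookups directly. The sketch below is an illustration only: RetainAllCost, CountingSet, range, and lookupsFor are hypothetical names, and it assumes HashSet inherits retainAll from AbstractCollection, as it does in OpenJDK.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; CountingSet is an illustrative helper, not a JDK type.
class RetainAllCost {

    /** A HashSet that counts how often contains() is called on it. */
    static class CountingSet extends HashSet<Integer> {
        int lookups = 0;

        @Override
        public boolean contains(Object o) {
            lookups++;
            return super.contains(o);
        }
    }

    /** Builds a CountingSet holding the integers 0 .. n-1. */
    static CountingSet range(int n) {
        CountingSet set = new CountingSet();
        for (int i = 0; i < n; i++) {
            set.add(i);
        }
        return set;
    }

    /** Counts the contains() calls made on the argument of receiver.retainAll(argument). */
    static int lookupsFor(int receiverSize, int argumentSize) {
        Set<Integer> receiver = range(receiverSize);
        CountingSet argument = range(argumentSize);
        receiver.retainAll(argument);
        return argument.lookups;
    }

    public static void main(String[] args) {
        System.out.println("small.retainAll(large): " + lookupsFor(1000, 10000) + " lookups");
        System.out.println("large.retainAll(small): " + lookupsFor(10000, 1000) + " lookups");
    }
}
```

With the inherited retainAll, the first line should report 1000 lookups and the second 10000: one lookup per element of the receiver, regardless of how large the argument is.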

If you have just three sets and performance really matters, try finding the intersection in a single pass:

// Walk the smallest set once and drop anything missing from either other set.
Iterator<Integer> it = small.iterator();
while (it.hasNext()) {
    Integer e = it.next();
    if (!medium.contains(e) || !large.contains(e))
        it.remove();
}

As of Java 8:

small.removeIf(e -> !medium.contains(e) || !large.contains(e));
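
To confirm that the single-pass form gives the same result as two chained retainAll calls, a small self-contained check can be sketched as follows. SinglePassIntersection and its methods are hypothetical names; the set sizes and the seed 15 are taken from the question, and both variants work on copies so the inputs can be reused.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical comparison of the two approaches; not code from the original answers.
class SinglePassIntersection {

    /** Builds a set from the given number of random draws in [0, bound). */
    static Set<Integer> randomSet(Random rnd, int draws, int bound) {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < draws; i++) {
            set.add(rnd.nextInt(bound));
        }
        return set;
    }

    /** Two passes: retain against medium, then against large. */
    static Set<Integer> chained(Set<Integer> small, Set<Integer> medium, Set<Integer> large) {
        Set<Integer> result = new HashSet<>(small);
        result.retainAll(medium);
        result.retainAll(large);
        return result;
    }

    /** One pass: drop every element missing from medium or from large. */
    static Set<Integer> singlePass(Set<Integer> small, Set<Integer> medium, Set<Integer> large) {
        Set<Integer> result = new HashSet<>(small);
        result.removeIf(e -> !medium.contains(e) || !large.contains(e));
        return result;
    }

    public static void main(String[] args) {
        Random rnd = new Random(15);
        Set<Integer> large = randomSet(rnd, 10000, 100000);
        Set<Integer> medium = randomSet(rnd, 5000, 100000);
        Set<Integer> small = randomSet(rnd, 1000, 100000);

        Set<Integer> a = chained(small, medium, large);
        Set<Integer> b = singlePass(small, medium, large);
        System.out.println("same result: " + a.equals(b));
    }
}
```

Whether the single pass is measurably faster was not benchmarked here; it only avoids iterating over the intermediate result a second time.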
answered 2014-01-29T18:23:48.767