algorithm - 大量可能性的最佳搜索技术

Question

我正在尝试在文本文件中搜索大量可能性。

例如，我想搜索一个包含唯一名称的文本文件。现在，如果我找到名称 X，那么我想将 X 存储在另一个文件中。

这里的问题是我有超过 1000 个唯一名称，我不想为每个唯一名称执行 1000 个搜索调用和 if 语句。

在 java/javascript/php 中有没有更好的方法呢？

score 2 · Accepted Answer

您有一组名称，并且您想查找哪些名称与另一组名称匹配。

Set<String> namesFromFile = readFile(filename);
Set<String> namesToMatch = readFile(matchingNames);
namesToMatch.retainAll(namedFromFile);

这retainAll是一个 O(n) 操作，其中n是较小集合的大小。在 JavaretainAll中，一组 1000 个值可能需要几毫秒。

Set.retainAll()执行以下操作

仅保留此集合中包含在指定集合中的元素（可选操作）。换句话说，从这个集合中移除所有不包含在指定集合中的元素。如果指定的集合也是一个集合，则此操作有效地修改此集合，使其值是两个集合的交集。

1000 个元素的集合太小了，很难准确测试，所以在这个测试中，我测试一个大 10 倍的元素，即 10,000 个元素与 100,000 个元素的集合。

public static void main(String... args) {
    Set<String> names1 = generateStrings(100000, 2);
    Set<String> names2 = generateStrings(10000, 3);
    for (int i = 0; i < 10; i++) {
        long start = System.nanoTime();
        Set<String> intersect= new HashSet<String>(names2);
        intersect.retainAll(names1);
        long time = System.nanoTime() - start;
        System.out.printf("The intersect of %,d and %,d elements has %,d and took %.3f ms%n",
                names1.size(), names2.size(), intersect.size(), time / 1e6);
    }
}

private static Set<String> generateStrings(int number, int multiple) {
    Set<String> set = new HashSet<String>();
    for (int i = 0; i < number; i++)
        set.add(Integer.toBinaryString(i * multiple));
    return set;
}

印刷

The intersect of 100,000 and 10,000 elements has 5,000 and took 21.173 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 10.785 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 9.597 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 3.414 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.791 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.629 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.689 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.753 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.704 ms
The intersect of 100,000 and 10,000 elements has 5,000 and took 2.645 ms

algorithm - 大量可能性的最佳搜索技术

1 回答 1

Related

Reference