java - Java : Datastructure to stock lots of words

Question

I have to store a lot of words (200k+) in a Java program and I want to access them really fast. I just need to know whether a given word belongs to my "dictionary". I don't need a pair like <word, smthg>. If possible, I'm looking for a solution in the standard library.

PS: Maybe using a data structure is not the best way to do this? Would reading the file containing the words each time be more efficient?

Edit: It's a small project. I have to deal with efficiency and memory usage.

Last edit: I finally chose HashSet.

score 5 · Accepted Answer

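Use a Java Set, for example a TreeSet, which keeps its elements in sorted order. Searching can therefore work much like a binary search, so lookups are fast (O(log n)), and there are no duplicates.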

Here is the structure of the Java Set hierarchy:

[image: diagram of the Java Set hierarchy]

In addition, a Set does not allow duplicates, which reduces redundancy and saves memory.

If you want to know the complexity of the various search algorithms, see this link:

http://bigocheatsheet.com/
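A minimal sketch of the membership check using only the standard library, assuming the words are already available as a list. The class name `WordSetDemo`, the helper methods, and the sample words are made up for illustration. It puts `TreeSet` next to `HashSet`, the structure the asker finally chose:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class WordSetDemo {
    // Hash-based set: average O(1) membership tests, no ordering guarantee.
    static Set<String> buildHashSet(List<String> words) {
        return new HashSet<>(words);
    }

    // Sorted set: O(log n) membership tests, iteration in sorted order.
    static Set<String> buildTreeSet(List<String> words) {
        return new TreeSet<>(words);
    }

    public static void main(String[] args) {
        // Hypothetical sample data; in practice the words would come from a file.
        List<String> words = Arrays.asList("hello", "help", "hemp", "hello");
        Set<String> hashed = buildHashSet(words);
        Set<String> sorted = buildTreeSet(words);

        System.out.println(hashed.contains("help"));    // true
        System.out.println(sorted.contains("helmet"));  // false
        System.out.println(hashed.size());              // 3 (the duplicate "hello" is dropped)
        System.out.println(sorted);                     // [hello, help, hemp]
    }
}
```

If only yes/no membership matters, `HashSet` is usually the simpler fit; `TreeSet` becomes interesting when the words also need to be iterated in sorted order.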

score 3 · Accepted Answer

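Depending on the distribution of the words, use a Trie or a Patricia tree. Personally I would go with a Patricia tree, since it is more memory-efficient (though it is harder to implement).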
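To make the idea concrete, here is a rough sketch of a plain (uncompressed) trie that supports only insertion and exact-word lookup. The class name `SimpleTrie` and its methods are hypothetical, not part of the standard library. A Patricia tree would additionally merge chains of single-child nodes into one edge, which this sketch does not attempt:

```java
import java.util.HashMap;
import java.util.Map;

class SimpleTrie {
    // One node per character position; isWord marks the end of a stored word.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Walks down from the root, creating missing nodes along the way.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    // Returns true only if the exact word was added, not merely a prefix of one.
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }
}
```

Lookup cost depends on the length of the word rather than on the number of stored words. Whether this saves memory compared with a `HashSet` depends on how many prefixes the words share, so it is worth measuring on the real word list.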

score 0 · Accepted Answer

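Maybe you want to test my TrieMap or TrieSet implementation (found here)? I wrote them specifically for cases like this. So far I have implemented tries for String and byte[] keys.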

    TrieSet<String> t = Tries.newStringTrieSet();

    t.add("hello");
    t.add("help");
    t.add("hell");
    t.add("helmet");
    t.add("hemp");

    List<String> resultsA = new ArrayList<>();
    t.findElements("hel", true, resultsA);    // search for prefix

    List<String> resultsB = new ArrayList<>();
    t.findElements("ell", false, resultsB);   // search for substring

    System.out.println("A: " + resultsA);
    System.out.println("B: " + resultsB);

This will print:

A: [hell, hello, helmet, help]
B: [hell, hello]

score 0 · Accepted Answer

This looks pretty good to me, though I don't know whether I'm wrong for some reason:

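// Needs: import java.util.*;
// Put all your words into an ArrayList and sort the list.
// "scanner" is assumed to be a java.util.Scanner over the word source.
List<String> arr = new ArrayList<>();
while (scanner.hasNext())
    arr.add(scanner.next());
Collections.sort(arr);

// This is your search method.
boolean mysearch(String keyword) {
    // binarySearch returns a non-negative index only when the key is found.
    return Collections.binarySearch(arr, keyword) >= 0;
}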

Performance: O(n log n) for inserting the data (dominated by the sort) and O(log n) for each search.

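Assume each string averages 20 B: 20 B × 200,000 = 4 MB of space (character data only; each Java String object adds some per-object overhead on top of that).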
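For completeness, here is a self-contained sketch of the same sorted-list approach, wrapped in a small class that loads the word file once instead of re-reading it for every lookup. The class name `SortedWordList`, the `fromFile` helper, and the one-word-per-line file format are assumptions made for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedWordList {
    private final List<String> words;

    // Copies the input and sorts it once: O(n log n).
    SortedWordList(List<String> input) {
        words = new ArrayList<>(input);
        Collections.sort(words);
    }

    // Assumes a UTF-8 text file with one word per line.
    static SortedWordList fromFile(Path file) throws IOException {
        return new SortedWordList(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    // O(log n) lookup; binarySearch returns a negative value when the key is absent.
    boolean contains(String keyword) {
        return Collections.binarySearch(words, keyword) >= 0;
    }
}
```

Compared with a `HashSet`, a sorted `ArrayList` stores only one reference per word with no per-entry node objects, at the cost of O(log n) instead of average O(1) lookups.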
