java - Memory and speed efficient search on Strings

Question

I have a bunch of Strings I'd like a fast lookup for. Each String is 22 chars long and is looked up by the first 12 only (the "key" so to say), the full set of Strings is recreated periodically. They are loaded from a file and refreshed when the file changes. I have to deal with too little available memory, other server processes on my VPS need it too and need it more.

How do I best store the Strings and search for them?

My current idea is to store them all one after another inside a char[] (to save RAM), and sort them for faster lookups (I figure the lookup is fastest if I have them presorted so I can use binary or interpolation search). But I'm not exactly sure how I should code it - if anyone is in the mood for a challenging puzzle: here it is...

Btw: It's probably ok to exceed the memory constraints for a while during the recreation / sorting, but it shouldn't be by much or for long.

Thanks!

Update

For the "I want to know specifics" crowd (correct me if I'm wrong in the Java details): The source files contain about 320 000 entries (all ANSI text), I really want to stay (WAY!) below 64 MB RAM usage and the data is only part of my program. Here's some information on sizes of Java types in memory.

My VPS is a 32bit OS, so...

one byte[], all concatenated = 12 + length bytes
one char[], all concatenated = 12 + length * 2 bytes
String = 32 + length * 2 bytes (is Object, has char[] + 3 int)

So I have to keep in memory:

~7 MB if all are stored in a byte[]
~14 MB if all are stored in a char[]
~25 MB if all are stored in a String[]
> 40 MB if they are stored in a HashTable / Map (for which I'd probably have to finetune the initial capacity)

A HashTable is not magical - it helps on insertion, but in principle it's just a very long array of String where the hashCode modulus capacity is used as an index, the data is stored in the next free position after the index and searched lineary if it's not found there on lookup. But for a Hashtable, I'd need the String itself and a substring of the first 12 chars for lookup. I don't want that (or do I miss something here?), sorry folks...

score 1 · Accepted Answer

听起来HashTable将是这种情况的正确实现。

搜索在恒定时间内完成，刷新可以在线性时间内完成。

Java 数据结构 Big-O（警告 PDF）

score 1 · Accepted Answer

我可能会为此使用缓存解决方案，甚至可能是番石榴。当然对它们进行排序，然后是二进制搜索。不幸的是，我没有时间 :(

score 1 · Accepted Answer

我自己编写了一个解决方案 - 但它与我发布的问题有点不同，因为我可以使用我没有发布的信息（我下次会做得更好，抱歉）。

我只是回答这个问题，因为它已经解决了，我不会接受其他答案之一，因为它们并没有真正帮助解决内存限制（而且对我的口味来说有点短）。他们仍然得到了每个人的支持，没有难过的感觉，感谢您抽出宝贵的时间！

我设法将所有信息推送到两个 long 中（密钥完全位于第一个中）。前 12 个字符是一个 ISIN，可以压缩成一个长字符，因为它只使用数字和大写字母，总是以两个大写字母开头，以一个可以从其他字符重构的数字结尾。所有可能值的乘积留下了略多于 3 位的余量。

我将源文件中的所有条目存储在一个long[]（首先打包的 ISIN，第二个 long 中的其他内容）中，并根据两个 long 中的第一个对它们进行排序。

当我通过键进行查询时，我将其转换为 long，进行二进制搜索（我可能会更改为插值搜索）并返回匹配索引。值的不同部分可以通过所述索引检索 - 我从数组中获取第二个 long，解包并返回请求的数据。

结果：RAM 使用量从 ~110 MB 下降到 < 50 MB，包括 Jetty（顺便说一句——我之前使用过 HashTable），并且查找速度快如闪电。

java - Memory and speed efficient search on Strings

3 回答 3

Related