6

嗨,我有以下问题:我正在存储字符串和相应的整数值列表,MultiValueMap<String, Integer> 我正在存储大约 130000 亿个字符串,一个字符串最多可以有 500 个或更多值。对于每一个值,我将在地图上随机访问。所以最坏的情况是 13 000 000 * 500 个看跌期权。现在地图的速度很好,但内存开销变得相当高。AMultiValueMap<String, Integer>什么都不是,然后 a HashMap/TreeMap<String, <ArrayList<Integer>>。HashMap 和 TreeMap 都有相当多的内存开销。完成后我不会修改地图,但我需要它快速且尽可能小,以便在程序中进行随机访问。(我将它存储在磁盘上并在启动时加载它,序列化的地图文件占用大约 600mb 但在内存中大约 3gb?)

最节省内存的事情是,将字符串存储在已排序的字符串数组中,并为值提供相应的二维 int 数组。因此访问将是对字符串数组的二进制搜索并获取相应的值。

现在我有三种方法可以到达那里:

  1. 我在创建阶段使用排序的 MultivalueMap (TreeMap) 来存储所有内容。在完成获取所有值之后,我通过调用map.keyset().toArray(new String[0]);Make a 二维 int 数组来获取字符串数组,并从 multivaluemap 中获取所有值。优点:很容易实现,在创建过程中仍然很快。缺点:在从 Map 复制到 Arrays 的过程中,它占用了更多的内存。

  2. 我从一开始就使用 Arrays 或 ArrayLists 并将所有内容存储在其中 Pro:内存开销最小。缺点:这将非常慢,因为每次添加新键时我都必须对数组进行排序/复制,而且我需要实现自己的(可能更慢)排序以保持相应的 int 数组的顺序相同,例如字符串。难以实施

  3. 我使用数组和 MultivalueMap 作为缓冲区。在程序完成 10% 或 20% 的创建阶段后,我会将值添加到 Arrays 并保持它们的顺序,然后启动一个新的 Map。优点:可能仍然足够快且内存足够高效。缺点:难以实施。

这些解决方案对我来说都不合适。你知道这个问题的任何其他解决方案,也许是内存高效的 (MultiValue)Map 实现?

我知道我可能正在使用数据库,所以不要费心将其发布为答案。我想知道如何在不使用数据库的情况下做到这一点。

4

5 回答 5

5

如果你切换到 Guava 的 Multimap——我不知道这是否适用于你的应用程序——你也许可以使用 Trove 并获得

ListMultimap<String, Integer> multimap = Multimaps.newListMultimap(
  new HashMap<String, Collection<Integer>>(),
  new Supplier<List<Integer>>() {
    public List<Integer> get() {
      return new TIntListDecorator();
    }
  });

这将使 aListMultimap使用 aHashMap映射到List由数组支持的值,这应该是内存效率的,尽管您会因为装箱而int[]付出一点速度损失。您也许可以为 做类似的事情MultiValueMap,尽管我不知道它来自哪个库。

于 2012-02-16T22:22:10.137 回答
2

您可以使用压缩字符串来大幅减少内存使用量。

此外,还有其他更激进的解决方案(需要重新实现):

于 2012-02-16T22:20:20.393 回答
2

首先,考虑整数占用的内存。你说范围大约是0-4000000。24 位足以表示 16777216 个不同的值。如果可以接受,您可以对整数使用字节数组,每个整数 3 个字节,并节省 25%。您必须像这样索引到数组中:

int getPackedInt(byte[] array, int index) {
  int i = index*3;
  return ((array[i] & 0xFF)<<16) + ((array[i+1] & 0xFF) <<8) + (array[i+2] & 0xFF);
}
int storePackedInt(byte[] array, int index, int value) {
  assert value >= 0 && value <= 0xFFFFFF;
  int i = index*3;
  array[i]   = (byte)((value>>16) & 0xFF);
  array[i+1] = (byte)((value>>8)  & 0xFF);
  array[i+2] = (byte)(value & 0xFF);
}

你能谈谈整数的分布吗?如果它们中的许多都适合 16 位,则可以使用每个数字的字节数可变的编码(类似于 UTF-8 用于表示字符的方式)。

接下来,考虑是否可以在字符串上节省内存。弦乐有什么特点?他们通常会持续多久?许多字符串会共享前缀吗?根据您的应用程序的特点量身定制的压缩方案可以节省大量空间(正如 falsarella 指出的那样)。或者,如果许多字符串将共享前缀,则将它们存储在某种类型的搜索树中可能会更有效。(有一种叫做“patricia”的树可能适合这个应用程序。)作为奖励,请注意在树中搜索字符串可能比搜索哈希映射更快(尽管您必须进行基准测试才能看到如果这在您的应用程序中是正确的)。

字符串都是ASCII吗?如果是这样,将浪费 50% 的用于字符串的内存,因为 Javachar是 16 位的。同样,在这种情况下,您可以考虑使用字节数组。

如果您只需要查找字符串,而不是遍历存储的字符串,您还可以考虑一些非常规的方法:对字符串进行哈希处理,并且只保留哈希。由于不同的 String 可以散列到相同的值,因此从未存储过的 String 仍有可能通过搜索“找到”。但是,如果您使用足够多的位作为散列值(以及一个好的散列函数),您可以使这个机会变得非常小,以至于在估计的宇宙寿命中几乎肯定不会发生。

最后是结构本身的内存,它保存字符串和整数。我已经建议使用 trie,但是如果您决定不这样做,那么没有什么比并行数组使用更少的内存了——一个排序的字符串数组(如您所说,您可以对其进行二进制搜索)和一个并行数组整数数组。在执行二进制搜索以找到字符串数组的索引后,您可以使用相同的索引来访问整数数组数组。

在构建结构时,如果您确实认为搜索树是一个不错的选择,我会直接使用它。否则,您可以执行 2 次传递:一次构建一组字符串(然后将它们放入一个数组并对其进行排序),第二次传递以添加整数数组。

于 2012-02-16T22:31:46.413 回答
2

如果您的关键字符串存在模式,尤其是常见的根,那么 aa Trie可能是存储显着减少数据的有效方法。

这是工作 TrieMap 的代码。

注意:关于使用EntrySet迭代Maps的通常建议不适用于Tries。它们的效率非常低,Trie因此请尽可能避免请求。

/**
 * Implementation of a Trie structure.
 * 
 * A Trie is a compact form of tree that takes advantage of common prefixes
 * to the keys. 
 * 
 * A normal HashSet will take the key and compute a hash from it, this hash will
 * be used to locate the value through various methods but usually some kind
 * of bucket system is used. The memory footprint resulting becomes something
 * like O(n).
 * 
 * A Trie structure essentuially combines all common prefixes into a single key.
 * For example, holding the strings A, AB, ABC and ABCD will only take enough 
 * space to record the presence of ABCD. The presence of the others will be 
 * recorded as flags within the record of ABCD structure at zero cost.
 * 
 * This structure is useful for holding similar strings such as product IDs or
 * credit card numbers.
 *  
 */
public class TrieMap<V> extends AbstractMap<String, V> implements Map<String, V> {


  /**
   * Map each character to a sub-trie.
   * 
   * Could replace this with a 256 entry array of Tries but this will handle
   * multibyte character sets and I can discard empty maps. 
   * 
   * Maintained at null until needed (for better memory footprint).
   * 
   */
  protected Map<Character, TrieMap<V>> children = null;

  /**
   * Here we store the map contents.
   */
  protected V leaf = null;

  /**
   * Set the leaf value to a new setting and return the old one.
   * 
   * @param newValue
   * @return old value of leaf.
   */
  protected V setLeaf(V newValue) {
    V old = leaf;
    leaf = newValue;
    return old;
  }

  /**
   * I've always wanted to name a method something like this.
   */
  protected void makeChildren () {
    if ( children == null ) {
      // Use a TreeMap to ensure sorted iteration.
      children = new TreeMap<Character, TrieMap<V>>();
    }
  }

  /**
   * Finds the TrieMap that "should" contain the key.
   * 
   * @param key 
   * 
   * The key to find.
   * 
   * @param grow 
   * 
   * Set to true to grow the Trie to fit the key.
   * 
   * @return 
   * 
   * The sub Trie that "should" contain the key or null if key was not found and
   * grow was false.
   */
  protected TrieMap<V> find(String key, boolean grow) {
    if (key.length() == 0) {
      // Found it!
      return this;
    } else {
      // Not at end of string.
      if (grow) {
        // Grow the tree.
        makeChildren();
      }
      if (children != null) {
        // Ask the kids.
        char ch = key.charAt(0);
        TrieMap<V> child = children.get(ch);
        if (child == null && grow) {
          // Make the child.
          child = new TrieMap<V>();
          // Store the child.
          children.put(ch, child);
        }
        if (child != null) {
          // Find it in the child.
          return child.find(tail(key), grow);
        }
      }
    }
    return null;

  }

  /**
   * Remove the head (first character) from the string.
   * 
   * @param s
   * 
   * The string.
   * 
   * @return 
   * 
   * The same string without the first (head) character.
   * 
   */
  // Suppress warnings over taking a subsequence
  private String tail(String s) {
    return s.substring(1, s.length());
  }

  /**
   * 
   * Add a new value to the map.
   * 
   * Time footprint = O(s.length).
   * 
   * @param s
   * 
   * The key defining the place to add.
   * 
   * @param value
   * 
   * The value to add there.
   * 
   * @return
   * 
   * The value that was there, or null if it wasn't.
   * 
   */
  @Override
  public V put(String key, V value) {
    V old = null;

    // If empty string.
    if (key.length() == 0) {
      old = setLeaf(value);
    } else {
      // Find it.
      old = find(key, true).put("", value);
    }

    return old;
  }

  /**
   * Gets the value at the specified key position.
   * 
   * @param o
   * 
   * The key to the location.
   * 
   * @return 
   * 
   * The value at that location, or null if there is no value at that location.
   */
  @Override
  public V get(Object o) {
    V got = null;
    if (o != null) {
      String key = (String) o;
      TrieMap<V> it = find(key, false);
      if (it != null) {
        got = it.leaf;
      }
    } else {
      throw new NullPointerException("Nulls not allowed.");
    }
    return got;
  }

  /**
   * Remove the value at the specified location.
   * 
   * @param o
   * 
   * The key to the location.
   * 
   * @return 
   * 
   * The value that was removed, or null if there was no value at that location.
   */
  @Override
  public V remove(Object o) {
    V old = null;
    if (o != null) {
      String key = (String) o;
      if (key.length() == 0) {
        // Its me!
        old = leaf;
        leaf = null;
      } else {
        TrieMap<V> it = find(key, false);
        if (it != null) {
          old = it.remove("");
        }
      }
    } else {
      throw new NullPointerException("Nulls not allowed.");
    }
    return old;
  }

  /**
   * Count the number of values in the structure.
   * 
   * @return 
   * 
   * The number of values in the structure.
   */
  @Override
  public int size() {
    // If I am a leaf then size increases by 1.
    int size = leaf != null ? 1 : 0;
    if (children != null) {
      // Add sizes of all my children.
      for (Character c : children.keySet()) {
        size += children.get(c).size();
      }
    }
    return size;
  }

  /**
   * Is the tree empty?
   * 
   * @return 
   * 
   * true if the tree is empty.
   * false if there is still at least one value in the tree.
   */
  @Override
  public boolean isEmpty() {
    // I am empty if I am not a leaf and I have no children 
    // (slightly quicker than the AbstaractCollection implementation).
    return leaf == null && (children == null || children.isEmpty());
  }

  /**
   * Returns all keys as a Set.
   * 
   * @return 
   * 
   * A HashSet of all keys.
   * 
   * Note: Although it returns Set<S> it is actually a Set<String> that has been
   * home-grown because the original keys are not stored in the structure 
   * anywhere.
   */
  @Override
  public Set<String> keySet() {
    // Roll them a temporary list and give them a Set from it.
    return new HashSet<String>(keyList());
  }

  /**
   * List all my keys.
   * 
   * @return 
   * 
   * An ArrayList of all keys in the tree.
   * 
   * Note: Although it returns List<S> it is actually a List<String> that has been
   * home-grown because the original keys are not stored in the structure 
   * anywhere.
   * 
   */
  protected List<String> keyList() {
    List<String> contents = new ArrayList<String>();

    if (leaf != null) {
      // If I am a leaf, a null string is in the set.
      contents.add((String) "");
    }

    // Add all sub-tries.
    if (children != null) {
      for (Character c : children.keySet()) {
        TrieMap<V> child = children.get(c);
        List<String> childContents = child.keyList();
        for (String subString : childContents) {
          // All possible substrings can be prepended with this character.
          contents.add((String) (c + subString.toString()));
        }
      }
    }

    return contents;
  }

  /**
   * Does the map contain the specified key.
   * 
   * @param key
   * 
   * The key to look for.
   * 
   * @return 
   * 
   * true if the key is in the Map.
   * false if not.
   */
  public boolean containsKey(String key) {
    TrieMap<V> it = find(key, false);
    if (it != null) {
      return it.leaf != null;
    }
    return false;
  }

  /**
   * Represent me as a list.
   * 
   * @return 
   * 
   * A String representation of the tree.
   */
  @Override
  public String toString() {
    List<String> list = keyList();
    //Collections.sort((List<String>)list);
    StringBuilder sb = new StringBuilder();
    Separator comma = new Separator(",");
    sb.append("{");
    for (String s : list) {
      sb.append(comma.sep()).append(s).append("=").append(get(s));
    }
    sb.append("}");
    return sb.toString();
  }

  /**
   * Clear down completely.
   */
  @Override
  public void clear() {
    children = null;
    leaf = null;
  }

  /**
   * Return a list of key/value pairs.
   * 
   * @return 
   * 
   * The entry set.
   */
  public Set<Map.Entry<String, V>> entrySet() {
    Set<Map.Entry<String, V>> entries = new HashSet<Map.Entry<String, V>>();
    List<String> keys = keyList();
    for (String key : keys) {
      entries.add(new Entry<String,V>(key, get(key)));
    }
    return entries;
  }

  /**
   * An entry.
   * 
   * @param <S>
   * 
   * The type of the key.
   * 
   * @param <V> 
   * 
   * The type of the value.
   */
  private static class Entry<S, V> implements Map.Entry<S, V> {

    protected S key;
    protected V value;

    public Entry(S key, V value) {
      this.key = key;
      this.value = value;
    }

    public S getKey() {
      return key;
    }

    public V getValue() {
      return value;
    }

    public V setValue(V newValue) {
      V oldValue = value;
      value = newValue;
      return oldValue;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof TrieMap.Entry)) {
        return false;
      }
      Entry e = (Entry) o;
      return (key == null ? e.getKey() == null : key.equals(e.getKey()))
              && (value == null ? e.getValue() == null : value.equals(e.getValue()));
    }

    @Override
    public int hashCode() {
      int keyHash = (key == null ? 0 : key.hashCode());
      int valueHash = (value == null ? 0 : value.hashCode());
      return keyHash ^ valueHash;
    }

    @Override
    public String toString() {
      return key + "=" + value;
    }
  }
}
于 2012-02-16T23:51:24.470 回答
2

根据您在映射中存储的 Integer 值,大量的堆内存开销可能是由具有不同的 Integer 实例引起的,这比原始 int 值占用更多的 RAM。

考虑使用Mapfrom to浮动String的众多实现之一(例如在ColtJava 的 Primitive Collections 中),它基本上实现了一个由 int 数组支持的 List,而不是由一个 Integer 实例数组支持。IntArrayList

于 2012-02-16T21:52:11.443 回答