java - String.intern() vs. manual string-to-identifier mapping?

Question

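I recall seeing a few string-intensive programs that do a lot of string comparison but relatively little string manipulation, and that use a separate table to map strings to identifiers for efficient equality checks and a lower memory footprint, e.g.: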

import java.util.HashMap;
import java.util.Map;

public class Name {
    // Any Map implementation works here; HashMap stands in for the placeholder "SomeMap".
    public static Map<String, Name> names = new HashMap<String, Name>();
    public static Name from(String s) {
        Name n = names.get(s);
        if (n == null) {
            n = new Name(s);
            names.put(s, n);
        }
        return n;
    }
    private final String str;
    private Name(String str) { this.str = str; }
    @Override public String toString() { return str; }
    // equals() and hashCode() are not overridden!
}

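I'm pretty sure one of those programs was javac from OpenJDK, so not some toy application. Of course the actual class was more complex (and I think it also implemented CharSequence), but you get the idea - the entire program was littered with Name in every place you would expect String, and in the rare cases where string manipulation was needed, it converted to strings and then cached the results again, conceptually like: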

Name newName = Name.from(name.toString().substring(5));

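I think I understand the point of this - especially when there are a lot of identical strings around and a lot of comparisons - but couldn't the same be achieved by just using regular strings and interning them? The documentation for String.intern() explicitly says: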

...
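When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.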

It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true.
...
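
A minimal sketch of what that guarantee means in practice. The class and method names (InternDemo, sameAfterIntern) are made up for illustration and are not part of any API:

```java
final class InternDemo {
    // True when both strings resolve to the same pooled instance.
    static boolean sameAfterIntern(String s, String t) {
        return s.intern() == t.intern();
    }

    public static void main(String[] args) {
        String s = new String("hello");                               // distinct heap object
        String t = new StringBuilder("hel").append("lo").toString();  // another distinct object

        System.out.println(s == t);                 // false: different objects
        System.out.println(s.equals(t));            // true: same contents
        System.out.println(sameAfterIntern(s, t));  // true: same pooled instance
    }
}
```

Reference comparison is only reliable if every string involved has gone through intern(); a single un-interned string breaks it silently.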

So, what are the pros and cons of manually managing a Name-like class vs. using intern()?

What I've thought of so far:

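Manually managing the map means using the regular heap; intern() uses the permgen.
When manually managing the map you get type checking that can verify something is a Name, while an interned String and a non-interned String share the same type, so it's possible to forget interning in some places.
Relying on intern() means reusing an existing, optimized, tried-and-tested mechanism without writing any extra classes.
Manually managing the map makes the code more confusing to new users, and string operations become more cumbersome.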

... but I feel like I'm missing something else here.

score 2 · Accepted Answer

Unfortunately, String.intern() can be slower than a simple synchronized HashMap. It doesn't need to be that slow, but in Oracle's JDK it is slow (probably because of JNI).

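Another thing to consider: you are writing a parser; you have collected some chars in a char[], and you need to make a String out of them. Since the string is probably common and can be shared, we'd like to use a pool.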

String.intern() uses such a pool; yet to look something up, you need a String to begin with. So we have to call new String(char[],offset,length) first.

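We can avoid that overhead in a custom pool, where the lookup can be done directly on char[],offset,length. For example, the pool can be a trie. The string will most likely already be in the pool, so we get the String without any memory allocation.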
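
A rough sketch of such a trie-backed pool, under stated assumptions: TriePool and its get method are hypothetical names, and the children are kept in a HashMap<Character, Node> for brevity, which may box chars outside the small cached range. A production version would likely use a more compact child table.

```java
import java.util.HashMap;
import java.util.Map;

final class TriePool {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String value; // pooled String that ends at this node, or null if none yet
    }

    private final Node root = new Node();

    // Looks up (or creates) the pooled String for buf[off .. off+len).
    String get(char[] buf, int off, int len) {
        Node n = root;
        for (int i = off; i < off + len; i++) {
            Node next = n.children.get(buf[i]);
            if (next == null) {            // first time this prefix is seen
                next = new Node();
                n.children.put(buf[i], next);
            }
            n = next;
        }
        if (n.value == null) {
            n.value = new String(buf, off, len); // the String is built only on a miss
        }
        return n.value;
    }
}
```

On a hit, the walk reads the characters straight from the buffer and returns the existing String, so no new String is constructed.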

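If we don't want to write our own pool and would rather use the good old HashMap, we still need to create a key object that wraps char[],offset,length (something like a CharSequence). This is still cheaper than a new String, since we don't copy the chars.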
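
One way this could look, as a sketch only: SlicePool and Slice are hypothetical names. The probe key views the caller's buffer without copying. On a miss, the stored key is backed by its own copy so that later reuse of the buffer cannot corrupt the map.

```java
import java.util.HashMap;
import java.util.Map;

final class SlicePool {
    // Key that views a region of a char[] without copying it.
    private static final class Slice {
        final char[] buf; final int off; final int len; final int hash;

        Slice(char[] buf, int off, int len) {
            this.buf = buf; this.off = off; this.len = len;
            int h = 0;
            for (int i = 0; i < len; i++) h = 31 * h + buf[off + i];
            this.hash = h;
        }

        @Override public int hashCode() { return hash; }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Slice)) return false;
            Slice s = (Slice) o;
            if (s.len != len) return false;
            for (int i = 0; i < len; i++) {
                if (buf[off + i] != s.buf[s.off + i]) return false;
            }
            return true;
        }
    }

    private final Map<Slice, String> pool = new HashMap<>();

    String get(char[] buf, int off, int len) {
        String s = pool.get(new Slice(buf, off, len)); // small probe object, no char copy
        if (s == null) {
            s = new String(buf, off, len);
            // Store a key over a private copy; the caller may overwrite buf later.
            pool.put(new Slice(s.toCharArray(), 0, len), s);
        }
        return s;
    }
}
```

Each lookup still allocates the small probe object, but the characters are copied only when a string is seen for the first time.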

score 1 · Accepted Answer

What are the pros and cons of manually managing a Name-like class vs. using intern()

Type checking is a major concern, but invariant preservation is also a significant one.

Adding a simple check to the Name constructor

Name(String s) {
  if (!isValidName(s)) { throw new IllegalArgumentException(s); }
  ...
}

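can ensure* that there are no Name instances corresponding to invalid names like "12#blue,,", which means that methods that take Names as arguments, or that consume Names returned by other methods, don't need to worry about where invalid Names might creep in.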
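
A self-contained sketch of such a guarded constructor. The class name ValidatedName and the validity rule are assumptions for illustration only; the answer does not say what isValidName checks, so this version simply accepts Java-identifier-like names.

```java
final class ValidatedName {
    private final String str;

    ValidatedName(String s) {
        if (!isValidName(s)) { throw new IllegalArgumentException(s); }
        this.str = s;
    }

    // Hypothetical rule, assumed for illustration: non-empty and shaped like a Java identifier.
    static boolean isValidName(String s) {
        if (s == null || s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) return false;
        }
        return true;
    }

    @Override public String toString() { return str; }
}
```

With this in place, any method that takes a ValidatedName parameter can rely on the check having already run, which a plain String parameter cannot promise.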

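To generalize this argument, imagine your code is a castle with walls designed to protect it from invalid inputs. You want some inputs to get through, so you install gates with guards that check inputs as they come in. The Name constructor is an example of a guard.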

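The difference between String and Name is that Strings can't be guarded against. Any piece of code, malicious or naive, inside or outside the perimeter, can create any string value. Buggy String manipulation code is analogous to a zombie outbreak inside the castle. The guards can't protect the invariants, since the zombies don't need to get past them. The zombies just spread and corrupt data as they go.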

That a value "is a" String satisfies fewer useful invariants than that a value "is a" Name.

See "stringly typed" for another way to look at the same topic.

* - The usual caveat applies: deserialization of a Serializable class allows the constructor to be bypassed.

score 1 · Accepted Answer

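I always use a Map, because intern() has to do a (possibly linear) search in the internal String pool. If you do this often, it is not as efficient as a Map - a Map is designed for fast lookups.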

score 1 · Accepted Answer

String.intern() in Java 5.0 & 6 uses the perm gen space, which usually has a low maximum size. This can mean you run out of space even though there is plenty of free heap.

Java 7 uses its regular heap to store intern()ed Strings.

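String comparison is pretty fast, and I don't think there is much advantage in cutting comparison times once you consider the overhead.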

Another reason this might be done is if there are many duplicate strings. If there is enough duplication, this can save a lot of memory.

A simpler way to cache Strings is to use an LRU cache such as a LinkedHashMap

private static final int MAX_SIZE = 10000;
private static final Map<String, String> STRING_CACHE = new LinkedHashMap<String, String>(MAX_SIZE*10/7, 0.70f, true) {
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > MAX_SIZE;
    }
};

public static String intern(String s) {
    // s2 is a String equal to s, or null if it's not there.
    String s2 = STRING_CACHE.get(s);
    if (s2 == null) {
        // put the string in the map if it's not there already.
        s2 = s;
        STRING_CACHE.put(s2,s2);
    }
    return s2;
}

Here is an example of how it works.

public static void main(String... args) {
    String lo = "lo";
    for (int i = 0; i < 10; i++) {
        String a = "hel" + lo + " " + (i & 1);
        String b = intern(a);
        System.out.println("String \"" + a + "\" has an id of "
                + Integer.toHexString(System.identityHashCode(a))
                + " after interning it has an id of "
                + Integer.toHexString(System.identityHashCode(b))
        );
    }
    System.out.println("The cache contains "+STRING_CACHE);
}

prints

String "hello 0" has an id of 237360be after interning it has an id of 237360be
String "hello 1" has an id of 5736ab79 after interning it has an id of 5736ab79
String "hello 0" has an id of 38b72ce1 after interning it has an id of 237360be
String "hello 1" has an id of 64a06824 after interning it has an id of 5736ab79
String "hello 0" has an id of 115d533d after interning it has an id of 237360be
String "hello 1" has an id of 603d2b3 after interning it has an id of 5736ab79
String "hello 0" has an id of 64fde8da after interning it has an id of 237360be
String "hello 1" has an id of 59c27402 after interning it has an id of 5736ab79
String "hello 0" has an id of 6d4e5d57 after interning it has an id of 237360be
String "hello 1" has an id of 2a36bb87 after interning it has an id of 5736ab79
The cache contains {hello 0=hello 0, hello 1=hello 1}

This ensures that the number of cached intern()ed Strings stays bounded.

A faster but less effective way is to use a fixed-size array.

private static final int MAX_SIZE = 10191;
private static final String[] STRING_CACHE = new String[MAX_SIZE];

public static String intern(String s) {
    int hash = (s.hashCode() & 0x7FFFFFFF) % MAX_SIZE;
    String s2 = STRING_CACHE[hash];
    if (!s.equals(s2))
        STRING_CACHE[hash] = s2 = s;
    return s2;
}

The test above works the same, except that you need

System.out.println("The cache contains "+ new HashSet<String>(Arrays.asList(STRING_CACHE)));

to print out the contents, which shows the following, including a null for the empty entries.

The cache contains [null, hello 1, hello 0]

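The advantage of this approach is speed, and that it can be used safely by multiple threads without locking, i.e. it doesn't matter if different threads have different views of STRING_CACHE.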

score 0 · Accepted Answer

So, what are the pros and cons of manually managing a Name-like class vs. using intern()?

One advantage is:

It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true.

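In a program that has to compare many small strings often, this may pay off. It also saves space in the end. Consider a source program that uses names like AbstractSyntaxTreeNodeItemFactorySerializer quite often. With intern(), this string is stored once and that is it. Everything else is just a reference to it, but you would have those references anyway.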
