java - Java LZW Compress working very slowly 20 seconds for 3MB

Question

Here is my code:

public static String compress(final String input)

{

HashMap<String, Integer> codes = new HashMap<String, Integer>();
for (int i = 0; i < 256; i++)
{
  codes.put((char) i + "", i);
}

StringBuilder outputString = new StringBuilder();

int max_code = 32767;
int next_code = 257;
String currentString = new String();
char c;

for (int i = 0; i < input.length(); i++)
{
  c = input.charAt(i);
  currentString = currentString + c;
  if (!codes.containsKey(currentString))
  {
    if (next_code <= max_code)
    {
      codes.put(currentString, next_code++);
    }
    currentString = currentString.substring(0, currentString.length() - 1);
    outputString.append(codes.get(currentString));
    currentString = c + "";
  }
}
outputString.append(codes.get(currentString));

return outputString.toString();

}

I have worked from the article: http://marknelson.us/2011/11/08/lzw-revisited/

I have read some articles stating that this method is naive and very slow: https://code.google.com/p/algorithms-and-datastructures-course/source/browse/trunk/AD_exercise_4/src/ad_exercise_4/controller/LZWNaive.java?r=38

How can I speed up the algorithm. At the moment it is taking 21 seconds to compress 3MB. Could somebody provide the pseudocode that I should be following to achieve quicker results. For example 1-2 seconds to compress 3MB.

I think the !HashMap.containsKey() is the line which costs an extreme amount of time. 16 out of the 21 seconds.

Regards.

score 0 · Accepted Answer

值得注意的一件事。String 类在 Java 中是不可变的。换句话说，使用 + 运算符附加到字符串实际上会创建一个新字符串。大量的字符串赋值操作会导致取消引用的字符串对象建立并触发垃圾收集，这会大大减慢你的速度。

至少，我会建议切换到 StringBuffer。如果没有大量的逻辑更改，您应该立即获得性能。但是 StringBuffer 仍然不是处理二进制数据的内存效率最高的方法，因为它被调整为处理不同字符集中的信息。对于压缩/解压缩，您不关心字符集，只关心位。

java.nio 包（Java 6）中的 ByteBuffer 将是一个巨大的飞跃。

score 0 · Accepted Answer

完成的一些操作currentString很昂贵，尤其是随着规模的currentString增长。

该声明：

    currentString = currentString + c;

循环遍历字符串中的所有字符并制作完整字符串 + 新字符的副本。

该行：

    if (!codes.containsKey(currentString))

利用currentString. 由于currentString每次都是一个新字符串，因此需要通过循环遍历整个字符串来计算哈希码（如果您每次都需要计算它，这会抵消哈希的用处）。

最后一行：

    currentString = currentString.substring(0, currentString.length() - 1);

还需要遍历整个字符串并制作它的新副本。

如果你想让这个程序快速运行，你需要消除一直循环遍历相同数据的需要。不要在每次要添加或删除字符时创建新字符串，而是使用某种缓冲区，您可以在其中添加和删除两端的字符。还要考虑另一种哈希码方案，因此您不需要仅仅因为您currentString使用 char 扩展而重新计算完整的哈希（通过循环整个字符串）。

java - Java LZW Compress working very slowly 20 seconds for 3MB

2 回答 2

Related

Reference