java - Burrows Wheeler 变换的优化

Question

您好，我在优化burrows Wheeler 变换时遇到了一些困难。我正在尝试转换文本文件，但是转换像圣经这样的大文本文件需要的时间太长了。

关于如何进行的任何想法？

public BurrowsWheelerTransformEncoder()
{

}

private String originalSuffix(int index, String string)
{
    String temp = (string.substring(index,string.length()) + string.substring(0,index));

    //this bit just 'compresses' each transformation of text by producing
    //a prefix, so 'abracadabra' just becomes 'abrac'
    //this is so minimal amount of memory is used when it is stored in an array

    return temp.substring(0,5)+
    //the last character of the transformation is kept
           temp.charAt(temp.length()-1);
}

private String compressedSuffix(String string)
{
    //this method just 'compresses' original piece of text by producing
    //a prefix, so 'abracadabra' just becomes 'abrac'
    //this is so comprisons won't take so long
    return string.substring(0,5)+string.charAt(string.length()-1);
}

public static void main(String args[]) throws Exception
{
    BurrowsWheelerTransformEncoder encoder = new BurrowsWheelerTransformEncoder();
    BufferedReader input = new BufferedReader(new FileReader("src/compressionalgorithm/texts/manifesto.txt"));

    String text = "";
    //the row in the sorted array where the original text can be found
    int originalRow = 0;
    //system time when program began
    long startTime = System.nanoTime();

    //get text from file
    while(input.ready())
    {
        text += input.readLine();
    }
    //create a new array to hold all transformations
    String[] textArray = new String[text.length()];
    int length = text.length();

    //get individual transformations and put in array
    for(int i = 0; i < text.length(); i++)
    {
        textArray[i] = encoder.originalSuffix(i,text);
        //for debugging large text files, prints progress after every 10k'th 
        //transformation
        if(i%10000==0)
        System.out.println(i+"/"+length);
    }
    //uses java's internal methods to sort the array, presumably 
    //the most efficient way to do the sort (for now)
    Arrays.sort(textArray);

    String compressedOriginalText = encoder.compressedSuffix(text);

    //print the results
    for(int i = 0; i < textArray.length; i++)
    {
        if(textArray[i].equals(compressedOriginalText))
        {
            originalRow = i;
        }
        if(i%100==0)
        {
            System.out.println();
        }
        System.out.print(textArray[i].charAt(textArray[i].length()-1));
    }
    System.out.println("\nThe original transformation of the text was found at row " + originalRow + " of the sorted array.");
    System.out.println("Time elapsed: " + (System.nanoTime() - startTime));
 }

score 3 · Accepted Answer

对于编码情况，您不需要实际构建字符串数组 - 使用 int （或 long 取决于您的文件大小）数组来存储旋转字符串开始的索引。

创建一个初始化为 [0 1 2 3 ... n] 的数组

使用以下 compareTo 对数组进行排序（假设compareTo()可以访问原始字符串，original）：

int compareTo(int a, int b){
    int compare, len = original.length();
    do{
        char _a = original.charAt(a), _b = original.charAt(b);
        compare = _a-_b;
        a++; b++;
        if(a>=len)a-=len;
        if(b>=len)b-=len;
    }while(compare==0);
    return compare;
}

注意数组中“0”的索引，并将其作为“开始”值添加到您的输出中

对于反转，我们再次希望避免为像圣经一样大的文本构建整个表格。我们可以通过使用第一行和最后一行中相同的标记总是以相同的顺序来做到这一点。之所以如此，是因为第一行是排序的，并且token是循环排列的：对于最后一行连续三个b，它们后面的token是排序的，所以b是排序的。所以要反转：

对输出标记进行排序。除了存储排序的标记外，还要存储每个标记开始的索引。因此，对于未排序的标记“nbnaaa”，您将存储 [3 4 5 2 0 1] 和“aaabnn”。重要提示：您必须在此步骤中使用稳定的排序。

使用前面提到的“开始”值来重建字符串：

string decode(string sorted, int[]index, int start){
    string answer = ""+sorted.charAt(start);
    int next = index[start];
    while(next!=start){
        answer = sorted.charAt(next) + answer;
        next = index[next];
    }
    return answer;
}

score 1 · Accepted Answer

这一行：

    String temp = (string.substring(index,string.length()) + string.substring(0,index));

每次调用它时都会创建整个输入文本的副本。由于您为 N 个字符的输入文本调用 N 次，因此您的算法将是O(N^2).

看看您是否可以优化该originalSuffix方法以避免该复制。

java - Burrows Wheeler 变换的优化

2 回答 2

Related

Reference