java - Binary search to find the longest common prefix

Question

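For a school assignment, we are implementing a suffix array, with methods for building it and for finding the longest common prefix. I managed to build and sort the suffix array quite easily, but I am struggling with the LCP.

I am trying to find the longest common prefix of a pattern string P in another string T, using one single binary search. The algorithm should return the index at which the longest common prefix begins.

Examples:

If the pattern string P is "racad" and the string T is "abracadabra", then the longest common prefix should be "racad", starting at index 2.

Likewise, if the pattern string P is "rax", then the longest common prefix should be "ra", starting at index 2 or 9.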
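
To make the expected results concrete, here is a minimal brute-force sketch of what the search is supposed to return. It is not the required binary search, and the class and method names (BruteForceLcp, commonPrefixLength, bestStart) are hypothetical; it simply compares the pattern against every suffix of the text.

```java
class BruteForceLcp {
    // Length of the common prefix of pattern and the suffix of text starting at 'start'.
    static int commonPrefixLength(String text, int start, String pattern) {
        int len = 0;
        while (start + len < text.length() && len < pattern.length()
                && text.charAt(start + len) == pattern.charAt(len)) {
            len++;
        }
        return len;
    }

    // First start index whose suffix shares the longest prefix with pattern (-1 for an empty text).
    static int bestStart(String text, String pattern) {
        int best = -1;
        int bestLen = -1;
        for (int i = 0; i < text.length(); i++) {
            int len = commonPrefixLength(text, i, pattern);
            if (len > bestLen) {
                bestLen = len;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(bestStart("abracadabra", "racad")); // 2
        System.out.println(bestStart("abracadabra", "rax"));   // 2 (index 9 ties with the same length)
    }
}
```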

I have come quite far, but the algorithm does not return the right value. Here is my code:

public int compareWithSuffix(int i, String pattern) {
     int c = 0;
     int j = 0;

    while (j < pattern.length() && c == 0) {
        if (i + j <= text.length()) {
        c = pattern.charAt(0 + j) - text.charAt(i + j);
        } else {
            c = 1;
        }
        j++;
    }
    return c;
}

public int binarySearch(String pattern) {
    int left = 0;
    int right = text.length() - 1;
    int mid, c = 0;

    while (c != 0 && left <= right) {
        mid = left + (right - left) / 2;
        c = compareWithSuffix(mid, pattern);

        if (c < 0) {
            right = mid - 1;
        } else if (c > 0) {
            left = mid + 1;
        } else if (c == 0) {
            return mid;
        }
    }
    return left;
}

I run it with this main method:

public static void main(String[] args) {
    String word = "abracadabra";
    String prefix1 = "rax";
    String prefix2 = "racad";
    SuffixArray s = new SuffixArray(word);

    System.out.println("Longest common prefix of: " + "'" + prefix1 + "'" + " in " + "'" + word + "'" + " begins at index: " + s.binarySearch(prefix1));
    System.out.println("Longest common prefix of: " + "'" + prefix2 + "'" + " in " + "'" + word + "'" + " begins at index: " + s.binarySearch(prefix2));
}

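The output is always whatever value I initialized the local variable left with.

The search algorithm has to be a single binary search. I have tried searching other Stack Overflow questions and other web sources, but have not found anything helpful.

Can anyone see any errors in my code?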

score 1 · Accepted Answer

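I haven't looked deeply enough to know whether this is the only problem in your code, but this immediately jumps out as an explanation for "The output is always whatever value I initialized the local variable left with":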

int mid, c = 0;

while (c != 0 && left <= right) {

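You set c to zero and then immediately check whether it is not equal to zero. Of course, c is equal to zero, so the loop condition is false right away and the loop body never runs. Hence you return the initial value of left.

It is not obvious why you are checking c at all. In the only situation where c becomes zero inside the loop, you return immediately. So just change your loop guard to: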

while (left <= right) {

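(and move the declaration of c inside the loop).

You could easily have found this by stepping through the code with a debugger. I heartily recommend learning how to use one.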
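
For reference, a minimal sketch of the search with the corrected loop guard follows. It goes beyond the one-line fix and rests on several assumptions that are not in the question: the class name SuffixArraySketch, the naive suffix sort, looking up text positions through the sorted suffixes array, the stricter bounds check in compareWithSuffix, and the final step that inspects the two neighbours of the insertion point when no suffix starts with the whole pattern.

```java
import java.util.Arrays;

class SuffixArraySketch {
    private final String text;
    private final Integer[] suffixes; // start indices of all suffixes, sorted lexicographically

    SuffixArraySketch(String text) {
        this.text = text;
        suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) {
            suffixes[i] = i;
        }
        // Naive construction by comparing substrings; acceptable for short inputs only.
        Arrays.sort(suffixes, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    // < 0 if pattern sorts before the suffix at i, > 0 if after, 0 if pattern is a prefix of it.
    int compareWithSuffix(int i, String pattern) {
        for (int j = 0; j < pattern.length(); j++) {
            if (i + j >= text.length()) {
                return 1; // suffix is exhausted, so the pattern sorts after it
            }
            int c = pattern.charAt(j) - text.charAt(i + j);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    // Number of leading characters shared by pattern and the suffix at i.
    int commonPrefixLength(int i, String pattern) {
        int len = 0;
        while (i + len < text.length() && len < pattern.length()
                && text.charAt(i + len) == pattern.charAt(len)) {
            len++;
        }
        return len;
    }

    // Start index in text of a suffix sharing the longest prefix with pattern (-1 for an empty text).
    int binarySearch(String pattern) {
        int left = 0;
        int right = suffixes.length - 1;
        while (left <= right) { // corrected guard: no check on c
            int mid = left + (right - left) / 2;
            int c = compareWithSuffix(suffixes[mid], pattern);
            if (c < 0) {
                right = mid - 1;
            } else if (c > 0) {
                left = mid + 1;
            } else {
                return suffixes[mid];
            }
        }
        // No full match: the pattern would be inserted at position 'left', so the
        // longest common prefix is shared with one of the two neighbouring suffixes.
        int best = -1;
        int bestLen = -1;
        for (int k = left - 1; k <= left; k++) {
            if (k < 0 || k >= suffixes.length) {
                continue;
            }
            int len = commonPrefixLength(suffixes[k], pattern);
            if (len > bestLen) {
                bestLen = len;
                best = suffixes[k];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SuffixArraySketch s = new SuffixArraySketch("abracadabra");
        System.out.println(s.binarySearch("racad")); // 2
        System.out.println(s.binarySearch("rax"));   // 2
    }
}
```

When several suffixes tie, as with "rax" at indices 2 and 9, this sketch returns whichever one the search happens to reach.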

score 0 · Accepted Answer

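I am providing a different answer here because the approach is completely different and leads to a general solution (finding the common substrings of a whole list of strings).

For each word, I build all of its possible substrings. A substring is determined by its start and end index. If a word has length L, the start index can be 0, 1, 2, ..., L-1. With an exclusive end index, as in String.substring, a start index of 0 allows end indices from 1 to L, that is L values; a start index of 1 allows L-1 values, and so on. A word of length L therefore has L + (L-1) + ... + 1 = L*(L+1)/2 substrings, counting repeats. This is quadratic in L, but that is not a problem because words rarely exceed 15 letters. If the string were a paragraph of text rather than a single word, the quadratic complexity would become a problem.

Next, after building the set of substrings for each word, I build the intersection of these sets. The main idea is that a substring common to several words is, first of all, a substring of each of those words, and it must occur in all of them. This leads to the idea of building the set of substrings for each word and then intersecting the sets.

After finding all common substrings, just iterate over them and keep the longest one.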

import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class Main4 {

    HashSet<String> allSubstrings(String input)
    {
        HashSet<String> result = new HashSet<String>();
        for(int i=0;i<=input.length()-1;i++)
            for(int j=i+1;j<=input.length();j++)
                result.add(input.substring(i,j));

        return result;
    }

    public HashSet<String> allCommonSubstrings(ArrayList<String> listOfStrings)
    {

        ArrayList<HashSet<String>> listOfSetsOfSubstrings =new ArrayList<HashSet<String>>();
        //for each string in the list, build the set of all its possible substrings
        for(int i=0;i<listOfStrings.size();i++)
        {
            String currentString = listOfStrings.get(i);
            HashSet<String> allSubstrings = allSubstrings(currentString);
            listOfSetsOfSubstrings.add(allSubstrings);
        }

        //find the intersection of all the sets of substrings
        HashSet<String> intersection = new HashSet<String>(listOfSetsOfSubstrings.get(0));
        for(int i=0;i<listOfSetsOfSubstrings.size();i++)
        {
            HashSet<String> currentSet=listOfSetsOfSubstrings.get(i);
            intersection.retainAll(currentSet);
            //retainAll does the set intersection. see: https://stackoverflow.com/questions/8882097/how-to-calculate-the-intersection-of-two-sets

        }

        return intersection;

    }

    public String longestCommonSubstring(HashSet<String> setOfSubstrings)
    {
        if(setOfSubstrings.size()==0)
            return null;//if there are no common substrings, then there is no longest common substrings

        String result="";
        Iterator<String> it = setOfSubstrings.iterator();
        while(it.hasNext())
        {
            String current = it.next();
            if(current.length()>result.length())
                result=current;
        }

        return result;
    }

    public static void main(String[] args)
    {
        Main4 m = new Main4();
        ArrayList<String> list=new ArrayList<String>();
        list.add("bbbaaddd1");
        list.add("bbbaaccc1");
        list.add("dddaaccc1");
        HashSet<String> hset = m.allCommonSubstrings(list);
        Iterator<String> it = hset.iterator();
        System.out.println("all common substrings:");
        while(it.hasNext())
        {
            System.out.println(it.next());
        }
        System.out.println("longest common substring:");
        String lcs=m.longestCommonSubstring(hset);
        System.out.println(lcs);
    }
}

Output:

all common substrings:
aa
a
1
longest common substring:
aa

score 0 · Accepted Answer

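First point: looking at the examples you gave, it seems you are not interested in the longest common prefix but in the longest common substring. A prefix always starts at the first letter of the word - https://en.wikipedia.org/wiki/Prefix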
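
A small sketch of the distinction, using the strings from the question. The class and method names (PrefixVsSubstring, isPrefix, isPrefixOfSuffix) are made up for illustration; String.startsWith and String.indexOf are standard Java.

```java
class PrefixVsSubstring {
    // True only if p starts at the first character of text.
    static boolean isPrefix(String text, String p) {
        return text.startsWith(p);
    }

    // True if p is a prefix of the suffix of text that begins at 'start'.
    static boolean isPrefixOfSuffix(String text, int start, String p) {
        return text.startsWith(p, start);
    }

    public static void main(String[] args) {
        String text = "abracadabra";
        System.out.println(isPrefix(text, "abra"));             // true
        System.out.println(isPrefix(text, "racad"));            // false: not a prefix of the text
        System.out.println(text.indexOf("racad"));              // 2: but it is a substring
        System.out.println(isPrefixOfSuffix(text, 2, "racad")); // true: prefix of the suffix at 2
    }
}
```

Seen this way, the question's wording can be read as "the longest common prefix between the pattern and some suffix of the text", which amounts to matching a prefix of the pattern somewhere inside the text.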

Second point: are you interested in the longest common substring of a whole set of words, or of just two words?

public class Main3 {

/*
same functionality as compareWithSuffix, but I think this name
is more suggestive; also, there is no need to pass i (the starting index) as a
parameter; I will later apply substring(i) to text
*/
public String longestCommonPrefix(String text, String pattern)
{
    String commonPrefix="";
    for(int j=0; j<text.length() && j<pattern.length(); j++)
    {
        if(text.charAt(j)==pattern.charAt(j))
        {
            commonPrefix=commonPrefix+text.charAt(j);
        }
        else
        {
            break;
        }
    }

    return commonPrefix;
    //for params "abc", "abd", output will be "ab"
}

public String longestCommonSequence(String s1, String s2)
{
    /*
    take for example "xab" and "yab"; in order to find the common
    sequence "ab", we need to chop off both x and y; for this reason
    I apply substring to both strings, progressively cutting off their first letters
    */
    String longestCommonSequence="";
    for(int i=0;i<=s1.length()-1;i++)
    {
        for(int j=0;j<=s2.length()-1;j++)
        {
            String s1_temp=s1.substring(i);
            String s2_temp=s2.substring(j);
            String commonSequence_temp=longestCommonPrefix(s1_temp, s2_temp);
            if(commonSequence_temp.length()>longestCommonSequence.length())
                longestCommonSequence=commonSequence_temp;
        }
    }

    return longestCommonSequence; 
}



public static void main(String args[])
{
    Main3 m = new Main3();
    String common = m.longestCommonSequence("111abcd2222", "33abcd444");
    System.out.println(common);//"abcd"

}
}
