java - Method to find the shortest substring containing the given words: optimization required

Question

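I have a program that requires me to find the shortest subsegment of a given string that contains a list of words. Even though my program is correct, it fails to finish within the execution time limit (5 seconds). I think the problem is the naive, overly complex algorithm I am using. It consists of nested loops and requires multiple scans of the list_of_words array. Here is the code of my search function. a[] contains the original string, stored word by word; b[] contains the list of words that must be found to form the subsegment. String g stores the temporary subsegment built from words of the original string, including the words in the list.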

private static void search() // Function to find the subsegment with a required list of words
{
   int tail,x;//counters 
   String c[]=new String[b.length]; //initializing a temporary array to copy the list of words array.

   for(int i =0; i<a.length;i++)// looping through the original string array
    {
       System.arraycopy(b, 0, c, 0, b.length);//copying the list of words array to the temporary array

        for (int j=0;j<b.length;j++)//looping through the list of words
        { 
            x=0; //counter for temporary array

            if(a[i].equalsIgnoreCase(b[j]))//checking for a match with list of words
            {
                tail=i;
//adds the words starting from the position of the first found word(tail) till all the words from the list are found
                while((x<b.length)&&(tail<a.length))

                {
                    g=g+" "+a[tail];//adds the words in the string g

                    for(int k=0;k<c.length;k++) //checks for the found word from the temporary array and replaces it with ""    
                    {
                        if(c[k].equalsIgnoreCase(a[tail]))
                        {
                            c[k]=""; x++;
                        }
                    }
                    tail++;

                }
                if((tail==a.length)&&(x!=b.length))//checks if the string g does not contain all the listed words
                {
                    g="";
                }
                else
                    {
                    count(g);g="";//check for the shortest string.
                    }
            }
        }

    }print();
}

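Sample:

Original string: This is a test. This is a programming test. a programming test this is.

Words to be found: this, test, a, programming.

Subsegments:

This is a test This is a programming

This is a programming test

a programming test a programming test this

programming test a programming test this

test a programming test this

a programming test this

Shortest subsegment: a programming test this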

Any suggestions about changing the data structures or the loop structure, or even about optimizing the same algorithm, would be helpful.

score 7 · Accepted Answer

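Dynamic programming solution:

Keep a last-position variable for each word you are looking for.

Keep a total count of the distinct target words seen so far (it never decreases; its maximum is the number of words you are looking for).

For each word position in the input:

If the word is in the list of words you are looking for, update the last position of that word.
If the updated last position was previously uninitialized, increment the total count.
If the total count equals the maximum, loop through the last positions and find the smallest one. The span from that position to the current one, inclusive (current - smallest + 1), is the length of the substring. Record these values and keep the best one over all positions.

An optimization is to keep the last positions in a heap, which reduces the time needed to find the smallest one. The heap should be paired with some structure (possibly a hash map or tree map) that allows fast lookup of a word's pointer into the heap.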

Example:

Input: This is a test. This is a programming test. a programming test this is

Looking for: this, test, a, programming

                1    2  3  4     5    6  7  8           9     10 11          12   13   14
                This is a  test. This is a  programming test. a  programming test this is
this         -1 1    1  1  1     5    5  5  5           5     5  5           5    13   13
test         -1 -1   -1 -1 4     4    4  4  4           9     9  9           12   12   12
a            -1 -1   -1 3  3     3    3  7  7           7     10 10          10   10   10
programming  -1 -1   -1 -1 -1    -1   -1 -1 8           8     8  11          11   11   11
Count        0  1    1  2  3     3    3  3  4           4     4  4           4    4    4
Substr len   NA NA   NA NA NA    NA   NA NA 5           5     6  7           8    4    5
Shortest len NA NA   NA NA NA    NA   NA NA 5           5     5  5           5    4    4

Best result: a programming test this, length = 4.

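Complexity analysis:

Let n be the number of words in the input and k the number of words we are looking for.

The algorithm makes a single pass over the input and, at each step, does O(log k) work for the getMin operation (with the heap optimization).

Thus the total time taken is O(n log k).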
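The steps above can be sketched in Java roughly as follows. This is only a sketch under some assumptions of mine: it uses a `TreeSet` of last positions in place of the heap (so that an old position can be removed in O(log k) without a separate pointer map), it assumes the input is already split into words with punctuation removed, and it assumes the target words are distinct. The class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class LastPositionWindow {

    /**
     * Returns the length (in words) of the shortest window containing every
     * target word, or -1 if no such window exists. Matching ignores case.
     */
    static int shortestWindow(List<String> text, List<String> targets) {
        // target word -> last position seen (-1 = not seen yet)
        Map<String, Integer> lastPos = new HashMap<>();
        for (String t : targets) {
            lastPos.put(t.toLowerCase(), -1);
        }
        // ordered set of the last positions of all targets seen so far;
        // this plays the role of the heap in the description above
        TreeSet<Integer> positions = new TreeSet<>();
        int best = -1;

        for (int i = 0; i < text.size(); i++) {
            String w = text.get(i).toLowerCase();
            Integer prev = lastPos.get(w);
            if (prev == null) {
                continue;                  // not a word we are looking for
            }
            if (prev >= 0) {
                positions.remove(prev);    // drop the outdated position, O(log k)
            }
            positions.add(i);
            lastPos.put(w, i);

            // every target has been seen at least once
            if (positions.size() == lastPos.size()) {
                int len = i - positions.first() + 1;
                if (best < 0 || len < best) {
                    best = len;
                }
            }
        }
        return best;
    }
}
```

Calling `shortestWindow` with the example input above (split on spaces, punctuation dropped) and the four target words should give 4, matching the table.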

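Handling duplicates:

If duplicates are allowed among the words we are looking for (and the target sequence must match every occurrence), the algorithm above will not work as is. A simple fix is to give each distinct word its own heap of pointers into the original heap (the values in this heap being the same as the values in the original heap), with the maximum size of this heap being the number of times that word occurs among the words we are looking for.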
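One way to picture the duplicate handling is the sketch below. It is a simplification of mine rather than the exact per-word heap layout described above: each distinct word keeps a bounded queue with its most recent positions (as many as the word is required), and the window start is found by a linear scan over the oldest entry of each queue, so it runs in O(n·k) rather than O(n log k). The class and method names are hypothetical, and matching is case-sensitive.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DuplicateAwareWindow {

    /** Shortest window (in words) covering every target with its multiplicity, or -1. */
    static int shortestWindow(List<String> text, List<String> targets) {
        // word -> how many occurrences the window must contain
        Map<String, Integer> required = new HashMap<>();
        for (String t : targets) {
            required.merge(t, 1, Integer::sum);
        }
        // word -> its most recent positions, at most required.get(word) of them
        Map<String, Deque<Integer>> recent = new HashMap<>();
        for (String t : required.keySet()) {
            recent.put(t, new ArrayDeque<>());
        }

        int satisfied = 0;   // number of distinct words whose queue is full
        int best = -1;

        for (int i = 0; i < text.size(); i++) {
            Deque<Integer> q = recent.get(text.get(i));
            if (q == null) {
                continue;                    // not a target word
            }
            int need = required.get(text.get(i));
            if (q.size() == need) {
                q.pollFirst();               // forget the oldest occurrence
            } else if (q.size() == need - 1) {
                satisfied++;                 // this occurrence fills the quota
            }
            q.addLast(i);

            if (satisfied == required.size()) {
                // the window must start at the oldest position any word still needs
                int start = i;
                for (Deque<Integer> d : recent.values()) {
                    start = Math.min(start, d.peekFirst());
                }
                int len = i - start + 1;
                if (best < 0 || len < best) {
                    best = len;
                }
            }
        }
        return best;
    }
}
```

For example, with the text `a b a c a` and the targets `a, a, c`, the shortest window is `a c a`, so the method returns 3.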

score 4 · Accepted Answer

Here is an implementation I came up with.

//Implementing here with two List<String>
//Should be easy enough to use arrays, or streams, or whatever.
public static int getShortestSubseqWith(List<String> text, List<String> words) {
    int minDistance = Integer.MAX_VALUE;
    //Create a map of the last known position of each word
    Map<String, Integer> map = new HashMap<>();
    for (String word : words) {
        map.put(word, -1);
    }
    String word;
    //One loop through the main search string
    for (int position = 0; position < text.size(); position++){
        word = text.get(position);
        //If the current word found is in the list we're looking for
        if (map.containsKey(word)) {
            //Update the map
            map.put(word, position);
            //And if the current positions are the closest seen so far, update the min value.
            int curDistance = getCurDistance(map);
            if (curDistance < minDistance)
                minDistance = curDistance;
        }
    }
    return minDistance;
}

//Get the current distance between the last known position of each value in the map
private static int getCurDistance(Map<String, Integer> map) {
    int min = Integer.MAX_VALUE;
    int max = 0;
    for (Integer value : map.values()) {
        if (value == -1)
            return Integer.MAX_VALUE;
        else {
            max = Math.max(max,value);
            min = Math.min(min,value);
        }
    }
    return max - min;
}

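If hits are relatively sparse and the list of terms to search for is relatively small, the main performance cost here should just be the loop over the text being searched. If hits are very frequent, performance may suffer because getCurDistance runs more often.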

score 2 · Accepted Answer

Another approach could be to map each word in b[] to the indices at which it occurs in a[].

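Map<Integer, List<Integer>> occurrence = new HashMap<Integer, List<Integer>>();
List<String> bList = Arrays.asList(b); // needs java.util.Arrays
for(int idx = 0; idx < a.length; idx++)
{
  // index of the word a[idx] in b, or -1 if it doesn't exist
  int bIdx = bList.indexOf(a[idx]);

  if(bIdx >= 0)
  {
    List<Integer> bIdxOccurs = occurrence.get(bIdx);
    if(bIdxOccurs == null) // create the list on the first occurrence
    {
      bIdxOccurs = new ArrayList<Integer>();
      occurrence.put(bIdx, bIdxOccurs);
    }
    bIdxOccurs.add(idx);
  }
}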

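Then, from the map, find the combination of occurrences (one per word) whose indices are closest to each other. The naive approach would be to generate every combination and compare the distance between the lowest and highest index, but there may be a faster way. I'll have to think about that...

Finally, take every word from a[] that lies between the lowest and highest index found for the shortest sequence.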

score 1 · Accepted Answer

String[] a; // larger string
String[] b; // list of words to search

int index = -1;

for (int i = 0; i <= a.length - b.length; i++) // <= so the last possible window is checked too
{
    HashSet<String> set = new HashSet<String>(b.length);
    for (String s : b)
        set.add(s);

    boolean found = true;

    for (int j = 0; j < b.length; j++)
    {
        if (set.contains(a[i+j]))
            set.remove(a[i+j]);
        else
        {
            found = false;
            break;
        }
    }
    if (found)
    {
        index = i;
        break;
    }
}

If you can tolerate multiple instances of a given word, it becomes easier. This assumes every word in b is unique.

score 1 · Accepted Answer

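I think of this problem as a variant of the minimum window width problem. Here, the units are words instead of characters.

It is almost the same as the solution given by Dukeling. The only addition is the use of a LinkedHashMap to track the words in the order in which they were found. A Java solution can be found here.

Here is my Python implementation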


import collections
def minsubstring(sentence, words):
    sentence = sentence.split(' ')
    mintillnow = sentence
    words = set(words.split(' '))
    found = collections.defaultdict(lambda : [-1,-1])#position of word in the sentence and order of the word
    linked = [] # this together with 'found' provides the functionality of LinkedHashMap
    for i, word in enumerate(sentence):
        if word in words:
            found[word][0] = i
            if found[word][1] != -1:#if this word is already seen, remove it from linked list
                del(linked[found[word][1]])
            linked.append(word)#append the newly found word to the tail of the linked list
            # probably the worst part in this code, updating the indexes back to the map
            for i, wword in enumerate(linked):
                found[wword][1] = i
            # if found all the words, then check if the substring is smaller than the one till now and update
            if len(linked) == len(words):
                startPos = found[linked[0]][0]
                endPos = found[linked[-1]][0]
                if (endPos - startPos + 1) < len(mintillnow):
                    mintillnow = sentence[startPos:endPos + 1]
    return ' '.join(mintillnow)

Test result


>>> minsubstring('This is a test. This is a programming test. a programming test this is. ','this test a programming')
'a programming test this'
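For comparison, the LinkedHashMap idea could look roughly like this in Java. This is my own minimal sketch, not the linked solution: it relies on the fact that removing a key and putting it back moves it to the tail of an insertion-ordered `LinkedHashMap`, so the first entry is always the target word seen longest ago. The class and method names are invented, the target words are assumed to be distinct, and matching is case-sensitive.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Set;

final class LinkedWindow {

    /** Returns the shortest run of words containing all targets, or "" if there is none. */
    static String shortestWindow(List<String> text, Set<String> targets) {
        // word -> last position; iteration order = order in which words were last seen
        LinkedHashMap<String, Integer> lastSeen = new LinkedHashMap<>();
        int bestStart = -1;
        int bestEnd = -1;

        for (int i = 0; i < text.size(); i++) {
            String w = text.get(i);
            if (!targets.contains(w)) {
                continue;
            }
            lastSeen.remove(w);      // remove + put moves the word to the tail
            lastSeen.put(w, i);

            if (lastSeen.size() == targets.size()) {
                // the head of the map holds the smallest last position
                int start = lastSeen.values().iterator().next();
                if (bestStart < 0 || i - start < bestEnd - bestStart) {
                    bestStart = start;
                    bestEnd = i;
                }
            }
        }
        return bestStart < 0 ? "" : String.join(" ", text.subList(bestStart, bestEnd + 1));
    }
}
```

Since every step is a constant-time map operation (on average), this version avoids the index rewriting loop that the Python code above flags as its worst part.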

score 0 · Accepted Answer

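I think you could do this with a head pointer and a tail pointer: keep moving one of them inward until you no longer have a match, then do the same with the other, and repeat the whole process until neither can move inward any further. I might try to code it up later.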

score 0 · Accepted Answer

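I'll try to outline a more efficient algorithm.

Don't concatenate strings. Instead, count characters as you add words, i.e. length() + 1 for each word.

For a sublist, save the start word, the end word, and the character count.

When a shorter sublist is found, replace the above values.

Write a method that finds the first sublist beginning with a particular element and returns the above definition of the sublist (start, end, character count).

Call the above method with the first word. Save the values. Call the method with the start word + 1. Rinse and repeat, saving the shorter values whenever they are found.

You can even improve on this by using the fact that the first word in the sublist must be one of your search words. Starting from start + 1, you can simply look for that one element rather than all of them, since it is the only one missing (you still need to use all of them to find the first matching word). If you find it before the end word of the sublist, you have a smaller sublist. If you find it after the end word, that is the new end.

This is much more complex, but potentially much faster. A common trade-off.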
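A rough sketch of the basic version of this outline (without the final "look only for the missing element" refinement) might look like the following. The `Segment` holder and the method names are hypothetical; the input is assumed to be already split into words, and matching is case-sensitive.

```java
import java.util.HashSet;
import java.util.Set;

final class SegmentSearch {

    /** Hypothetical holder for a candidate sublist: word indices plus character count. */
    static final class Segment {
        final int start;
        final int end;
        final int chars;

        Segment(int start, int end, int chars) {
            this.start = start;
            this.end = end;
            this.chars = chars;
        }
    }

    /** First sublist beginning at 'start' that contains every target, or null if none. */
    static Segment firstFrom(String[] words, Set<String> targets, int start) {
        Set<String> missing = new HashSet<>(targets);
        int chars = 0;
        for (int end = start; end < words.length; end++) {
            chars += words[end].length() + 1;   // word plus separator, no concatenation
            missing.remove(words[end]);
            if (missing.isEmpty()) {
                return new Segment(start, end, chars);
            }
        }
        return null;
    }

    /** Tries every start position that holds a target word and keeps the shortest sublist. */
    static Segment shortest(String[] words, Set<String> targets) {
        Segment best = null;
        for (int start = 0; start < words.length; start++) {
            if (!targets.contains(words[start])) {
                continue;        // a shortest sublist always begins with a target word
            }
            Segment s = firstFrom(words, targets, start);
            if (s == null) {
                break;           // no later start can contain all targets either
            }
            if (best == null || s.chars < best.chars) {
                best = s;
            }
        }
        return best;
    }
}
```

The words of the result can then be read straight from the array between `start` and `end`, so no intermediate strings are built while searching.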

score 0 · Accepted Answer

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class MaxStringWindow {

    private MaxStringWindow() {}

    private static void addStringCount(Map<String, Integer> map, String str) {
        if (!map.containsKey(str)) {
            map.put(str, 1);
        } else {
            int val = map.get(str);
            map.put(str, val + 1);
        }
    }

    private static Map<String, Integer> toFindMap(List<String> strList) {
        final Map<String, Integer> toFind  = new HashMap<String, Integer>();
        for (String stri : strList) {
            addStringCount(toFind, stri);
        }
        return toFind;
    }


    public static int minWindowSize(String sentence, List<String> strList) {
        final Map<String, Integer> toFind = toFindMap(strList);
        final Map<String, Integer> hasFound  = new HashMap<String, Integer>();

        int matchCtr = 0; // number of required occurrences currently covered by the window
        boolean matchFound = false;

        int j = 0; // the trailing position of the sliding window
        int i = 0; // the leading position of the sliding window.

        int min = Integer.MAX_VALUE;

        String[] words = sentence.split(" "); 

        for (i = 0; i < words.length; i++) {

            if (!toFind.containsKey(words[i])) {
                continue;
            }

            if (!matchFound) {
                matchFound = true;
                j = i;  
            }

            addStringCount(hasFound, words[i]);

            // count this occurrence only if it was still needed; surplus repeats must not count.
            if (hasFound.get(words[i]) <= toFind.get(words[i])) {
                matchCtr++;
            }

            // advance the left pointer past non-target words and surplus occurrences,
            // such that the window (i-j) is as small as possible.
            while (!toFind.containsKey(words[j]) || hasFound.get(words[j]) > toFind.get(words[j])) {
                if (hasFound.containsKey(words[j])) {
                    int val = hasFound.get(words[j]);
                    hasFound.put(words[j], val - 1);
                } 
                j++;
            }

            // check if match has been completed, measuring the window after it was shrunk.
            if (matchCtr >= strList.size()) {
                if ((i - j + 1) < min) {
                    min = i - j + 1;
                }
            }
        }


        if (matchCtr < strList.size()) {
            throw new IllegalArgumentException("The subset is not found in the input string.");
        }

        return min;
    }

}
