algorithm - Length of the shortest phrase containing given keywords in a large text

Question

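This question was put to a friend of mine in an interview.

Given two keywords, we have to find the length of the shortest phrase containing the given keywords in a large text. The keywords may appear in any order in the text. Constraint: maintain an efficient data structure so that the text does not have to be parsed again for every query with different keywords.

For example. Keywords: "one", "four". Text: "one two three four five four six one"

Here the shortest phrase is "four six one", not "one two three four".

The solution we thought of was: build a BST with all the words of the text. Each node maintains the positions of its word (this will be a sorted list). When a query comes in, search for both words [O(log n)], then find the minimum difference between their positions in [O(n)], thus making it effectively [O(n log n)].

Can we do better?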

score 2 · Accepted Answer

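You could use a hash table for an inverted index, i.e. a hash table from words (keywords) to sorted lists of their positions in the text. When you get the two keywords of a query, looking up their occurrence records is then an O(1) operation.

Finding the minimum difference between the positions of occurrence is an O(k) operation, where k is the length of the longer occurrence list. In unusual cases k may be close to n, but in realistic use it won't be (unless you use "the" and "a" as the two keywords, but words of that kind, called stop words, are usually excluded from full-text search anyway).

In a typical setting k is very small compared with n, so this should be very fast: O(1) + O(number of occurrences of the more common keyword).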
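A minimal sketch of this idea in Python, assuming the text has already been split into words. The names `build_index` and `shortest_span` are made up for illustration, and the merge-style walk over the two sorted position lists is one possible way to get the O(k) step described above:

```python
from collections import defaultdict


def build_index(words):
    """Map each word to the ascending list of positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word].append(pos)  # positions arrive in order, so lists stay sorted
    return index


def shortest_span(index, a, b):
    """Return (start, end) of the shortest span containing a and b, or None."""
    pa, pb = index.get(a, []), index.get(b, [])
    best = None
    i = j = 0
    # Merge-style walk: always advance the pointer at the smaller position.
    while i < len(pa) and j < len(pb):
        lo, hi = sorted((pa[i], pb[j]))
        if best is None or hi - lo < best[1] - best[0]:
            best = (lo, hi)
        if pa[i] < pb[j]:
            i += 1
        else:
            j += 1
    return best


words = "one two three four five four six one".split()
index = build_index(words)           # built once, reused for every query
span = shortest_span(index, "one", "four")
print(" ".join(words[span[0]:span[1] + 1]))  # four six one
```

The index is built once in O(n); each query then touches only the two position lists, so the text itself is never re-parsed.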

score 1 · Accepted Answer

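It looks like this can be solved using Dynamic Programming. Without loss of generality, I can rephrase the problem as:

Given a search space S = {s1, s2, ..., sn} and a needle pair (si, sj), we have to find a pair of positions (k, l) such that:

(sk, sl) == (si, sj)

distance(k, l) is minimal.

A recursive solution to the problem can be formulated as follows: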

Cost(m) =

LARGEST_NUMBER, if m = 0

Min (Cost(m-1), distance(S[m], Latest_si)), if m > 0 and S[m] == sj

Min (Cost(m-1), distance(S[m], Latest_sj)), if m > 0 and S[m] == si

Cost(m-1), if m > 0 and S[m] != (si, sj)

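where,

Cost(m) is the optimization function. It represents the minimum distance between (si, sj) in the search space S[1:m].
Latest_si is the latest position of si.
Latest_sj is the latest position of sj.

This can be converted into an O(n) bottom-up loop, with O(n) space to store Cost.

Here is an implementation of the above algorithm in Python: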

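def min_phrase(S, si, sj):
  # Cost[i] = [best distance within S[0..i], [start, end] of that phrase].
  # Every entry starts at the sentinel [len(S), [-1, -1]], so the
  # Cost[idx - 1] lookup at idx == 0 (which wraps to Cost[-1]) is safe.
  Cost = [[len(S), [-1, -1]] for _ in S]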

  latest_si = -1
  latest_sj = -1

  for idx, v in enumerate(S):
    if v == si:
      if latest_sj >=0:
        cost = idx - latest_sj
        if cost < Cost[idx - 1][0]:
          Cost[idx] = [cost, [latest_sj, idx]]
        else:
          Cost[idx] = Cost[idx - 1]
      else:
        Cost[idx] = Cost[idx - 1]

      latest_si = idx

    elif v == sj:
      if latest_si >=0:
        cost = idx - latest_si
        if cost < Cost[idx - 1][0]:
          Cost[idx] = [cost, [latest_si, idx]]
        else:
          Cost[idx] = Cost[idx - 1]
      else:
        Cost[idx] = Cost[idx - 1]

      latest_sj = idx

    else:
      Cost[idx] = Cost[idx - 1]

  return Cost[len(S) - 1]


if __name__ == '__main__':
  S = ("one", "two", "three", "four", "five", "four", "six", "one")
  si = "one"
  sj = "four"

  result = min_phrase(S, si, sj)
  if result[1][0] == -1 or result[1][1] == -1:
    print("No solution found")
  else:
    print("Cost: {0}".format(result[0]))
    print("Phrase: {0}".format(" ".join(S[result[1][0] : result[1][1] + 1])))

score 0 · Accepted Answer

First split the text into phrases. Assign a number to each of these phrases. Now each word in the text is present in some of these phrases. Put the phrase lengths in an array. Put the words in a hash table, with the numbers of the phrases in which they are present as an ordered array.

Now when we want the shortest phrase containing two words, first get the two phrase arrays for the words, then do a set intersection, then look up the phrase lengths for the resulting phrase numbers. Pick the shortest.
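A rough sketch of this in Python. It assumes that a "phrase" means a sentence ending in `.`, `!` or `?`, which the answer does not specify, and it stores the phrase numbers in sets rather than ordered arrays so the intersection is a single operation. The function names are hypothetical:

```python
import re
from collections import defaultdict


def build_phrase_index(text):
    """Split text into phrases; return (phrases, lengths, word -> phrase numbers)."""
    # Assumption: phrases are delimited by '.', '!' or '?'.
    phrases = [p.split() for p in re.split(r"[.!?]", text) if p.strip()]
    lengths = [len(p) for p in phrases]
    postings = defaultdict(set)
    for number, phrase in enumerate(phrases):
        for word in phrase:
            postings[word].add(number)
    return phrases, lengths, postings


def shortest_common_phrase(lengths, postings, a, b):
    """Return the number of the shortest phrase containing both words, or None."""
    common = postings.get(a, set()) & postings.get(b, set())
    if not common:
        return None
    return min(common, key=lambda n: lengths[n])


text = "one two three four. five four six one. four one."
phrases, lengths, postings = build_phrase_index(text)
best = shortest_common_phrase(lengths, postings, "one", "four")
print(" ".join(phrases[best]))  # four one
```

Note that this finds the shortest pre-delimited phrase containing both words, which is a different reading of the problem from the shortest span between two keyword positions.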

score 0 · Accepted Answer

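I might be missing the point, but this looks like it can be done in O(n) with a simple array, String[] text, rather than some fancy data structure.

1* Load the text into an array.
2* Find the position of keyword x and keep track of it.
3* Find the position of keyword y and keep track of it.
4* Note the distance between x and y.
5* The first time, set minx = x and miny = y.
6* Keep looking for x and y, alternating, and update minx and miny each time a new, smaller distance is found.
7* Finally, return the substring bounded by minx and miny.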
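The steps above can be sketched as a single pass in Python. This is one possible reading of them, assuming whitespace-separated words and two distinct keywords; `shortest_phrase` is a made-up name, and the best bounds are kept as a (start, end) pair rather than as separate minx and miny variables:

```python
def shortest_phrase(text, x, y):
    """Return the shortest run of words containing both x and y, or None."""
    words = text.split()      # step 1: load the text into an array
    last_x = last_y = None    # most recent positions of x and y
    best = None               # (start, end) of the shortest span so far
    for pos, word in enumerate(words):
        if word == x:
            last_x = pos
        elif word == y:
            last_y = pos
        else:
            continue          # not a keyword, nothing to update
        if last_x is not None and last_y is not None:
            lo, hi = min(last_x, last_y), max(last_x, last_y)
            if best is None or hi - lo < best[1] - best[0]:
                best = (lo, hi)
    if best is None:
        return None
    return " ".join(words[best[0]:best[1] + 1])


print(shortest_phrase("one two three four five four six one", "one", "four"))
# four six one
```

Each query scans the whole text again, so this is O(n) per query and does not by itself address the constraint in the question about avoiding a re-parse for every keyword pair.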
