algorithm - Split a string to a string of valid words using Dynamic Programming

Question

I need to find a dynamic programming algorithm to solve this problem. I tried but couldn't figure it out. Here is the problem:

You are given a string of n characters s[1...n], which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like "itwasthebestoftimes..."). You wish to reconstruct the document using a dictionary, which is available in the form of a Boolean function dict(*) such that, for any string w, dict(w) has value 1 if w is a valid word, and has value 0 otherwise.

Give a dynamic programming algorithm that determines whether the string s[*] can be reconstituted as a sequence of valid words. The running time should be at most O(n^2), assuming that each call to dict takes unit time.
In the event that the string is valid, make your algorithm output the corresponding sequence of words.

score 9 · Accepted Answer

让压缩文档的长度为 N。

令 b(n) 为布尔值：如果文档可以从文档中的位置 n 开始拆分为单词，则为 true。

b(N) 为真（因为空字符串可以拆分为 0 个单词）。给定 b(N), b(N - 1), ... b(N - k)，您可以通过考虑所有以字符 N - k - 1 开头的单词来构造 b(N - k - 1)。如果有任何这样的词 w，设置 b(N - k - 1 + len(w))，然后将 b(N - k - 1) 设置为真。如果没有这样的词，则将 b(N - k - 1) 设置为 false。

最终，您计算 b(0)，它会告诉您整个文档是否可以拆分为单词。

在伪代码中：

def try_to_split(doc):
  N = len(doc)
  b = [False] * (N + 1)
  b[N] = True
  for i in range(N - 1, -1, -1):
    for word starting at position i:
      if b[i + len(word)]:
        b[i] = True
        break
  return b

您可以采取一些技巧来提高“从位置 i 开始的单词”的效率，但系统会要求您使用 O(N^2) 算法，因此您可以在字典中查找从 i 开始的每个字符串。

要生成单词，您可以修改上述算法以存储好单词，或者像这样生成它：

def generate_words(doc, b, idx=0):
  length = 1
  while true:
    assert b(idx)
    if idx == len(doc): return
    word = doc[idx: idx + length]
    if word in dictionary and b(idx + length):
       output(word)
       idx += length
       length = 1

这里 b 是从算法的第一部分生成的布尔数组。

score 6 · Accepted Answer

正式确定@MinhPham 的建议。

这是一个动态编程解决方案。

给定一个字符串 str，让

b[i] = true 如果子字符串 str[0...i]（包括）可以拆分为有效单词。

在 str 中添加一些起始字符，例如 !，以表示空字。str = "！" + 字符串

基本情况是空字符串，所以

b[0] = 真。

对于迭代情况：

b[j] = true 如果 b[i] == true 并且 str[i..j] 是所有 i < j 的词

score 1 · Accepted Answer

c++中的dp解决方案：

int main()
{
    set<string> dict;
    dict.insert("12");
    dict.insert("123");
    dict.insert("234");
    dict.insert("12345");
    dict.insert("456");
    dict.insert("1234");
    dict.insert("567");
    dict.insert("123342");
    dict.insert("42");
    dict.insert("245436564");
    dict.insert("12334");

    string str = "123456712334245436564";

    int size = str.size();
    vector<int> dp(size+1, -1);
    dp[0] = 0;
    vector<string > res(size+1);
    for(int i = 0; i < size; ++i)
    {
        if(dp[i] != -1)
        {
            for(int j = i+1; j <= size; ++j)
            {
                const int len = j-i;
                string substr = str.substr(i, len);
                if(dict.find(substr) != dict.end())
                {
                    string space = i?" ":"";
                    res[i+len] = res[i] + space + substr;
                    dp[i+len] = dp[i]+1;
                }
            }
        }
    }
    cout << *dp.rbegin() << endl;
    cout << *res.rbegin() << endl;

    return 0;
}

score 1 · Accepted Answer

Dp 很清楚，O(N^2)但是如果您知道字典中的单词，我认为您可以使用一些预计算来更快地获取O(N). 阿霍-科拉西克

score 0 · Accepted Answer

字符串 s[] 可能被拆分为不止一种方式。下面的方法找到我们可以拆分 s[] 的最大单词数。下面是算法的草图/伪代码

bestScore[i] -> 存储可以拆分前 i 个字符的最大单词数（否则为 MINUS_INFINITY）

for (i = 1 to n){
     bestScore[i] = MINUS_INFINITY
     for (k = 1 to i-1){
        bestScore[i] = Max(bestSCore[i], bestScore[i-k]+ f(i,k))
     }
 }

其中 f(i,k) 定义为：

f(i,k) = 1 : if s[i-k+1 to i] is in dictionary
       = MINUS_INFINITY : otherwise

bestScore[n] 将存储 s[] 可以拆分的最大单词数（如果值为 MINUS_INFINIY，则 s[] 不能拆分）

显然运行时间是 O(n^2)

由于这看起来像是一个教科书练习，我不会编写代码来重建实际的分割位置。

score 0 · Accepted Answer

下面是这个问题的 O(n^2) 解决方案。

void findstringvalid() {
string s = "itwasthebestoftimes";
set<string> dict;
dict.insert("it");
dict.insert("was");
dict.insert("the");
dict.insert("best");
dict.insert("of");
dict.insert("times");

vector<bool> b(s.size() + 1, false);
vector<int> spacepos(s.size(), -1);
//Initialization phase
b[0] = true; //String of size 0 is always a valid string
for (int i = 1; i <= s.size(); i++) {
    for (int j = 0; j <i; j++) {
       //string of size s[ j... i]
       if (!b[i]) {
           if (b[j]) {
              //check if string "j to i" is in dictionary
              string temp = s.substr(j, i - j);
              set<string>::iterator it = dict.find(temp);
              if (it != dict.end()) {
                  b[i] = true;
                  spacepos[i-1] = j;
              }
           }
        }
    }
}
if(b[s.size()])
    for (int i = 1; i < spacepos.size(); i++) {
        if (spacepos[i] != -1) {
            string temp = s.substr(spacepos[i], i - spacepos[i] + 1);
            cout << temp << " ";
    }
    }
}

algorithm - Split a string to a string of valid words using Dynamic Programming

6 回答 6

Related

Reference