c - 最长公共子序列：为什么这是错误的？

Question

int lcs(char * A, char * B)
{
  int m = strlen(A);
  int n = strlen(B);
  int *X = malloc(m * sizeof(int));
  int *Y = malloc(n * sizeof(int));
  int i;
  int j;
  for (i = m; i >= 0; i--)
    {
      for (j = n; j >= 0; j--)
        {
          if (A[i] == '\0' || B[j] == '\0') 
              X[j] = 0;
          else if (A[i] == B[j]) 
              X[j] = 1 + Y[j+1];
          else 
              X[j] = max(Y[j], X[j+1]);
        }
      Y = X;
    }
  return X[0];
}

这行得通，但 valgrind 大声抱怨无效读取。我是怎么弄乱记忆的？对不起，我总是在 C 内存分配上失败。

score 2 · Accepted Answer

这里的问题是你的桌子的大小。请注意，您将空间分配为

int *X = malloc(m * sizeof(int));
int *Y = malloc(n * sizeof(int));

但是，您使用的是索引 0 ... m 和 0 ... n，这意味着 X 中需要 m + 1 个插槽，Y 中需要 n + 1 个插槽。

尝试将其更改为阅读

int *X = malloc((m + 1) * sizeof(int));
int *Y = malloc((n + 1) * sizeof(int));

希望这可以帮助！

score 1 · Accepted Answer

注意：我通常不会写两个答案，如果您觉得它很俗气，请随时对此发表评论并注意投票。这个答案是一个更优化的解决方案，但我想先给出我能想到的最直接的一个，然后把它放在另一个答案中，以免混淆两者。基本上它们是针对不同的受众的。

有效解决此问题的关键是在查找较长的公共子字符串时不要丢弃有关较短公共子字符串的信息。天真地，您检查每个子字符串与另一个子字符串，但如果您知道“AB”与“ABC”匹配，并且您的下一个字符是 C，请不要检查“ABC”是否在“ABC”中，只需检查“AB”之后的位置是“C”。

对于 A 中的每个字符，您必须检查 B 中的所有字母，但是因为一旦不再可能出现更长的子字符串，我们就会停止查看 B，这极大地限制了检查的次数。每次你在前面得到一个更长的匹配，你就消除了后端的检查，因为它不再是一个更长的子字符串。

例如，如果 A 和 B 都很长，但不包含公共字母，则 A 中的每个字母将与 B 中的每个字母进行比较，运行时间为 A*B。

对于有很多匹配项但匹配长度不是较短字符串长度的很大一部分的序列，您可以使用 A * B 组合来检查两个字符串中较短的一个（A 或 B）前导到 A*B*A 或 A*B*B，对于相似长度的字符串，这基本上是 O(n^3) 时间。我真的认为这个解决方案中的优化会比 n^3 更好，即使有三重嵌套的 for 循环，但它似乎不是我能说的最好的。

不过，我正在考虑更多。找到的子字符串不是字符串长度的重要部分，在这种情况下，优化不会做太多，但是 A*B 的每个组合的比较不会随着 A 或 B 缩放并退出是常数 - 或者 - 它们是 A 和 B 的重要部分，它直接除以必须比较的 A*B 组合。

我可能会问这个问题。

int lcs(char * A, char * B)
{
  int length_a = strlen(A);
  int length_b = strlen(B);

  // these hold the position in A of the longest common substring
  int longest_length_found = 0;

  // for each character in one string (doesn't matter which), look for incrementally larger strings in the other
  for (int a_index = 0; a_index < length_a - longest_length_found; a_index++) {
    for (int b_index = 0; b_index < length_b - longest_length_found; b_index++) {

       // offset into each string until end of string or non-matching character is found
      for (int offset = 0; A[a_index+offset] != '\0' && B[b_index+offset] != '\0' && A[a_index+offset] == B[b_index+offset]; offset++) {          
        longest_length_found = longest_length_found > offset ? longest_length_found : offset;
      }
    }
  }
  return longest_found_length;
}

score 1 · Accepted Answer

系列问题。首先，正如 templatetypedef 所说，您分配不足。

然后，正如 paddy 所说，您并没有释放 malloc 的内存。如果您需要该Y=X行，则需要将原始 malloc 的空间地址存储在另一组变量中，以便您可以调用free它们。

...mallocs...
int * original_y = Y;
int * original_x = X;
...body of code...
free(original_y);
free(original_x);
return X[0];

但这并没有解决您的新问题，这就是为什么代码实际上不起作用？

我承认我无法遵循您的代码（没有更多的研究），但我可以提出一种可行且更易于理解的算法。这可能有点伪代码并且不是特别有效，但第一步是正确的。我稍后列出了一些优化。

int lcs(char * A, char * B)
{
  int length_a = strlen(A);
  int length_b = strlen(B);


  // these hold the position in A of the longest common substring
  int longest_found_length = 0;

  // go through each substring of one of the strings (doesn't matter which, you could pick the shorter one if you want)
  char * candidate_substring = malloc(sizeof(char) * length_a + 1);
  for (int start_position = 0; start_position < length_a; start_position++) {
    for (int end_position = start_position; end_position < length_a; end_position++) {

       int substring_length = end_position - start_position + 1;

       // make a null-terminated copy of the substring to look for in the other string
       strncpy(candidate_substring, &(A[start_position]), substring_length);
       if (strstr(B, candidate_substring) != NULL) {
         longest_found_length = substring_length;
       }
    }

  }
  free(candidate_substring);
  return longest_found_length;
}

你可以做一些不同的优化：

       // if this can't be longer, then don't bother checking it.  You can play games with the for loop to not have this happen, but it's more complicated.
       if (substring_length <= longest_found_index) {
         continue;
       }

和

       // there are more optimizations you could do to this, but don't check
       //   the substring if it's longer than b, since b can't contain it.
       if (substring_length > length_b) {
         continue;
       }

和

   if (strstr(B, candidate_substring) != NULL) {
     longest_found_length = end_position - start_position + 1;
   } else {
     // if nothing contains the shorter string, then nothing can contain the longer one, so skip checking longer strings with the same starting character
     break; // skip out of inner loop to next iteration of start_position
   }

您可以使用和字符进行字符交换，而不是将每个候选子字符串复制到新end_position + 1字符串NUL。然后，在 b 中查找该子字符串后，将原始字符end_position+1换回。这会快得多，但会使实现稍微复杂一些。

score 0 · Accepted Answer

除了 templatetypedef 说的，还有几点需要考虑：

为什么X和Y大小不一样？
你为什么要这样做Y = X？这是指针的分配。你也许是说memcpy(Y, X, (n+1)*sizeof(int))？

c - 最长公共子序列：为什么这是错误的？

4 回答 4

Related

Reference