c++ - Finding the K th lexicographically substring of a given string when the duplicate substrings are allowed

Question

I want to find lexicographically Kth smallest substring of a given string when the duplicate substrings are allowed.

Suppose we are given a string abc then its substrings in lexicographical order are {a,ab,abc,b,c}, now suppose we are given K = 3 then ans is abc.

Now suppose we are given string aaa then all its substrings are {a,a,a,aa,aaa,aa} so now if K = 4 then we output aa.

However I came across the following code on codeforces but i am not able to understand it. any help is greatly appreciated.

char s [MaxN];
bool b [MaxN];
int k, n;

void solve (vector <int> v)
{
   int i, j;
   int64 p, q;
   char c;

   vector <int> w;
   for (c = 'a'; c <= 'z'; c++)
   {
       p = q = 0;
       w.clear ();
       for (j = 0; j < (int) v.size (); j++)
       {
           i = v[j];
           if (s[i] == c)
           {
               w.push_back (i + 1);
               p++;
               q += n - i;
           }
      }
      if (k < q)
          break;
      k -= q;
   }
   assert (c <= 'z');

   putchar (c);
   if (k < p)
       return;
   k -= p;
   solve (w);
}

int main (void)
{
    int i;

    while (scanf (" %s %d", s, &k) != EOF)
    {
         n = (int) strlen (s);
         if (k > ((((int64) n) * (int64) (n + 1)) >> 1LL))
         {
             printf ("No such line.\n");
             continue;
         }
         k--;

         vector <int> v;
         for (i = 0; i < n; i++)
         v.push_back (i);
         solve (v);
         putchar ('\n');
     }
     return 0;
}

Here is the link to question http://codeforces.com/problemset/problem/128/B

score 2 · Accepted Answer

Shubajit Saha 使用 Suffix Array 的方法是不正确的。考虑 string abaab，其排序后的后缀是（从 0 开始）：

2. aab
3. ab
0. abaab
4. b
1. baab

据他说，aab从 suffix 获得的2. aab子串小于a从获得的子串3. ab，这显然是错误的！

正确的方法是使用后缀自动机 + 动态编程。你可以在CP-Algorithm了解它。这是一篇写得很好的关于 Suffix Automaton 的教程。

按字典顺序排列的第 k 个子串对应于后缀自动机中按字典顺序排列的第 k 个路径。因此，在计算每个状态的路径数之后，我们可以很容易地从自动机的根开始搜索第 k 条路径。请注意，此问题允许重复的子字符串，因此不要用 1 初始化每个状态，而是应该使用以该状态结尾的子字符串的数量进行初始化。为此，请参阅Number of occurrences教程中的部分。

这种方法的一个优点是时间复杂度不取决于 k，它仅取决于子字符串的长度（最多为 n）。因此，k 可以远大于问题中的约束（1e10 很好）。

这是我的 C++ 代码：https ://codeforces.com/contest/128/submission/72601519

时间复杂度为O(nlgn)。虽然构建后缀自动机并遍历它O(n)，但我们需要按状态的长度对状态进行排序。我使用 STD 排序，即O(nlgn). 您可以使用计数排序来使其线性化。

score 1 · Accepted Answer

这个问题的标准方法是构造一个给定字符串的后缀数组。后缀数组以字典顺序为我们提供给定字符串的所有后缀。一旦我们拥有给定字符串的所有排序后缀，就会观察到字符串的每个子字符串都是某个后缀的前缀。如果我们按字典序升序遍历每个大小为 l 的后缀的后缀，则得到 l 个子字符串，每个子字符串都小于从字典序上大于我们正在考虑的后缀的后缀获得的子字符串. 因此，我们可以通过腌制子串的数量来轻松找到第 K 个最小的子串。

实际上，后缀数组用于解决这个问题的一个更困难的版本，在该版本中，您将获得 Q 个查找第 K 个子字符串的查询，其中每个查询中的 K 是不同的。

关于您在问题中的代码，由于您只需找到第 k 个子字符串一次，因此解决方案方法本质上使用相同的逻辑但具有不同的实现。

c++ - Finding the K th lexicographically substring of a given string when the duplicate substrings are allowed

2 回答 2

Related

Reference