algorithm - 查找有多少子字符串的第一个和最后一个字符在内部重复的快速方法

Question

这是关于我创建的子字符串的问题。我想知道如何O(nlog(n))解决这个问题，因为天真的方法很容易。这是怎么回事。你有一个字符串S。S有很多子串。在某些子字符串中，第一个字符和最后一个字符不止一次出现。找出第一个和最后一个字符不止一次的子串。

Input: "ABCDCBE"
Expected output: 2
Explanation: "BCDCB" and "CDC" are two such substrings

该测试用例解释只有“BCDCB”和“CDC”，其中第一个和最后一个字符相同。

除了示例情况之外，可能还有另一种情况，其中“ABACAC”是第一个字符“A”出现 3 次，最后一个字符“C”出现两次的子字符串。“AAAABB”也是另一个子字符串。

“AAAAB”不满足。

我所了解到的O(nlog(n))可能会或可能不会有助于解决方案的是二叉索引树。二叉索引树可以以某种方式用来解决这个问题。还有排序和二分搜索，但首先我想特别关注二叉索引树。

我正在寻找一个O(n log(n))或更好的空间复杂度。

字符也是 UTF-16

score 1 · Accepted Answer

我的解决方案的要点如下：

遍历输入数组，并为每个位置计算以该位置结尾的“有效”子字符串的数量。这些值的总和是有效子字符串的总数。我们通过使用二叉索引树计算当前位置之前的子字符串的有效开始数量来实现这一点。

现在了解完整的细节：

当我们遍历数组时，我们将当前元素视为子字符串的结尾，并且我们说作为有效开始的位置是这样的位置，即它的值再次出现在它和我们当前迭代的位置之间。（即如果子字符串开头的值至少出现两次）

例如：

current index              V
data  = [1, 2, 3, 4, 1, 4, 3, 2]
valid = [1, 0, 1, 1, 0, 0, 0, 0]
         0  1  2  3  4  5  6  7

第一个1（在 index 处0）是一个有效的开始，因为在它之后但在当前索引（index ）之前还有另一个1（在 index 处）。46

现在，计算当前索引之前出现的有效开始的数量给了我们非常接近我们想要的东西，除了我们可能会抓取一些子字符串的最后一个值没有两次出现的子字符串（即我们目前正在迭代）

例如：

current index              V
data  = [1, 2, 3, 4, 1, 4, 3, 2]
valid = [1, 0, 1, 1, 0, 0, 0, 0]
         0  1  2  3  4  5  6  7
                  ^--------^

在这里，4 被标记为有效的开始（因为4在它之后还有另一个），但对应的子字符串没有两个3s。

为了解决这个问题，我们将只考虑有效启动到当前值的先前出现。（这意味着子字符串将包含当前值及其先前的外观，因此，最后一个元素将至少两次出现在子字符串中）

伪代码如下：

fn solve(arr) {
  answer := 0
  for i from 1 to length(arr) {
    previous_index := find_previous(arr, i)

    if there is a previous_index {
      arr[previous_index].is_valid_start = true
      answer += count_valid_starts_up_to_and_including(arr, previous_index)
    }
  }
  return answer
}

为了有效地实现这些操作，我们使用哈希表来查找值的先前位置，并使用二进制索引树 (BIT) 来跟踪和计算有效位置。

因此，更充实的伪代码看起来像

fn solve(arr) {
  n := length(arr)

  prev := hash_table{}
  bit  := bit_indexed_tree{length = n}

  answer := 0
  for i from 1 to length(arr) {
    value := arr[i]
    previous_index := prev[value]

    if there is a previous_index {
      bit.update(previous_index, 1)
      answer += bit.query(previous_index)
    }

    prev[value] = i
  }
  return answer
}

最后，由于伪代码并不总是足够的，这里有一个 C++ 实现，其中控制流有点混乱，以确保有效使用std::unordered_map（C++ 的内置哈希表）

class Bit { 
    std::vector<int> m_data;
public:
    // initialize BIT of size `n` with all 0s
    Bit(int n);

    // add `value` to index `i`
    void update(int i, int value);

    // sum from index 0 to index `i` (inclusive)
    int query(int i);
};

long long solve (std::vector<int> const& arr) {
    int const n = arr.size();

    std::unordered_map<int, int> prev_index;
    Bit bit(n);

    long long answer = 0;
    int i = 0;
    for (int value : arr) {

        auto insert_result = prev_index.insert({value, i});
        if (!insert_result.second) { // there is a previous index
            int j = insert_result.first->second;

            bit.update(j, 1);
            answer += bit.query(j);

            insert_result.first->second = i;
        }

        ++i;
    }

    return answer;
}

编辑：为了透明，这里是我用来测试这段代码的 Fenwick 树实现

struct Bit {
    std::vector<int> m_data;
    Bit(int n) : m_data(n+2, 0) { }
    int query(int i) {
        int res = 0;
        for(++i; i > 0; i -= i&-i) res += m_data[i];
        return res;
    }
    void update(int i, int x) {
        for(++i; i < m_data.size(); i += i&-i) m_data[i] += x;
    }
};

algorithm - 查找有多少子字符串的第一个和最后一个字符在内部重复的快速方法

1 回答 1

Related

Reference