c++ - How to find duplicates in a std::vector and return a list of them?

Question

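So, if I have a vector of words like:

Vec1 = "words", "words", "are", "fun", "fun"

Resulting list: "fun", "words"

I am trying to determine which words are duplicated and return an alphabetized vector containing one copy of each. My problem is that I don't even know where to start; the only thing close to it that I found was std::unique_copy, which doesn't quite do what I need. Specifically, I am inputting a std::vector<std::string> but outputting a std::list<std::string>. If needed, I can use a functor.

Could someone at least push me in the right direction, please? I already tried reading the STL documentation, but I am just "brain" blocked right now.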

score 7 · Accepted Answer

In 3 lines (not counting the creation of the vector and list, nor the superfluous line breaks added in the name of readability):

vector<string> vec{"words", "words", "are", "fun", "fun"};
list<string> output;

sort(vec.begin(), vec.end());
set<string> uvec(vec.begin(), vec.end());
set_difference(vec.begin(), vec.end(),
               uvec.begin(), uvec.end(),
               back_inserter(output));

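Edit

Explanation of the solution:

The vector needs to be sorted so that set_difference() can be used later.
The uvec set automatically keeps its elements sorted and eliminates duplicates.
The output list is populated with the elements of vec - uvec.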
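For readers who want to try this, here is a minimal self-contained sketch of the same approach wrapped in a function. The name duplicates_via_set_difference is made up for illustration and is not part of the answer. One point worth flagging: std::set_difference treats its inputs as multisets, so a word that occurs three times in vec would land in output twice. The sketch therefore adds a call to output.unique(), which is an addition to the answer's code, to collapse such repeats.

```cpp
#include <algorithm>
#include <iterator>
#include <list>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper name. The vector is taken by value so the caller's
// copy is not reordered by the sort.
std::list<std::string> duplicates_via_set_difference(std::vector<std::string> vec) {
    std::list<std::string> output;

    // set_difference requires both input ranges to be sorted.
    std::sort(vec.begin(), vec.end());

    // The set holds exactly one copy of each distinct word, in sorted order.
    std::set<std::string> uvec(vec.begin(), vec.end());

    // Removing one copy of each word leaves only the extra occurrences.
    std::set_difference(vec.begin(), vec.end(),
                        uvec.begin(), uvec.end(),
                        std::back_inserter(output));

    // A word seen n times appears n-1 times here; keep a single copy of each.
    output.unique();
    return output;
}
```

Because both inputs are sorted, the resulting list is already in alphabetical order and needs no further sorting.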

score 6 · Accepted Answer

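Make an empty std::unordered_set<std::string>
Iterate over your vector, checking whether each item is a member of the set
If it's already in the set, this is a duplicate, so add it to your result list
Otherwise, add it to the set.

Since you want each duplicate listed only once in the results, you can use a hash set (rather than a list) for the results as well.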
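The steps above could be sketched roughly as follows. This is only an illustration: the function name find_duplicates and the variable names seen and duplicates are assumptions, not something the answer specifies.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper name, used only for this illustration.
std::unordered_set<std::string> find_duplicates(const std::vector<std::string>& words) {
    std::unordered_set<std::string> seen;        // every word met so far
    std::unordered_set<std::string> duplicates;  // words met more than once

    for (const std::string& w : words) {
        if (seen.count(w) != 0) {
            duplicates.insert(w);  // already seen, so it is a duplicate
        } else {
            seen.insert(w);        // first occurrence
        }
    }
    return duplicates;
}
```

A hash set has no defined iteration order, so to get the alphabetized std::list the question asks for, the result would still need to be copied into a list and sorted, for example with std::list::sort.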

score 5 · Accepted Answer

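IMO, Ben Voigt started with a good basic idea, but I would caution against taking his wording too literally.

In particular, I dislike the idea of searching for the string in the set, then adding it to the set if it's not present, and adding it to the output if it is. This basically means that every time we encounter a new word, we search our set of existing words twice: once to check whether the word is present, and again to insert it because it wasn't. Most of that second search is essentially identical to the first, unless some other thread mutates the structure in the interim (which would be a race condition).

Instead, I'd start by trying to add it to the set of words you've seen. That returns a pair<iterator, bool>, with the bool set to true if and only if the value was inserted, i.e., it was not previously present. That lets us consolidate the search for an existing string and the insertion of the new string into a single insert: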

while (input >> word)
    if (!(existing.insert(word)).second)
        output.insert(word);

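This also cleans up the flow enough that it's pretty easy to turn the test into a functor that we can then use with std::remove_copy_if to produce our results quite directly: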

#include <set>
#include <iterator>
#include <algorithm>
#include <string>
#include <vector>
#include <iostream>

class show_copies {
    std::set<std::string> existing;
public:
    bool operator()(std::string const &in) {
        return existing.insert(in).second;
    }
};

int main() {
    std::vector<std::string> words{ "words", "words", "are", "fun", "fun" };
    std::set<std::string> result;

    std::remove_copy_if(words.begin(), words.end(),
        std::inserter(result, result.end()), show_copies());

    for (auto const &s : result)
        std::cout << s << "\n";
}

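Depending on whether I cared more about code simplicity or execution speed, I might use a std::vector instead of the set for the result, and use std::sort followed by std::unique_copy to produce the final result. In such a case, I'd probably also replace the std::set inside show_copies with a std::unordered_set instead: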

#include <unordered_set>
#include <iterator>
#include <algorithm>
#include <string>
#include <vector>
#include <iostream>

class show_copies {
    std::unordered_set<std::string> existing;
public:
    bool operator()(std::string const &in) {
        return existing.insert(in).second;
    }
};

int main() {
    std::vector<std::string> words{ "words", "words", "are", "fun", "fun" };
    std::vector<std::string> intermediate;

    std::remove_copy_if(words.begin(), words.end(),
        std::back_inserter(intermediate), show_copies());

    std::sort(intermediate.begin(), intermediate.end());
    std::unique_copy(intermediate.begin(), intermediate.end(),
        std::ostream_iterator<std::string>(std::cout, "\n"));
}

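This is marginally more complex (one whole line longer!) but likely to be faster when/if the number of words gets very large. Also note that I'm using std::unique_copy primarily to produce visible output. If you just want the result in a collection, you can use the standard unique/erase idiom to reduce intermediate to its unique items.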
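For reference, the unique/erase idiom mentioned here might look like the following sketch. The helper name sort_and_dedupe is hypothetical; only std::sort, std::unique, and vector::erase are standard.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: sorts `intermediate` and drops repeated entries in place.
void sort_and_dedupe(std::vector<std::string>& intermediate) {
    std::sort(intermediate.begin(), intermediate.end());

    // std::unique moves the distinct elements to the front and returns the
    // new logical end; erase then trims the leftover tail.
    intermediate.erase(std::unique(intermediate.begin(), intermediate.end()),
                       intermediate.end());
}
```

After the call, intermediate holds each duplicated word exactly once, in alphabetical order.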

score 1 · Accepted Answer

In place (no additional storage). No string copying (except into the result list). One sort + one pass:

#include <string>
#include <vector>
#include <list>
#include <iostream>
#include <algorithm>
using namespace std;

int main() {
        vector<string> vec{"words", "words", "are", "fun", "fun"};
        list<string> dup;

        sort(vec.begin(), vec.end());

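        // Pointer to the previous element; null before the first one, so an
        // empty string at the front is not mistaken for a duplicate.
        const string* prev_p = nullptr;

        for(const string& s: vec) {
                // Record each duplicated word once, even if it occurs 3+ times.
                if (prev_p && *prev_p==s && (dup.empty() || dup.back()!=s))
                        dup.push_back(s);
                prev_p = &s;
        }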

        for(auto& w: dup) cout << w << ' '; 
        cout << '\n';
}

score 0 · Accepted Answer

You can get a pretty clean implementation by using a std::map to count the occurrences, and then relying on std::list::sort to sort the resulting list of words. For example:

std::list<std::string> duplicateWordList(const std::vector<std::string>& words) {
    std::map<std::string, int> temp;
    std::list<std::string> ret;
    for (std::vector<std::string>::const_iterator iter = words.begin(); iter != words.end(); ++iter) {
        temp[*iter] += 1;
        // only add the word to our return list on the second copy
        // (first copy doesn't count, third and later copies have already been handled)
        if (temp[*iter] == 2) {
            ret.push_back(*iter);
        }
    }
    ret.sort();
    return ret;
}

Using a std::map is probably a little wasteful, but it gets the job done.

score 0 · Accepted Answer

Here is a better algorithm than the ones other people have proposed:

#include <algorithm>
#include <string>   // std::string
#include <utility>  // std::swap
#include <vector>

template<class It> It unique2(It const begin, It const end)
{
    It i = begin;
    if (i != end)
    {
        It j = i;
        for (++j; j != end; ++j)
        {
            if (*i != *j)
            { using std::swap; swap(*++i, *j); }
        }
        ++i;
    }
    return i;
}
int main()
{
    std::vector<std::string> v;
    v.push_back("words");
    v.push_back("words");
    v.push_back("are");
    v.push_back("fun");
    v.push_back("words");
    v.push_back("fun");
    v.push_back("fun");
    std::sort(v.begin(), v.end());
    v.erase(v.begin(), unique2(v.begin(), v.end()));
    std::sort(v.begin(), v.end());
    v.erase(unique2(v.begin(), v.end()), v.end());
}

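It's better because it only requires swap, with no auxiliary vector for storage, which means it will behave optimally in earlier versions of C++, and it doesn't require the elements to be copyable.

If you're more clever, I think you can avoid sorting the vector twice as well.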
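One possible way to get by with a single sort, offered only as a sketch and not as the answer author's method: after sorting, walk the runs of equal elements and swap one representative of every run longer than one to the front. Since the runs are visited in sorted order, the kept prefix is already sorted. The names keep_one_of_each_duplicate and reduce_to_duplicates are hypothetical, and the sketch assumes the elements support == and swap.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: on a sorted range, swaps one representative of every
// run of length >= 2 to the front and returns the end of that prefix.
template <class It>
It keep_one_of_each_duplicate(It first, It last) {
    It out = first;  // end of the prefix of kept elements
    while (first != last) {
        // Find the end of the run of elements equal to *first.
        It run_end = std::next(first);
        bool repeated = false;
        while (run_end != last && *run_end == *first) {
            ++run_end;
            repeated = true;
        }
        if (repeated) {
            if (out != first) {
                using std::swap;
                swap(*out, *first);  // move one representative forward
            }
            ++out;
        }
        first = run_end;  // continue with the next run
    }
    return out;
}

// Hypothetical wrapper: leaves one copy of each duplicated element in v.
void reduce_to_duplicates(std::vector<std::string>& v) {
    std::sort(v.begin(), v.end());  // the only sort
    v.erase(keep_one_of_each_duplicate(v.begin(), v.end()), v.end());
}
```

Like unique2, this only swaps elements and never copies them, so it should work for move-only element types as well.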
