
Hi, I have some C code with a 2D character array:

names[100][20] // currently supports a maximum of 100 names, each of up to 19 characters

This array is filled with names by some logic. I keep track of the total number of names actually found in the variable names_found (there may be fewer than 100 names).

Now I want to remove any duplicate names that may exist. What I plan to do is something like this:

for (i = 0; i < names_found; i++)
{
    for (j = i + 1; j < names_found; j++)
    {
        // Compare (strcmp) each string/name with every other.
        // e.g. if there were 4 names the comparisons done would be
        // {name[0],name[1]}, {name[0],name[2]}, {name[0],name[3]},
        // {name[1],name[2]}, {name[1],name[3]}
        // and {name[2],name[3]}.
        // Then some more logic to remove duplicates based on the strcmp
        // results. I don't know what this logic would look like to store
        // the result in place, in the same 2D character buffer.
    }
}
Is this duplicate-removal logic functionally correct?

How can I optimize it for speed?

Is there any better/faster solution?


3 Answers


There are many ways to do this faster, but not necessarily for a set this small. Also, the logic for removing a name may take longer than you think, because removing an entry leaves a gap in the array that you have to deal with, or you need to memmove() your remaining entries down to fill it.

A Boyer-Moore-type search might speed things up, but depending on how fast your strcmp function is, you may not gain anything because of the overhead of setting up the lookup tables. If you set things up right, you might be able to use strstr() for your search instead, which may indeed use a more advanced search algorithm.

Basically, your set is so small that optimization here is probably a bit premature.

Answered 2011-07-16T22:45:03.093

This is a simple approach. It assumes that the order of the names isn't important:

for (i = 0; i < names_found; i ++)
{
    j = i + 1;
    while (j < names_found)
    {
        if (strcmp(names[i], names[j]) == 0)
        {
            /* Overwrite the duplicate with the last name and shrink the
             * array; j is not advanced, so the moved-in name gets checked
             * against names[i] on the next iteration too. */
            memmove(names + j, names + (names_found - 1), sizeof(names[0]));
            -- names_found;
        }
        else
            ++ j;
    }
}
Answered 2011-07-16T23:35:58.697

It is logically OK: for each array element, search whether there are equal elements among the following ones, and if so, remove them. But you need to update the array size dynamically; e.g. if you remove 3 duplicates of the first element, the remaining number of elements is smaller than names_found, so you need to update it accordingly.

You can make it faster if you sort the array first (with a fast sorting algorithm; the best choice may depend on the size of the data), so that duplicates all end up "side by side". Using a destination array would also be faster, since if you find N duplicates you do not need to move all the other array elements back by N positions (in the worst case you need a destination array the same size as the source array).

Another approach is to use a hash container; in this case you need a library (e.g. glib has a hashtable "object"), and your code would look different (e.g. you can "skip" duplicates as you populate the hashtable).

Answered 2011-07-16T22:52:32.337