c++ - Similar strings among many strings

Question

I have a large number of strings, say N, and I need to find the sets of similar strings among them.
Example:

Input:
programmable stackover tree stackoverflow trie program oddoneout

Output:
set 1: programmable program
set 2: stackoverflow stackover
set 3: tree trie
set 4: oddoneout

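So, what would be an efficient algorithm (in both space and time)?

1) Using Levenshtein distance is a good approach, but we would still have to compare each string against all the other n-1 strings.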

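2) Using a trie might work well (as far as prefixes go), but it is not the best approach, because it cannot properly compare strings such as algorithm and alqkefgjwfjfwfkvfvjs, which share the prefix "al" but are not similar at all.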

Similarity of two strings: 1) the fewer characters that differ between them, the more similar they are; 2) one string can be transformed into the other by changing or adding just a few characters in one or both strings.
Please share your views.

Please do not post about external software and the like.

score 0 · Accepted Answer

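Could you build a point-based system in which each matching character scores 1, say, a letter that sounds similar (or sits next to it on the keyboard, or is phonetically close) scores 0.5 or so, and every other mismatch scores zero?

So, you have tree and you want to find similar words.

program scores 1, because only the r matches in the correct position.

trie gets 3.

And maybe something like trwe gets 3.5, for example.

But then you would look at the score with a tolerance. That tolerance determines how close you want the match to be.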
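A minimal C++ sketch of this scoring idea, for illustration only. The names `positionalScore` and `isNearKey` are made up here, and "close" is assumed to mean horizontally adjacent keys on a QWERTY row; the answer leaves that notion open. Characters are compared position by position, input is assumed to be lowercase, and extra characters in the longer string score nothing.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: true when two different letters sit next to each
// other on the same QWERTY row (one possible meaning of "close").
bool isNearKey(char a, char b) {
    static const std::string rows[] = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};
    for (const std::string& row : rows) {
        const std::size_t i = row.find(a);
        const std::size_t j = row.find(b);
        if (i != std::string::npos && j != std::string::npos) {
            return (i > j ? i - j : j - i) == 1;
        }
    }
    return false;  // not on the same row, or not a lowercase letter
}

// 1 point per exact match at the same position, 0.5 per "near" match,
// 0 otherwise. Only the overlapping prefix of the two strings is scored.
double positionalScore(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    double score = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] == b[i]) {
            score += 1.0;
        } else if (isNearKey(a[i], b[i])) {
            score += 0.5;
        }
    }
    return score;
}
```

Under these assumptions, `positionalScore("tree", "program")` gives 1, `positionalScore("tree", "trie")` gives 3 and `positionalScore("tree", "trwe")` gives 3.5, matching the figures above. A pair would then count as similar when its score clears whatever tolerance you pick.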

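But it really depends on what you are looking for.

This is completely off the top of my head, so I am not sure how well it would work. Just an idea.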

score 0 · Accepted Answer

Your constraints on the similarity of two strings sound like the edit distance problem:

http://en.wikipedia.org/wiki/Levenshtein_distance

You can obtain the minimum edit distance between two strings with a dynamic programming algorithm in O(N×M), where N and M are the lengths of the two strings.
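For reference, a sketch of that dynamic program in C++. The function name `editDistance` is just chosen for this example, and it assumes unit cost for insertion, deletion and substitution. It keeps only two rows of the table, so time is O(N×M) and extra memory is proportional to the length of the second string.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Levenshtein distance with unit costs, computed row by row.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1);
    std::vector<std::size_t> cur(b.size() + 1);

    // Row 0: turning the empty prefix of a into b[0..j) takes j insertions.
    std::iota(prev.begin(), prev.end(), std::size_t{0});

    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;  // turning a[0..i) into the empty string takes i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({prev[j] + 1,          // delete a[i-1]
                               cur[j - 1] + 1,       // insert b[j-1]
                               prev[j - 1] + cost}); // match or substitute
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

For the strings in the question, `editDistance("tree", "trie")` is 1, `editDistance("stackover", "stackoverflow")` is 4 and `editDistance("program", "programmable")` is 5.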

You can set a threshold that says how similar your strings have to be. After setting this number, you can try an all-against-all algorithm that computes the minimum edit distance between every pair of strings. I think you can build the sets from that information.
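One way the all-against-all step could look in C++, as a rough sketch. How the sets are built from the pairwise distances is not spelled out above, so this assumes a simple rule: any two strings within `maxDistance` of each other end up in the same set, with sets merged transitively through a small union-find. `groupSimilar` and `maxDistance` are names invented for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Levenshtein distance with unit costs (two-row dynamic program).
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    std::iota(prev.begin(), prev.end(), std::size_t{0});
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Compares every pair and merges strings whose distance is <= maxDistance.
// Sets are returned in order of first appearance in the input.
std::vector<std::vector<std::string>> groupSimilar(
        const std::vector<std::string>& words, std::size_t maxDistance) {
    std::vector<std::size_t> parent(words.size());
    std::iota(parent.begin(), parent.end(), std::size_t{0});

    // Union-find root lookup with path halving.
    auto find = [&parent](std::size_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    };

    for (std::size_t i = 0; i < words.size(); ++i) {
        for (std::size_t j = i + 1; j < words.size(); ++j) {
            if (editDistance(words[i], words[j]) <= maxDistance) {
                const std::size_t ri = find(i);
                const std::size_t rj = find(j);
                if (ri != rj) parent[ri] = rj;
            }
        }
    }

    // Collect members per root; words.size() marks "no set assigned yet".
    std::vector<std::vector<std::string>> groups;
    std::vector<std::size_t> groupOf(words.size(), words.size());
    for (std::size_t i = 0; i < words.size(); ++i) {
        const std::size_t r = find(i);
        if (groupOf[r] == words.size()) {
            groupOf[r] = groups.size();
            groups.emplace_back();
        }
        groups[groupOf[r]].push_back(words[i]);
    }
    return groups;
}
```

With the question's seven strings and `maxDistance` set to 5, this yields the four sets from the example. A fixed threshold that large would lump together most short words, though, so for inputs of mixed length a threshold relative to string length may be a better fit.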

If you know that the strings in your problem will be short (say length < 100), this approach could be a good solution.

Edit:

Let K be the number of strings you want to classify into sets and let N be the average length of your strings. The complexity of the algorithm I'm proposing is O(K^2 × N^2).

(that's why we want N to be a small number)
