另一种选择是库https://github.com/maxbachmann/rapidfuzz-cpp(我是作者),其中包含一个名为partial_ratio
. 它执行以下步骤:
- 搜索两个字符串的最长公共子序列
- 迭代这些子序列并计算较短字符串和较长字符串的子字符串之间的标准化 Levenshtein 距离,使用 len(shorter string),从子序列的开头开始
- 返回最佳对齐的分数
举个例子:
#include "rapidfuzz/fuzz.hpp"
#include <string>
std::string a = "txst";
std::string b = "this is a test";
double score = rapidfuzz::fuzz::partial_ratio(a, b);
// score is 75.0
在此示例中,计算了最佳对齐“txst”<->“test”的相似性。由于只需要一次替换,因此相似度为 75%。
获取编辑距离而不是相对相似度
正如您在评论中指出的那样,您对编辑距离而不是相对相似性感兴趣。这是 partial_ratio 的重新实现,它执行此操作:
template <typename Sentence1, typename Sentence2>
std::size_t partial_ratio(const Sentence1& s1, const Sentence2& s2, std::size_t max=-1)
{
auto s1_view = rapidfuzz::common::to_string_view(s1);
auto s2_view = rapidfuzz::common::to_string_view(s2);
if (s1_view.empty() && s2_view.empty()) {
return 0;
}
if (s1_view.empty() || s2_view.empty()) {
return -1;
}
// this swap is performed in the original FuzzyWuzzy implementation, so
// I use it in my implementation as well. Depending on your goals you
// might want to remove this swap
if (s1_view.length() > s2_view.length()) {
return partial_ratio(s2_view, s1_view, max);
}
// Note this is a internal API and so it could change at any time
// currently this is the slowest part, since my bitparallel
// implementation of the LCS has a bug and so it is disabled for now
auto blocks = rapidfuzz::detail::get_matching_blocks(s1_view, s2_view);
// when there is a full match exit early
for (const auto& block : blocks) {
if (block.length == s1_view.length()) {
return 0;
}
}
// find the edit distance at the places with the longest common subsequence
std::size_t min_dist = (std::size_t)-1;
for (const auto& block : blocks) {
std::size_t long_start = (block.dpos > block.spos) ? block.dpos - block.spos : 0;
auto long_substr = s2_view.substr(long_start, s1_view.length());
// Note you could use a different weight here like
// e.g. {1, 1, 1} for the normal Levenshtein distance
// only {1, 1, 1} and {1, 1, 2} have very fast implementations
std::size_t dist = rapidfuzz::string_metric::levenshtein(
s1_view, long_substr, {1, 1, 2}, max);
if (dist < min_dist) {
max = min_dist = dist;
}
}
return min_dist;
}
s1
并且s2
可以是任何可以转换为 rapidfuzz::basic_string_view 的类型。一些例子是:
- std::basic_string (std::string, ...)
- 从 C++17 开始的 std::basic_string_view (std::string_string, ...)
- c 字符串,如 char*(这必须在创建 string_view 时找到一次长度)
- 许多其他类型,它们具有
data()
并.size()
使用连续内存。例如 boost::string_view 或 std::vector
对于编辑距离高于最大值(size_t)-1) i
的结果,将改为返回。