c++ - 有效地检查一个字符串是否是另一个字符串的近似子字符串（近似约束），直到给定的错误阈值？

Question

取两个 C 或 C++ 中的字符串，s1和s2. 检查一个是否包含另一个是相当简单的。

true如果s2是的子字符串，则以下内容将返回s1。

在 C 中：

strstr(s1, s2)

在 C++ 中：

#include <string>
str.find(str2) != string::npos

带升压：

#include <boost/algorithm/string.hpp>
boost::algorithm::contains(s1, s2)

我正在寻找的是一种有效的类似方法（在速度方面，而不是内存方面）查找字符串是否大约包含在 / 中是否大约是另一个的子字符串，直到给定的差异阈值。agrep与 Unix 系统中用于查找文件的 bash 命令非常相似。

例如，假设函数被调用approx_contains。如果包含在最大为 4 的 Levenshtein 距离内，则approx_contains(s1, s2, 4)可以返回真/假。s2s1

在网上搜索，我只发现了许多关于如何计算两个字符串之间的 Levenshtein 距离的参考资料和问题，或者是近似字符串模式匹配的理论算法，而不仅仅是检查一个字符串是否包含另一个字符串——这在这里变得很浪费。

非常努力地避免重新发明轮子，有人怎么能在 C 或 C++ 中进行这样的检查？

score 2 · Accepted Answer

另一种选择是库https://github.com/maxbachmann/rapidfuzz-cpp（我是作者），其中包含一个名为partial_ratio. 它执行以下步骤：

搜索两个字符串的最长公共子序列
迭代这些子序列并计算较短字符串和较长字符串的子字符串之间的标准化 Levenshtein 距离，使用 len(shorter string)，从子序列的开头开始
返回最佳对齐的分数

举个例子：

#include "rapidfuzz/fuzz.hpp"
#include <string>
std::string a = "txst";
std::string b = "this is a test";
double score = rapidfuzz::fuzz::partial_ratio(a, b);
// score is 75.0

在此示例中，计算了最佳对齐“txst”<->“test”的相似性。由于只需要一次替换，因此相似度为 75%。

获取编辑距离而不是相对相似度

正如您在评论中指出的那样，您对编辑距离而不是相对相似性感兴趣。这是 partial_ratio 的重新实现，它执行此操作：

template <typename Sentence1, typename Sentence2>
std::size_t partial_ratio(const Sentence1& s1, const Sentence2& s2, std::size_t max=-1)
{
  auto s1_view = rapidfuzz::common::to_string_view(s1);
  auto s2_view = rapidfuzz::common::to_string_view(s2);

  if (s1_view.empty() && s2_view.empty()) {
    return 0;
  }

  if (s1_view.empty() || s2_view.empty()) {
    return -1;
  }

  // this swap is performed in the original FuzzyWuzzy implementation, so
  // I use it in my implementation as well. Depending on your goals you
  // might want to remove this swap
  if (s1_view.length() > s2_view.length()) {
    return partial_ratio(s2_view, s1_view, max);
  }

  // Note this is a internal API and so it could change at any time
  // currently this is the slowest part, since my bitparallel
  // implementation of the LCS has a bug and so it is disabled for now
  auto blocks = rapidfuzz::detail::get_matching_blocks(s1_view, s2_view);

  // when there is a full match exit early
  for (const auto& block : blocks) {
    if (block.length == s1_view.length()) {
      return 0;
    }
  }

  // find the edit distance at the places with the longest common subsequence
  std::size_t min_dist = (std::size_t)-1;
  for (const auto& block : blocks) {
    std::size_t long_start = (block.dpos > block.spos) ? block.dpos - block.spos : 0;
    auto long_substr = s2_view.substr(long_start, s1_view.length());

    // Note you could use a different weight here like
     // e.g. {1, 1, 1} for the normal Levenshtein distance
    // only {1, 1, 1} and {1, 1, 2} have very fast implementations
    std::size_t dist = rapidfuzz::string_metric::levenshtein(
        s1_view, long_substr, {1, 1, 2}, max);

    if (dist < min_dist) {
      max = min_dist = dist;
    }
  }

  return min_dist;
}

s1并且s2可以是任何可以转换为 rapidfuzz::basic_string_view 的类型。一些例子是：

std::basic_string (std::string, ...)
从 C++17 开始的 std::basic_string_view (std::string_string, ...)
c 字符串，如 char*（这必须在创建 string_view 时找到一次长度）
许多其他类型，它们具有data()并.size()使用连续内存。例如 boost::string_view 或 std::vector

对于编辑距离高于最大值(size_t)-1) i的结果，将改为返回。

score 1 · Accepted Answer

我有一个几年前制作的测试程序，但它仍然有效。模式的长度受限于 CPU 寄存器中的位数。此处描述：https ://en.wikipedia.org/wiki/Bitap_algorithm 。在本网站的“另请参阅”部分中，有一个开源库的链接，https://en.wikipedia.org/wiki/TRE_(computing)

#include <iostream>
#include <string>
#include <vector>
#include <ctype.h>

using namespace std;

typedef unsigned int     UINT;
typedef unsigned __int64 UINT64;

// #define WIDECHAR

#if defined(WIDECHAR)
typedef wstring       String;
typedef wchar_t       Char;
#define CIN           wcin
#define COUT          wcout
#else
typedef string        String;
typedef unsigned char Char;
#define CIN           cin
#define COUT          cout
#endif

template<typename T> T inputValue(const char *prompt) {
  for(;;) {
    COUT << prompt;
    T value;
    CIN >> value;
    if(CIN) return value;
    CIN.clear();
  }
}

// https://en.wikipedia.org/wiki/Bitap_algorithm

class ShiftOr {
private:
#if defined(WIDECHAR)
  static constexpr size_t s_masklength = (0xffff + 1);
#else
  static constexpr size_t s_masklength = (0xff + 1);
#endif // WIDECHAR
  UINT64   m_s;
  UINT     m_patternLen;
  UINT64  *m_mask;

  static constexpr size_t s_maskSizeInBytes = s_masklength * sizeof(m_mask[0]);

  void initMask(const UINT64 *src = nullptr) {
    if(m_mask == nullptr) {
      m_mask = new UINT64[s_masklength];
    }
    if(src) {
      memcpy(m_mask, src, s_maskSizeInBytes); // copy all value from src into m_mask
    } else {
      memset(m_mask, -1, s_maskSizeInBytes); // set all bits in m_mask-array to 1
    }
  }
  void deallocMask() {
    delete[] m_mask;
  }
public:
  ShiftOr()
    : m_s(         0      )
    , m_patternLen(0      )
    , m_mask(      nullptr)
  {
  }
  ShiftOr(const ShiftOr &src)
    : m_s(         src.m_s         )
    , m_patternLen(src.m_patternLen)
    , m_mask(      nullptr         )
  {
    if(src.m_mask) {
      initMask(src.m_mask);
    }
  }
  ShiftOr(const String &pattern, bool ignoreCase=false)
    : m_s(         0      )
    , m_patternLen(0      )
    , m_mask(      nullptr)
  {
    compilePattern(pattern, ignoreCase);
  }
  ShiftOr &operator=(const ShiftOr &src) {
    m_s          = src.m_s;
    m_patternLen = src.m_patternLen;
    if(src.m_mask) {
      initMask(src.m_mask);
    } else {
      deallocMask();
    }
    return *this;
  }
  virtual ~ShiftOr() {
    deallocMask();
  }
  void     compilePattern(const String &pattern, bool ignoreCase=false);
  intptr_t search(        const String &str                           ) const;
  intptr_t searchApprox(  const String &str, UINT maxErrors           ) const;
};

void ShiftOr::compilePattern(const String &pattern, bool ignoreCase) {
  m_patternLen = (UINT)pattern.length();
  if(m_patternLen >= 64) {
    throw string("pattern too long for shiftor-search. max length is 63");
  }
  initMask();
  for(UINT i = 0; i < m_patternLen; i++) {
    const Char ch = pattern[i];
    m_mask[ch] &= ~((UINT64)1 << i);
    if(ignoreCase) {
      if(iswlower(ch)) {
        m_mask[_toupper(ch)] &= ~((UINT64)1 << i);
      } else if(isupper(ch)) {
        m_mask[_tolower(ch)] &= ~((UINT64)1 << i);
      }
    }
  }
  m_s = (UINT64)1 << m_patternLen;
}

intptr_t ShiftOr::search(const String &str) const {
  const UINT64 maskEnd  = m_s;
  UINT64       s        = ~1;
  const Char *start = (Char*)str.c_str(), *end = start + str.length();
  for(const Char *cp = start; cp < end;) {
    s = (s | m_mask[*(cp++)]) << 1;
    if((s & maskEnd) == 0) {
      return cp - start - m_patternLen;
    }
  }
  return -1;
}

intptr_t ShiftOr::searchApprox(const String &str, UINT maxErrors) const {
  if(maxErrors == 0) {
    return search(str);
  }
  const UINT64     maskEnd  = m_s;
  vector<UINT64>   s;
  for(UINT i = 0; i < maxErrors + 1; i++) {
    s.push_back((UINT64)~1);
  }
  UINT64 *sfirst = s.data(), *slast = sfirst + maxErrors;
  const Char *firstChar = (Char*)str.c_str(), *lastChar = firstChar + str.length();
  for(const Char *cp = firstChar; cp < lastChar;) {
    const UINT64 mask = m_mask[*cp++];
    UINT64      *sp   = sfirst, olds = *sfirst;
    *sp = (olds | mask) << 1;

    while(sp++ < slast) {
      const UINT64 tmp = *sp;
      /* Substitution is all we care about */
      *sp = (olds & (tmp | mask)) << 1;
      olds = tmp;
    }
    if((*slast & maskEnd) == 0) {
      return cp - firstChar - m_patternLen;
    }
  }
  return -1;
}

int main(int argc, char **argv) {
  for(;;) {
    const String   pattern    = inputValue<String>("Enter pattern:");
    const bool     ignoreCase = inputValue<Char>("Ignore case[yn]:") == 'y';
    const ShiftOr  A(pattern, ignoreCase);
    const UINT     maxErrors  = inputValue<UINT>("Enter maxErrors:");
    for(;;) {
      const String text = inputValue<String>("Enter text:");
      if((text.length() > 0) && text[0] == '!') {
        break;
      }
      const intptr_t i = A.searchApprox(text,maxErrors);
      cout << "result:" << i << endl;
    }
  }
  return 0;
}

c++ - 有效地检查一个字符串是否是另一个字符串的近似子字符串（近似约束），直到给定的错误阈值？

2 回答 2

获取编辑距离而不是相对相似度

Related

Reference