What is the fastest way to compare two strings? (Quality matters too, but less than speed.)

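I'm looking for the most efficient way to compare two strings. Some of the strings I compare can be longer than 5000 characters. I'm comparing a list of about 80 strings against another list of about 200 strings. It takes a long time, even though I'm running it on multiple threads. I'm using the StringUtils.getLevenshteinDistance(String s, String t) method from Apache Commons. My method is below. Is there a better way?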

private void compareMe() {
  List<String> compareStrings = MainController.getInstance().getCompareStrings();
  for (String compare : compareStrings) {
    int levenshteinDistance = StringUtils.getLevenshteinDistance(me, compare);
    if (bestScore > levenshteinDistance
          && levenshteinDistance > -1) {
      bestScore = levenshteinDistance; //global variable
      bestString = compare; //global variable
    }
  }
}

Here is an example of two strings that should get a good score:

String 1:

SELECT 
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in 
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = '${request.corp_vendor_id};')

String 2:

SELECT 
CORP_VENDOR_NAME as "Corporate Vendor Name",
CORP_VENDOR_REF_ID as "Reference ID",
MERCHANT_ID as "Merchant ID",
VENDOR_CITY as "City",
VENDOR_STATE as "State",
VENDOR_ZIP as "Zip",
VENDOR_COUNTRY as "Country",
REMIT_VENDOR_NAME as "Remit Name",
REMIT_VENDOR_REF_ID as " Remit Reference ID",
VENDOR_PRI_UNSPSC_CODE as "Primary UNSPSC"
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE VENDOR_REFERENCE_ID in 
(SELECT distinct CORP_VENDOR_REF_ID
FROM DSS_FIN_USER.ACQ_VENDOR_DIM
WHERE CORP_VENDOR_REF_ID = 'ACQ-169013')

You'll notice that the only difference is the '${request.corp_vendor_id};' at the end of the string. That alone gives it a score of 26 from the LevenshteinDistance method.

1 Answer

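You should look for shortcuts in your comparison logic that let you skip some of the computation. If your goal is to find the global minimum Levenshtein distance, you don't even need to compute the distance for a pair of strings whose difference in length is greater than your current best Levenshtein distance, because the distance can never be smaller than that difference in length.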

For example, if your current best Levenshtein distance is 50, you can skip comparing two strings of lengths 100 and 180, because their Levenshtein distance is at least 80.
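A minimal sketch of that shortcut, under some assumptions: the class name ClosestMatch and the method findClosest are hypothetical, and the small levenshtein helper only stands in for StringUtils.getLevenshteinDistance so the example runs without Apache Commons. It skips a candidate when the length difference is equal to or greater than the current best, since a tie would not replace the best match in the question's loop either.

```java
import java.util.List;

/** Sketch: best-match search that skips candidates ruled out by length alone. */
final class ClosestMatch {

    /** Plain two-row Levenshtein distance; a stand-in for the Apache Commons call. */
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // distance from the empty prefix of s
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // the row just filled becomes the previous row
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Returns the candidate closest to me, or null if the list is empty. */
    static String findClosest(String me, List<String> candidates) {
        int bestScore = Integer.MAX_VALUE;
        String bestString = null;
        for (String candidate : candidates) {
            // The distance is never smaller than the length difference,
            // so this candidate cannot beat the current best: skip it.
            if (Math.abs(me.length() - candidate.length()) >= bestScore) {
                continue;
            }
            int distance = levenshtein(me, candidate);
            if (distance < bestScore) {
                bestScore = distance;
                bestString = candidate;
            }
        }
        return bestString;
    }
}
```

The check pays off most once a good match has been found early, so trying candidates whose length is closest to the input first may help. If your version of Commons Lang has the three-argument getLevenshteinDistance overload that takes a threshold, passing the current best score as the threshold is a related way to cut work.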

answered 2012-05-24T19:51:52.057