multidimensional-array - Optimising the Levenshtein distance algorithm in Salesforce

Question

I have a custom object called customer, with fields such as Customer_Name, Address_Line_1 and Post_Code.

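I want to go through all the records and compare the Customer_Name values for similarity (based on fuzzy search or Levenshtein distance). If the similarity is above or below a certain threshold, a custom field (Possible_Duplicate_Customer_ID__c) is updated to identify a possible duplicate.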

I have managed to implement this, but I have run into two problems:

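1) Salesforce governor limits are exceeded (Too many script statements: 200001), probably because of the large number of loops the Levenshtein distance algorithm requires.

2) The list I submit for update (newList) also contains duplicate IDs.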

private static List<Customer__c> newList = new List<Customer__c>();

webService static Integer findDupes() {

    Integer returnCount = 0;
    Double cost = 0;
    Integer COST_THRESHOLD = 5;

    Map<id,Customer__c> cMap = new Map<id,Customer__c>([
        select ID, Name, Customer_Name__c, Possible_Duplicate_Customer_ID__c 
        from Customer__c 
    ]);

    List<Customer__c> custList1 = cMap.values();        
    List<Customer__c> custList2 = custList1.clone();

    for (Customer__c cust1 : custList1) {
        for (Customer__c cust2 : custList2) {
            cost = LevenshteinDistance.computeLevenshteinDistance(
                    cust1.Customer_Name__c, cust2.Customer_Name__c);
            if (cost < COST_THRESHOLD && cost != 0) {
                Customer__c c = new Customer__c(
                    id = cust2.Id,
                    Possible_Duplicate_Customer_ID__c = cust1.Name
                );
                newList.add(c);
            }
            System.debug(cost + ' edits to transform '
                    + cust1.Customer_Name__c + ' to ' + cust2.Customer_Name__c);
        }
    }

    returnCount = newList.size();

    update newList;        
    return returnCount;
}

score 2 · Accepted Answer

Have you tried the new getLevenshteinDistance method on String?
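As a rough illustration of that suggestion, the built-in String method comes in an unbounded form and a form that takes a threshold and returns -1 once the distance exceeds it. The sample names and the threshold of 5 below are assumptions taken from the question's code, not part of this answer, and the method is called on a non-null String, so null names need a guard first.

```apex
// Sketch only: sample values are made up; COST_THRESHOLD mirrors the question.
Integer COST_THRESHOLD = 5;
String nameA = 'Acme Holdings Ltd';
String nameB = 'Acme Holding Limited';

// Unbounded form: always computes the full edit distance.
Integer fullDistance = nameA.getLevenshteinDistance(nameB);

// Bounded form: returns the distance if it is <= the given threshold, otherwise -1.
// COST_THRESHOLD - 1 reproduces the question's "cost < COST_THRESHOLD" test.
Integer boundedDistance = nameA.getLevenshteinDistance(nameB, COST_THRESHOLD - 1);

// A result > 0 excludes both -1 (too different) and 0 (identical strings),
// which matches the question's "cost != 0" condition.
Boolean possibleDuplicate = boundedDistance > 0;
System.debug('full=' + fullDistance + ', bounded=' + boundedDistance
        + ', possibleDuplicate=' + possibleDuplicate);
```

Because the method is a single platform call, it replaces the hand-written inner loops of a custom Levenshtein class, which is where most of the counted statements were being spent.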

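See also my question/approach here. I insist that matches are only returned for records in the same country with the same post code or city, which cuts down the number of initial matches.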
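One way to sketch that kind of pre-filtering is to bucket the records by post code and run the distance check only inside each bucket. Post_Code__c is an assumed API name for the question's Post_Code field, the threshold of 4 is illustrative, and the real approach referred to above may differ.

```apex
// Sketch only: Post_Code__c is an assumed field API name.
Map<String, List<Customer__c>> byPostCode = new Map<String, List<Customer__c>>();
for (Customer__c c : [SELECT Id, Name, Customer_Name__c, Post_Code__c
                      FROM Customer__c
                      WHERE Post_Code__c != null AND Customer_Name__c != null]) {
    // Normalise the key so 'ab1 2cd' and 'AB12CD' land in the same bucket.
    String key = c.Post_Code__c.deleteWhitespace().toUpperCase();
    if (!byPostCode.containsKey(key)) {
        byPostCode.put(key, new List<Customer__c>());
    }
    byPostCode.get(key).add(c);
}

for (List<Customer__c> bucket : byPostCode.values()) {
    // j starts at i + 1, so each pair is visited once and never against itself.
    for (Integer i = 0; i < bucket.size(); i++) {
        for (Integer j = i + 1; j < bucket.size(); j++) {
            Integer cost = bucket[i].Customer_Name__c
                    .getLevenshteinDistance(bucket[j].Customer_Name__c, 4);
            if (cost > 0) {
                System.debug(bucket[i].Name + ' may duplicate ' + bucket[j].Name);
            }
        }
    }
}
```

With n records spread evenly over k post codes, the number of comparisons drops from roughly n²/2 to roughly n²/(2k).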

score 1 · Accepted Answer

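I would recommend running the code in a class that implements the Batchable interface, which is better suited to processing large volumes of data. Since your web service takes no input, you could run the batch on an hourly schedule, flag the duplicates by marking the records, and then simply pull those records in the web service. Of course, if you need this to be real-time, you will have to optimise the loop instead.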
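A minimal skeleton of that arrangement might look like the following. The class name, job name and batch size are hypothetical, and the comparison logic is left as a comment because a batch chunk only sees its own scope, so how to find candidates across chunks is a design decision this sketch does not make.

```apex
// Sketch only: CustomerDupeBatch and the batch size of 200 are hypothetical.
global class CustomerDupeBatch implements Database.Batchable<sObject>, Schedulable {

    // Batchable: define the full set of records to process.
    global Database.QueryLocator start(Database.BatchableContext bc) {
        return Database.getQueryLocator(
            'SELECT Id, Name, Customer_Name__c, Possible_Duplicate_Customer_ID__c ' +
            'FROM Customer__c WHERE Customer_Name__c != null');
    }

    // Batchable: called once per chunk, each call with its own governor limits.
    global void execute(Database.BatchableContext bc, List<sObject> scope) {
        List<Customer__c> customers = (List<Customer__c>) scope;
        // Compare these records against their candidates here and update
        // Possible_Duplicate_Customer_ID__c on the ones that match.
    }

    global void finish(Database.BatchableContext bc) {}

    // Schedulable: each scheduled run starts a new batch job.
    global void execute(SchedulableContext sc) {
        Database.executeBatch(new CustomerDupeBatch(), 200);
    }
}
```

It could then be scheduled once from anonymous Apex, for example at the top of every hour with `System.schedule('Customer dupe scan', '0 0 * * * ?', new CustomerDupeBatch());`, leaving the web service to query only the records that already carry a value in Possible_Duplicate_Customer_ID__c.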

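As for the duplicate IDs in the update list, your use of cust2.Id for the update should take care of that, but you do not appear to guard against comparing a customer record with itself! This should fix it: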

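for (Customer__c cust1 : custList1) {
    for (Customer__c cust2 : custList2) {
        // Skip the case where a record is compared with itself
        if (cust1.Id == cust2.Id) {
            continue;
        }
        // ... existing comparison and newList.add(c) logic ...
    }
}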
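Separately from the self-comparison guard, a record that resembles several other records is added to newList once per match, and Apex rejects an update whose list contains the same Id twice. A possible sketch is to collect the updates in a Map keyed by Id. Keeping the first match is an arbitrary choice made here, and the use of the built-in getLevenshteinDistance with a threshold of 4 is an assumption standing in for the question's own helper class.

```apex
// Sketch only: keeps the first match per record; other policies are possible.
List<Customer__c> custList1 = [SELECT Id, Name, Customer_Name__c
                               FROM Customer__c
                               WHERE Customer_Name__c != null];
List<Customer__c> custList2 = custList1.clone();

// Keyed by record Id, so each record can appear in the update at most once.
Map<Id, Customer__c> updatesById = new Map<Id, Customer__c>();

for (Customer__c cust1 : custList1) {
    for (Customer__c cust2 : custList2) {
        if (cust1.Id == cust2.Id) {
            continue;
        }
        // Returns -1 when the distance exceeds 4, so > 0 means "close but not equal".
        Integer cost = cust1.Customer_Name__c
                .getLevenshteinDistance(cust2.Customer_Name__c, 4);
        if (cost > 0 && !updatesById.containsKey(cust2.Id)) {
            updatesById.put(cust2.Id, new Customer__c(
                Id = cust2.Id,
                Possible_Duplicate_Customer_ID__c = cust1.Name));
        }
    }
}

update updatesById.values();
Integer returnCount = updatesById.size();
```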

score 0 · Accepted Answer

Lev distance is a great tool for fuzzy matching, but basically unusable in Apex due to the script statement limits. Using a version I've got lying around (adapted from an old version of Apex Lang), comparing "0123456789" to "0246803579" takes 700+ script statements. Comparing "actual resource usage has basically no correlation with number of lines of code executed" to "yeah but annoying a 'few' advanced developers will allow us to cut corners during governor limit implementation" takes 60,000 script statements. Unless you're doing small numbers of small comparisons, or have somehow rewritten Lev to be more script statement friendly, it's going to be difficult to justify on the platform.

I've taken to using cheaper proxies for Lev in Apex, like Soundex for name or short word comparison, or fancy dynamic SOQL "LIKE" statements. If what you're trying to do can somehow be distilled into set operations, those give you a good bang for the buck in Apex since .contains only costs you one script execution.
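To illustrate the set-operation idea, here is a rough sketch that reduces each name to a normalised key and detects collisions with a single Map lookup per record. The normalisation rule (lower-case, letters and digits only) is an assumption chosen for the example. It only catches names that differ in case, spacing or punctuation, so it is a cheaper proxy rather than a replacement for edit distance.

```apex
// Sketch only: the normalisation rule is an illustrative assumption.
Map<String, Customer__c> firstSeen = new Map<String, Customer__c>();
List<Customer__c> flagged = new List<Customer__c>();

for (Customer__c c : [SELECT Id, Name, Customer_Name__c
                      FROM Customer__c
                      WHERE Customer_Name__c != null
                      ORDER BY CreatedDate]) {
    // 'Smith & Sons Ltd.' and 'SMITH AND SONS LTD' do not collide,
    // but 'Smith & Sons Ltd.' and 'smith &sons ltd' do.
    String key = c.Customer_Name__c.toLowerCase().replaceAll('[^a-z0-9]', '');
    if (String.isBlank(key)) {
        continue; // nothing left to compare on
    }
    if (firstSeen.containsKey(key)) {
        // One lookup per record instead of one comparison per pair of records.
        flagged.add(new Customer__c(
            Id = c.Id,
            Possible_Duplicate_Customer_ID__c = firstSeen.get(key).Name));
    } else {
        firstSeen.put(key, c);
    }
}

update flagged;
```

Each record is visited once, so it can be added to flagged at most once and the work grows linearly with the number of records rather than quadratically.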

If you really need to do lots of Lev, you may have to resort to using the API or rewriting the code to be a lot more line-compact. Pushing computation into the browser may also be an option depending on your use case.
