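I am working on an entity-matching problem where I have to check whether records refer to the same business entity. Below are two records, each holding two names separated by a double pipe; the names on either side of the pipe refer to the same entity. In the first record the common part is Fairvill, and in the second it is Walmart 901. Is there any string-matching function that can perform this kind of comparison?

I tried soundex and fuzzywuzzy in Python, but the results were not very helpful. Any help is much appreciated.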

FAIRVILLE NY DPS 7026||WALMART SFAIRVILLUTUSA
WALMART DEPOT 901||PRICEWALMART SLC DRY A0901

1 Answer

For reference:

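from collections import Counter, defaultdict

class MissingLetterModel:
    # The class name and the default values below are assumptions; the
    # original snippet shows only the two methods.
    def __init__(self, order=0, smoothing_missed=0.3, smoothing_total=1.0):
        self.order = order  # number of preceding characters used as context
        self.smoothing_missed = smoothing_missed
        self.smoothing_total = smoothing_total

    def fit(self, sentence_pairs):
        """ Estimate the probability of each symbol being missed
        Parameters:
            sentence_pairs - list of (original phrase, abbreviation)
        In the abbreviation, all missed symbols are replaced with "-",
        so both strings of a pair are aligned and have equal length.
        """
        self.missed_counter_ = defaultdict(Counter)
        self.total_counter_ = defaultdict(Counter)
        for original, observed in sentence_pairs:
            aligned = zip(original[self.order:], observed[self.order:])
            for i, (original_letter, observed_letter) in enumerate(aligned):
                # the `order` characters that precede original_letter
                context = original[i:(i + self.order)]
                if observed_letter == '-':
                    self.missed_counter_[context][original_letter] += 1
                self.total_counter_[context][original_letter] += 1
        return self

    def predict_proba(self, context, last_letter):
        """ Estimate the probability of last_letter being missed after context"""
        if self.order:
            local = context[-self.order:]
        else:
            # context[-0:] would return the whole string, so handle order 0 here
            local = ''
        missed_freq = self.missed_counter_[local][last_letter] + self.smoothing_missed
        total_freq = self.total_counter_[local][last_letter] + self.smoothing_total
        return missed_freq / total_freq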
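
The two methods above rely on an object that carries `order`, `smoothing_missed`, and `smoothing_total`, and on training pairs where the abbreviation is aligned with the original and every dropped character is written as "-". The following is a minimal sketch of how they might be used, not a definitive implementation: the class name `MissingLetterModel`, the default values, and the toy training pairs are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class MissingLetterModel:
    """Compact restatement of the two methods (name and defaults are assumed)."""

    def __init__(self, order=0, smoothing_missed=0.3, smoothing_total=1.0):
        self.order = order
        self.smoothing_missed = smoothing_missed
        self.smoothing_total = smoothing_total

    def fit(self, sentence_pairs):
        self.missed_counter_ = defaultdict(Counter)
        self.total_counter_ = defaultdict(Counter)
        for original, observed in sentence_pairs:
            aligned = zip(original[self.order:], observed[self.order:])
            for i, (original_letter, observed_letter) in enumerate(aligned):
                # the `order` characters that precede original_letter
                context = original[i:i + self.order]
                if observed_letter == '-':
                    self.missed_counter_[context][original_letter] += 1
                self.total_counter_[context][original_letter] += 1
        return self

    def predict_proba(self, context, last_letter):
        local = context[-self.order:] if self.order else ''
        missed = self.missed_counter_[local][last_letter] + self.smoothing_missed
        total = self.total_counter_[local][last_letter] + self.smoothing_total
        return missed / total


# Toy training data, invented for illustration: (original, aligned abbreviation)
pairs = [
    ("fairville", "fairvill-"),
    ("walmart", "w-lm-rt"),
    ("depot", "d-p-t"),
]

model = MissingLetterModel(order=1).fit(pairs)

p_a_after_w = model.predict_proba("w", "a")  # 'a' was dropped after 'w' once
p_l_after_a = model.predict_proba("a", "l")  # 'l' was kept after 'a' once
p_unseen = model.predict_proba("x", "y")     # never observed, smoothing only

print(p_a_after_w, p_l_after_a, p_unseen)
```

With these toy pairs, a letter that was dropped in training gets a higher estimated probability of being missed than one that was kept, and an unseen context falls back to the ratio of the two smoothing constants. Real use would need many aligned pairs drawn from the actual records.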
Answered 2018-11-28T07:31:52.103