nlp - Match rows that contain a word with permutations

Question

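Suppose you have a large table with a varchar column.

How would you match the rows whose varchar column contains the word "preferred" when the data is somewhat noisy and occasionally contains spelling errors, for example: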

['$2.10 Cumulative Convertible Preffered Stock, $25 par value',
'5.95% Preferres Stock',
'Class A Preffered',
'Series A Peferred Shares',
'Series A Perferred Shares',
'Series A Prefered Stock',
'Series A Preffered Stock',
'Perfered',
'Preffered  C']

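The permutations of the word "preferred" in the misspellings above seem to show a family resemblance, but they have very little in common. Note that splitting every row into words and running levenshtein on each word in each row would be prohibitively expensive.

UPDATE:

There are several other examples like this, e.g. with "restricted":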

['Resticted Stock Plan',
'resticted securities',
'Ristricted Common Stock',
'Common stock (restrticted, subject to vesting)',
'Common Stock (Retricted)',
'Restircted Stock Award',
'Restriced Common Stock',]

score 1 · Accepted Answer

I would probably do something like this, if you can get away with running Levenshtein just once. Here is an amazing spell checker implementation by Peter Norvig:

import re, collections

def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# file() was removed in Python 3; open() works in both Python 2 and 3.
with open('big.txt') as f:
    NWORDS = train(words(f.read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in s if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
   inserts    = [a + c + b     for a, b in s for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

He provides a training set here: http://norvig.com/big.txt. Here is some sample output:

>>> correct('prefferred')
'preferred'
>>> correct('ristricted')
'restricted'
>>> correct('ristrickted')
'restricted'

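In your case, you could copy the original column into a new column, running it through the spell checker as you do so. Then put a fulltext index on the correctly spelled column and match your queries against it, but return results from the original column. You only have to do this once, rather than computing distances every time. You could also spell-check the input, or check the corrected version only as a fallback. Either way, the Norvig example is worth studying.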
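A minimal sketch of the "correct once, query the corrected column" idea, using Python's built-in sqlite3 module. Several things here are assumptions made for illustration: the table and column names (`securities`, `title`, `title_corrected`), the tiny hand-written `VOCAB` that stands in for word counts trained on big.txt, and the plain `LIKE` query that stands in for a real fulltext index.

```python
import re
import sqlite3

# Hypothetical tiny vocabulary; a real setup would use counts trained on big.txt.
VOCAB = {"preferred", "restricted", "stock", "series", "shares", "class", "common"}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in ALPHABET if b]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return a vocabulary word within two edits, or the word unchanged."""
    if word in VOCAB:
        return word
    one_edit = sorted(edits1(word) & VOCAB)
    if one_edit:
        return one_edit[0]
    two_edits = sorted({e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in VOCAB})
    return two_edits[0] if two_edits else word

def correct_text(text):
    """Lowercase the text and spell-correct every alphabetic token in it."""
    return re.sub(r"[a-z]+", lambda m: correct(m.group()), text.lower())

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE securities (id INTEGER PRIMARY KEY, title TEXT, title_corrected TEXT)"
)
rows = [
    "Series A Perferred Shares",
    "Class A Preffered",
    "Restircted Common Stock",
    "Common Stock",
]
# The expensive correction runs once per row, at load time.
conn.executemany(
    "INSERT INTO securities (title, title_corrected) VALUES (?, ?)",
    [(row, correct_text(row)) for row in rows],
)

def find(word):
    """Match against the corrected column, return the original text."""
    cursor = conn.execute(
        "SELECT title FROM securities WHERE title_corrected LIKE ? ORDER BY id",
        ("%" + word + "%",),
    )
    return [row[0] for row in cursor]

print(find("preferred"))
print(find("restricted"))
```

Because this vocabulary carries no frequencies, ties between candidates are broken alphabetically; Norvig's version picks the most frequent candidate instead.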

score 1 · Accepted Answer

Could you try training on a small sample of the table to find the likely misspellings (using split + Levenshtein), and then use the resulting word list on the whole table?
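One way this suggestion might look in code, as a rough sketch: compute Levenshtein distances only on a small sample to collect the misspelled variants, then scan the full table with cheap set lookups. The function names, the `max_distance=2` threshold, and the sample rows are all assumptions chosen for illustration.

```python
import re

def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def tokens(row):
    """Lowercased alphabetic tokens of a row."""
    return re.findall(r"[a-z]+", row.lower())

def learn_variants(sample_rows, target, max_distance=2):
    """Expensive step, run on the sample only: collect tokens close to target."""
    return {tok for row in sample_rows for tok in tokens(row)
            if levenshtein(tok, target) <= max_distance}

def matching_rows(rows, variants):
    """Cheap step, run on the whole table: exact set membership per token."""
    return [row for row in rows if variants & set(tokens(row))]

sample = ["Class A Preffered", "Series A Prefered Stock", "Common Stock"]
variants = learn_variants(sample, "preferred")

table = ["Preffered  C", "Series A Perferred Shares",
         "Series A Prefered Stock", "Ordinary Shares"]
print(sorted(variants))
print(matching_rows(table, variants))
```

Note the limitation this makes visible: "Perferred" never appeared in the sample, so that row is missed. The sample has to be large enough to cover the variants you care about.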

score 1 · Accepted Answer

Are you trying to do this in TSQL, or in what language?

You can probably hit most of them with a regular expression.

Some variation of the following:

"p(er|re|e)f{1,2}er{1,2}ed"

"r(e|i)s?t(ri|ir|rti|i)ct?ed"

You will want to make sure it is not case-sensitive...
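As a quick check, the two patterns above can be run against the sample rows from the question with Python's `re` module, using `re.IGNORECASE` for the case-insensitivity. The helper name `unmatched` is made up for this sketch.

```python
import re

PREFERRED = re.compile(r"p(er|re|e)f{1,2}er{1,2}ed", re.IGNORECASE)
RESTRICTED = re.compile(r"r(e|i)s?t(ri|ir|rti|i)ct?ed", re.IGNORECASE)

preferred_rows = [
    "$2.10 Cumulative Convertible Preffered Stock, $25 par value",
    "5.95% Preferres Stock",
    "Class A Preffered",
    "Series A Peferred Shares",
    "Series A Perferred Shares",
    "Series A Prefered Stock",
    "Series A Preffered Stock",
    "Perfered",
    "Preffered  C",
]
restricted_rows = [
    "Resticted Stock Plan",
    "resticted securities",
    "Ristricted Common Stock",
    "Common stock (restrticted, subject to vesting)",
    "Common Stock (Retricted)",
    "Restircted Stock Award",
    "Restriced Common Stock",
]

def unmatched(pattern, rows):
    """Rows in which the pattern is not found anywhere."""
    return [row for row in rows if not pattern.search(row)]

print(unmatched(PREFERRED, preferred_rows))    # ['5.95% Preferres Stock']
print(unmatched(RESTRICTED, restricted_rows))  # []
```

The one miss, "Preferres", ends in "es" rather than "ed", which is consistent with the claim that a regex catches most, not all, of the variants.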

score 1 · Accepted Answer

Create two more tables, spelling and possible_spelling:

-- you can figure out the types

create table spelling ( id, word ) ; 
create table possible_spelling 
( id, spelling_id references spelling(id), spelling ) 
-- possible spelling also includes the correct spelling
-- all values are lowercase

insert into spelling( word ) values ('preferred');
insert into possible_spelling( spelling_id, spelling ) 
 select 1, '%preferred%' union select 1, '%prefered%' union ....;

select * 
from bigtable a 
join possible_spelling b
on (lower(a.data) like b.spelling )
join spelling c on (b.spelling_id = c.id) 
where c.word = 'preferred';

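Objection: this will be slow, and it requires setup. Answer: it is not that slow, and this should be a one-time job to classify and fix your data. Set it up once, then classify each incoming row once as it arrives.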
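The schema above leaves the column types open, so here is one possible runnable version using Python's built-in sqlite3 module. The types, the sample rows in `bigtable`, and the particular list of patterns are assumptions for illustration; the join itself follows the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE spelling (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE possible_spelling (
        id INTEGER PRIMARY KEY,
        spelling_id INTEGER REFERENCES spelling(id),
        spelling TEXT  -- lowercase LIKE pattern; includes the correct spelling
    );
    CREATE TABLE bigtable (id INTEGER PRIMARY KEY, data TEXT);
""")

conn.execute("INSERT INTO spelling (id, word) VALUES (1, 'preferred')")
patterns = ["%preferred%", "%prefered%", "%preffered%", "%perferred%"]
conn.executemany(
    "INSERT INTO possible_spelling (spelling_id, spelling) VALUES (1, ?)",
    [(p,) for p in patterns],
)
conn.executemany(
    "INSERT INTO bigtable (data) VALUES (?)",
    [("Series A Prefered Stock",), ("Class A Preffered",),
     ("Common Stock",), ("Series A Perferred Shares",)],
)

def rows_for(word):
    """Rows of bigtable matching any known spelling of the canonical word."""
    cursor = conn.execute(
        """
        SELECT DISTINCT a.id, a.data
        FROM bigtable a
        JOIN possible_spelling b ON lower(a.data) LIKE b.spelling
        JOIN spelling c ON b.spelling_id = c.id
        WHERE c.word = ?
        ORDER BY a.id
        """,
        (word,),
    )
    return [data for _id, data in cursor]

print(rows_for("preferred"))
```

The `DISTINCT` guards against a row being returned twice when it happens to match more than one pattern.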
