
I need to split a string into words so that every word comes from a dictionary, and the split must prefer the longest possible word from the left. So:

thisisinsane => this is insane (correct, as the longest possible words from the left)
thisisinsane => this is in sane (wrong)
Assuming 'this', 'is', 'in', 'insane' are all words in the dictionary.

I managed to solve this by traversing the string and matching the longest word from the left at each position. But then inputs like the following started to break it:

shareasale => share as ale (wrong, as 'ale' is not in the dictionary)
shareasale => share a sale (correct)
Assuming 'share', 'a', 'sale' are all words in the dictionary, unlike 'ale'.
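For illustration, a minimal sketch (my own naming, not from my actual code) of that greedy longest-match-from-the-left strategy, showing exactly where it gets stuck:

```python
def greedy_longest(s, words):
    """Repeatedly take the longest dictionary word that prefixes the
    remaining string; return None when no prefix matches (stuck)."""
    out = []
    while s:
        match = next((s[:k] for k in range(len(s), 0, -1) if s[:k] in words),
                     None)
        if match is None:
            return None  # an earlier greedy choice was wrong
        out.append(match)
        s = s[len(match):]
    return out

words = {'share', 'as', 'a', 'sale'}
print(greedy_longest('shareasale', words))  # None: picks 'share', 'as', 'a', then sticks on 'le'
```

Greedy commits to 'as' after 'share' and can never recover, which is the backtracking problem described below.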

I tried to solve this by taking the valid segments found before the error was encountered, i.e.

shareasale => 'share' and 'as' (error = 'ale')

and removing them from the dictionary one at a time, then solving the problem again. So:

shareasale => no valid segments when 'share' is removed
shareasale => share a sale (when 'as' is removed from the dictionary)

So I managed to solve that case as well. But then I couldn't solve this one:

asignas => 'as' ( error = 'ignas')

My solution then removes 'as' from the dictionary and tries to solve it:

asignas => 'a' 'sign' (error = 'as')

because 'as' has already been removed from the dictionary in the new recursive call. The function I wrote is at this link. I hope someone can go through it and help me find a better algorithm to solve this, or else suggest modifications to my existing algorithm.


4 Answers


Just do a recursive scan: at each step, either add the next letter to the last word, or (if the last word is in the dictionary) try starting a new word with it. That means in each call you have one or two hypotheses to test (space or no space). When you reach the end of the input and have a set of valid words, save this solution if its first word is longer than that of the best solution found so far.

Sample code:

words=['share','as','a','sale','bla','other','any','sha','sh']
wdict={}
best=[]

def scan(input,i,prevarr):
    global best
    arr=list(prevarr)
    # If array is empty, we automatically add first letter
    if len(arr)<1:
        arr.append(input[0:1])
        return scan(input,i+1,arr)
    # If no more input is available, evaluate the solution
    if i>=len(input):
        # Is the last word a valid word
        if arr[-1] in wdict:
            # Is there a current best solution?
            if len(best)==0:
                best=arr      # No current solution so select this one
            elif len(arr[0])>len(best[0]):
                best=arr      # If new solution has a longer first word
        return best
    # If the last word in the sequence is a valid word, we can add a space and try
    if arr[-1] in wdict:
        arr.append(input[i:i+1])
        scan(input,i+1,arr)
        del arr[-1]
    # Add a letter to the last word and recurse
    arr[-1]=arr[-1]+input[i:i+1]
    return scan(input,i+1,arr)

def main():
    for w in words:
        wdict[w]=True
    res=scan('shareasasale',0,[])
    print(res)

if __name__ == '__main__':
    main()
Answered 2013-05-22T10:52:05.943

Essentially, your problem is a tree problem: at each level, all the words that form a prefix of the remaining string create the branches. A branch that consumes the string without leaving any part behind is a correct solution.

                      thisisinsane
                          |
                          |
                     (this)isinsane
                     /            \
                    /              \
          (this,i)sinsane         (this,is)insane
              /                     /           \
             /                     /             \
  (this,i,sin)ane          (this,is,in)sane    (this,is,insane)
                                /
                               /
                       (this,is,in,sane)

So in this example there are two solutions, but we want to pick the one that uses the longest words, i.e. we want to explore the tree from the right using a depth-first search strategy.

So our algorithm should:

  1. Sort the dictionary by length, in descending order.
  2. Find all prefixes of the current string. If there are none, return False.
  3. Set prefix to the longest unexplored prefix.
  4. Remove it from the string. If the string is empty, we have found a solution; return the list of all prefixes.
  5. Recurse to 2.
  6. This branch has failed; return False.

A sample implementation of this solution:

def segment(string,wset):
    """Segments a string into words prefering longer words givens
    a dictionary wset."""
    # Sort wset in decreasing string order
    wset.sort(key=len, reverse=True)
    result = tokenize(string, wset, "")
    if result:
        result.pop() # Remove the empty string token
        result.reverse() # Put the list into correct order
        return result
    else:
        raise Exception("No possible segmentation!")

def tokenize(string, wset, token):
    """Returns either false if the string can't be segmented by 
    the current wset or a list of words that segment the string
    in reverse order."""
    # Are we done yet?
    if string == "":
        return [token]
    # Find all possible prefixes
    for pref in wset:
        if string.startswith(pref):
            res = tokenize(string[len(pref):], wset, pref)
            if res:
                res.append(token)
                return res
    # Not possible
    return False

print segment("thisisinsane", ['this', 'is', 'in', 'insane']) # this is insane
print segment("shareasale", ['share', 'a', 'sale', 'as'])     # share a sale
print segment("asignas", ['as', 'sign', 'a'])                 # a sign as
Answered 2013-05-22T11:50:11.980

The algorithm I would use is something like:

  • Have your dictionary sorted by decreasing length.
  • Build a function prefix(s) that yields, in that order, the dictionary words that s starts with.
    • I.e. prefix("informativecow") yields "informative" first, then "inform", then "info", then "in".
  • Build a recursive function test(s):
    • If s is the empty string, return True.
    • For each prefix, remove the prefix from s and call test() with the resulting string. If you find one that is true, return True. If you don't find any, return False.
Answered 2013-05-22T10:59:39.447

For a real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables, but it illustrates the memoization approach.

In particular, `score` illustrates the segmentation approach:

def score(word, prev=None):
    "Score a `word` in the context of the previous word, `prev`."

    if prev is None:
        if word in unigram_counts:

            # Probability of the given word.

            return unigram_counts[word] / 1024908267229.0
        else:
            # Penalize words not found in the unigrams according
            # to their length, a crucial heuristic.

            return 10.0 / (1024908267229.0 * 10 ** len(word))
    else:
        bigram = '{0} {1}'.format(prev, word)

        if bigram in bigram_counts and prev in unigram_counts:

            # Conditional probability of the word given the previous
            # word. The technical name is *stupid backoff* and it's
            # not a probability distribution but it works well in
            # practice.

            return bigram_counts[bigram] / 1024908267229.0 / score(prev)
        else:
            # Fall back to using the unigram probability.

            return score(word)

If you replaced the score function so that it returns higher scores for longer matches from your dictionary, it would do what you want.
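As a self-contained illustration of that idea (this is not the wordsegment API; `segment_best` and its scoring are my own sketch), here is a tiny dynamic-programming segmenter whose score penalizes non-dictionary words and charges a fixed cost per word, so segmentations with fewer, longer words win:

```python
def segment_best(text, words, max_len=20):
    """Split text to maximize the product of per-word scores."""
    def score(w):
        # Dictionary words score by length; every word also costs a
        # factor of 100, so fewer/longer words win; non-words are
        # penalized according to their length.
        base = 10.0 ** len(w) if w in words else 10.0 ** -len(w)
        return base / 100.0
    # best[i] = (score, segmentation) for text[:i]
    best = [(1.0, [])] + [None] * len(text)
    for i in range(1, len(text) + 1):
        candidates = [(best[j][0] * score(text[j:i]), best[j][1] + [text[j:i]])
                      for j in range(max(0, i - max_len), i)]
        best[i] = max(candidates, key=lambda c: c[0])
    return best[-1][1]

print(segment_best('shareasale', {'share', 'a', 'sale', 'as'}))  # ['share', 'a', 'sale']
```

Unlike the greedy approach in the question, this considers every split point, so a bad early choice like 'as' can never block the globally best segmentation.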

Answered 2015-09-02T23:13:56.757