2

Given two strings, I want to determine, in Python, which words were added and which were removed between them. I have looked at difflib, but apparently it can't do this.

For example: given 'hello my name is' and 'hello my guys is', it would return ['guys'] as the added words and ['name'] as the removed words. Thanks a lot.

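Edit: The example I gave may not have been the best. It should also work when there is no whitespace between the existing text and the new text. Perhaps I could use difflib to get all the new parts and then split them with the regex "\b". I'll give it a try.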

4

3 Answers

2

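The first thing to remember about Python is that it comes with "batteries included". That means you should look in the standard library for a tool that does what you need before reinventing it yourself.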

A more robust technique is to use difflib.SequenceMatcher to find the differences between the strings. Example:

import difflib

before = 'hello my name is'
after = 'hello my guys is'

def isjunk(string):
    "Return True if we don't care about this string"
    return string == ' '


s = difflib.SequenceMatcher(isjunk)
s.set_seqs(before, after)

for (
        opcode,
        before_start, before_end,
        after_start, after_end
) in s.get_opcodes():
    if opcode == 'equal':
        # We don't care.
        continue

    print("%7s '%s' -> '%s'" % (
        opcode,
        before[before_start:before_end],
        after[after_start:after_end],
    ))

This produces the following output, which can obviously be customized to do exactly what you need:

replace 'name' -> 'guys'
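
The example above compares the strings character by character. A minimal sketch of the same idea applied to whole words, which returns the two lists the question asks for, might look like the following. The function name `word_diff` and the choice to treat any run of `\w` characters as a word are assumptions made for illustration, not part of the answer; tokenizing with a regex rather than splitting on spaces is meant to cover the case from the question's edit, where old and new text are not separated by whitespace.

```python
import difflib
import re


def word_diff(before, after):
    """Return (added, removed) word lists. Hypothetical helper, a sketch only."""
    # Assumption: a "word" is any run of \w characters, so punctuation
    # and whitespace both act as separators.
    before_words = re.findall(r"\w+", before)
    after_words = re.findall(r"\w+", after)

    # SequenceMatcher accepts any sequences of hashable items, so it can
    # compare lists of words just as well as strings of characters.
    matcher = difflib.SequenceMatcher(None, before_words, after_words,
                                      autojunk=False)

    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(before_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(after_words[j1:j2])
    return added, removed


print(word_diff('hello my name is', 'hello my guys is'))
# (['guys'], ['name'])
```

Because the comparison is positional, a word that merely moves to a different place in the sentence can show up in both lists; whether that is desirable depends on what "added" and "removed" should mean for your data.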
answered 2012-04-09T16:01:06.610
0
before = "hello my name is"
after = "hello my  guy is test"


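# split() with no argument collapses runs of whitespace, so the double
# space in `after` does not produce an empty "word".
before = before.split()
after = after.split()

# Print every word of `after` that does not occur anywhere in `before`.
for item in after:
    if item not in before:
        print(item)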
answered 2012-04-09T14:39:55.817
0

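This isn't particularly pretty, but it seems to work for most cases I can think of. I'm sure it could be tidied up a lot, and it should be easy to make it case-insensitive.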

def freqs(words):
    """Count how many times each word occurs."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def added_and_removed(a, b):
    af = freqs(a.split())
    bf = freqs(b.split())

    removed = []
    added = []

    # A word is removed once for every occurrence in a that b lacks.
    for key in af:
        extra = af[key] - bf.get(key, 0)
        if extra > 0:
            removed.extend([key] * extra)

    # A word is added once for every occurrence in b that a lacks.
    for key in bf:
        extra = bf[key] - af.get(key, 0)
        if extra > 0:
            added.extend([key] * extra)

    return added, removed

a = 'hello hello hello my name is Dave dave bar foo'
b = 'hello my guys is test easy rob dave beef foo'

added, removed = added_and_removed(a, b)
print(added)
print(removed)

['guys', 'test', 'easy', 'rob', 'beef']
['hello', 'hello', 'name', 'Dave', 'bar']
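
As one possible way to tidy this up, the frequency bookkeeping can be handed to `collections.Counter` from the standard library. The sketch below is an illustration rather than the answer's own code: the function name `counter_diff` and the `ignore_case` flag are hypothetical, and the results are sorted only to make the output order predictable.

```python
from collections import Counter


def counter_diff(a, b, ignore_case=False):
    """Return (added, removed) word lists. Hypothetical helper, a sketch only."""
    if ignore_case:
        # Assumption: lowercasing both strings is an acceptable way to
        # compare words without regard to case.
        a, b = a.lower(), b.lower()

    before = Counter(a.split())
    after = Counter(b.split())

    # Subtracting Counters keeps only positive counts, so each difference
    # holds exactly the surplus occurrences on one side.
    added = sorted((after - before).elements())
    removed = sorted((before - after).elements())
    return added, removed


a = 'hello hello hello my name is Dave dave bar foo'
b = 'hello my guys is test easy rob dave beef foo'

print(counter_diff(a, b))
print(counter_diff(a, b, ignore_case=True))
```

With `ignore_case=True`, 'Dave' and 'dave' are counted as the same word, so the input above loses one 'dave' instead of losing 'Dave'.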
answered 2012-04-09T15:39:33.990