3

Given the following, I can find the longest common substring:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

print longest_common_substring(s1, s2)

[out]:

foo bar

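But how do I make sure that the longest common substring respects English word boundaries and does not cut a word in half? For example, take the following sentences: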

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)

This outputs the following, which is NOT desired because it breaks up the word kappa from s2:

a foo bar

The desired output is still:

foo bar

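I have also tried an ngram approach to get the longest common substring that respects word boundaries, but is there another way to handle the strings without computing ngrams? (see answer)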

4

9 Answers

9

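This is simpler than it looks. I used your code to do 75% of the job. I first split each sentence into words and then pass the word lists to your function to get the longest common substring (in this case, the longest run of consecutive words). Your function then gives me ['foo', 'bar'], and I join the elements of that list to produce the desired result.

Here is a working copy online for you to test, verify and fiddle with.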

http://repl.it/RU0/1

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def longest_common_sentence(s1, s2):
    s1_words = s1.split(' ')
    s2_words = s2.split(' ')  
    return ' '.join(longest_common_substring(s1_words, s2_words))


s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print common_sentence
>> 'foo bar'
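
The split-then-match idea does not depend on the hand-written table. As a rough sketch (written for Python 3 and not part of the original answer), the standard library's difflib.SequenceMatcher can find the longest run of matching words when it is given two lists of words instead of two strings. The name longest_common_words is only an illustrative choice.

```python
from difflib import SequenceMatcher


def longest_common_words(s1, s2):
    """Longest run of whole words shared by s1 and s2, joined by spaces."""
    words1 = s1.split()
    words2 = s2.split()
    # autojunk=False turns off the heuristic that ignores very frequent items
    matcher = SequenceMatcher(None, words1, words2, autojunk=False)
    match = matcher.find_longest_match(0, len(words1), 0, len(words2))
    return ' '.join(words1[match.a:match.a + match.size])


s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
print(longest_common_words(s1, s2))  # foo bar
```

Because the items being compared are words, a match can never begin or end inside a word.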

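Edge cases

  1. '.' and '?' are also treated as valid words if there is a space between the last word and the punctuation mark. If you do not leave a space, they are counted as part of the last word. In that case 'sheep' and 'sheep?' are no longer the same word. It is up to you to decide how to handle such characters before calling the function. In this case, for example, you could strip them: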

    import re
    s1 = re.sub('[.?]','', s1)
    s2 = re.sub('[.?]','', s2)

and then carry on as usual.
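
As one possible way of handling such characters without deleting them, the sketch below (Python 3; tokenize is a hypothetical helper, not part of the answer above) uses a regular expression to split punctuation off into separate tokens. With it, 'sheep' comes out as the same word whether or not a space precedes the punctuation mark.

```python
import re


def tokenize(sentence):
    """Split a sentence into words and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores (a word);
    # [^\w\s] matches one character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", sentence)


print(tokenize('black sheep ?'))  # ['black', 'sheep', '?']
print(tokenize('black sheep?'))   # ['black', 'sheep', '?']
```

The resulting lists could be passed to longest_common_substring in place of s1.split(' ') and s2.split(' ').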

answered 2014-04-14T16:29:43.070
1

Just add an acceptance condition to your code:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest and word_aligned(x, y, m[x][y]):  # acceptance condition
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def word_aligned(x, y, length):
    """check that a match starting at s1[x - 1] and s2[y - 1] is aligned on a word boundary"""
    # check start of match in s1
    if s1[x - 1].isspace():
        # match doesn't start with a character, reject
        return False
    if x - 2 > 0 and not s1[x - 2].isspace():
        # char before match is not start of line or space, reject
        return False
    # check start of match in s2
    ... same as above ...
    # check end of match in s1
    ... your code is a bit hard for me to follow, what is the end of the match? ...
    # check end of match in s2
    ... same as above ...
    return True

print longest_common_substring(s1, s2)
answered 2014-04-14T19:44:01.533
1

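This is a more interesting problem than I initially gave it credit for. When you think about it, there are 4 possible outcomes.

  1. Trivial case: the whole string is matched and no boundary is crossed (your first example)
  2. A word boundary is crossed at the beginning (your second example)
  3. A word boundary is crossed at the end
  4. A word boundary is crossed at each end

Now your code handles the trivial case, so we can leverage that; all that is left is to wrap your result in a few checks for the other cases. So what should these checks look like? Let's take your failure case: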

string 1 = "this is a foo bar sentence ."
string 2 = "what a kappa foo bar black sheep ?"
output string = "a foo bar"

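So from a string find perspective, all of those letters can be found in that order in both string1 and string2, but if we split everything around spaces into lists and look for the lists in order, only string1 will match.

Now I'm primarily a C guy, so I want to write this in a function: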

def full_string(str1, str2, chkstr):
  l1 = str1.split()
  l2 = str2.split()
  chkl = chkstr.split()
  return (any(l1[i:i+len(chkl)]==chkl for i in xrange(len(l1)-len(chkl)+1)) and
          any(l2[i:i+len(chkl)]==chkl for i in xrange(len(l2)-len(chkl)+1)))

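Using this function we can check whether either of the two strings fails to contain, in order, all of the words of the result of longest_common_substring(s1, s2). Perfect. So the last step is to combine these two functions and check for each of the 4 cases listed above: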

def longest_whole_substring(s1, s2):
  subs = longest_common_substring(s1, s2)
  if not full_string(s1, s2, subs):
    if full_string(s1, s2, ' '.join(subs.split()[1:])):
      subs = ' '.join(subs.split()[1:])
    elif full_string(s1, s2, ' '.join(subs.split()[:-1])):
      subs = ' '.join(subs.split()[:-1])
    else:
      subs = ' '.join(subs.split()[1:-1])
  return subs.strip()  # drop any space left over from the character-level match

So now the function longest_whole_substring(s1, s2) will provide the longest whole substring without chopping off any words. Let's test it on each of the cases:

Trivial:

>>> a = 'this is a foo bar bar foo string'
>>> b = 'foo bar'
>>> 
>>> longest_whole_substring(a,b)
'foo bar'

Word boundary at the beginning:

>>> b = 's a foo bar'
>>> 
>>> longest_whole_substring(a,b)
'a foo bar'

Word boundary at the end:

>>> b = 'foo bar f'
>>> 
>>> longest_whole_substring(a,b)
'foo bar'

And word boundaries at both ends:

>>> b = 's a foo bar f'
>>> 
>>> longest_whole_substring(a,b)
'a foo bar'

Looks good!

answered 2014-04-17T03:12:50.660
1

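All you need to do is add checks for the beginning and the end of a word.

You then let a match start in m only at a valid word beginning, and update longest only at a valid match end.

Like this: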

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    # current character in s1
    x_char = s1[x - 1]
    # we are at the beginning of a word in s1 if
    #   (we are at the beginning of s1) or 
    #   (previous character is a space)
    x_word_begin = (x == 1) or (s1[x - 2] == " ")
    # we are at the end of a word in s1 if
    #   (we are at the end of s1) or 
    #   (next character is a space)
    x_word_end = (x == len(s1)) or (s1[x] == " ")
    for y in xrange(1, 1 + len(s2)):
      # current character in s2
      y_char = s2[y - 1]
      # we are at the beginning of a word in s2 if
      #   (we are at the beginning of s2) or 
      #   (previous character is a space)
      y_word_begin = (y == 1) or (s2[y - 2] == " ")
      # we are at the end of a word in s2 if
      #   (we are at the end of s2) or 
      #   (next character is a space)
      y_word_end = (y == len(s2)) or (s2[y] == " ")
      if x_char == y_char:
        # no match starting with x_char
        if m[x - 1][y - 1] == 0:
          # a match can start only with a space
          #   or at the beginning of a word
          if x_char == " " or (x_word_begin and y_word_begin):
              m[x][y] = m[x - 1][y - 1] + 1
        else:
          m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          # the match can end only with a space
          #   or at the end of a word
          if x_char == " " or (x_word_end and y_word_end):
            longest = m[x][y]
            x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]
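answered 2014-04-17T12:08:16.947
1

My answer does not draw on any official source; it is just a simple observation: at least in my installation, the output of your LCS function differs between the pairs (s1, s2) and (s1, s3):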

In [1]: s1 = "this is a foo bar sentence ."

In [3]: s2 = "what the foo bar blah blah black sheep is doing ?"

In [4]: s3 = "what a kappa foo bar black sheep ?"

In [12]: longest_common_substring(s1, s3)
Out[12]: 'a foo bar '

In [13]: longest_common_substring(s1, s2)
Out[13]: ' foo bar '

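As you may notice, if complete words are matched, then the surrounding whitespace is matched as well.

You can then modify the function just before it returns its output, like this: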

answer = s1[x_longest - longest: x_longest]
if not (answer.startswith(" ") and answer.endswith(" ")):
    return longest_common_substring(s1, answer[1:])
else:
    return answer

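I'm sure there are other edge cases, such as the substring appearing at the end of the string, whether to call the function recursively with s1 or s2, whether to trim the answer at the front or the back, and so on. But at least in the cases you show, this simple modification does what you want: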

In [20]: longest_common_substring(s1, s3)
Out[20]: ' foo bar '

Do you think this direction is worth exploring?

answered 2014-04-14T16:12:34.537
1

I did it recursively:

def common_phrase(longer, shorter):
    """ recursively find longest common substring, consists of whole words only and in the same order """
    # pad with spaces so that `shorter` can only match on word boundaries
    if ' ' + shorter + ' ' in ' ' + longer + ' ':
        return shorter
    elif len(shorter.split()) > 1:
        common_phrase_without_last_word = common_phrase(longer, shorter.rsplit(' ', 1)[0])
        common_phrase_without_first_word = common_phrase(longer, shorter.split(' ', 1)[1])
        without_first_is_longer = len(common_phrase_without_last_word) < len(common_phrase_without_first_word)

        # multiplying a string by False/True gives '' or the string itself
        return ((not without_first_is_longer) * common_phrase_without_last_word +
                without_first_is_longer * common_phrase_without_first_word)
    else:
        return ''

Just classify the two strings as 'shorter' and 'longer' before applying it:

if len(str1) > len(str2):
    longer, shorter = str1, str2 
else:
    longer, shorter = str2, str1
answered 2015-07-20T10:10:57.950
0

Here is an ngram way:

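from itertools import chain

def ngrams(text, n):
  # every run of n consecutive items, including the last one
  return [text[i:i+n] for i in xrange(len(text)-n+1)]

def longest_common_ngram(s1, s2):
  # all word n-grams of each sentence, for n = 1 .. number of words
  s1ngrams = list(chain(*[[" ".join(j) for j in ngrams(s1.split(), i)]
                          for i in range(1, len(s1.split())+1)]))
  s2ngrams = list(chain(*[[" ".join(j) for j in ngrams(s2.split(), i)]
                          for i in range(1, len(s2.split())+1)]))

  # n-grams present in both sentences (all of them, not only the longest)
  return set(s1ngrams).intersection(set(s2ngrams))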
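
The function above returns every shared n-gram. A minimal sketch of how the longest one could be picked out directly, assuming Python 3 and plain whitespace tokenisation (word_ngrams and longest_common_phrase are illustrative names, not part of the answer):

```python
def word_ngrams(words, n):
    """All runs of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def longest_common_phrase(s1, s2):
    """Longest word n-gram found in both sentences ('' if there is none)."""
    words1, words2 = s1.split(), s2.split()
    # try the largest n first, so the first hit is the longest
    for n in range(min(len(words1), len(words2)), 0, -1):
        seen = set(word_ngrams(words2, n))
        for gram in word_ngrams(words1, n):
            if gram in seen:
                return ' '.join(gram)
    return ''


print(longest_common_phrase('this is a foo bar sentence .',
                            'what a kappa foo bar black sheep ?'))  # foo bar
```

Searching from the largest n downwards avoids building the n-grams of every size when a long common phrase exists.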
answered 2014-03-29T02:57:40.853
0

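An efficient way to find longest common substrings is a suffix tree (see http://en.wikipedia.org/wiki/Suffix_tree and http://en.wikipedia.org/wiki/Longest_common_substring_problem). I don't see any reason why you couldn't build a suffix tree using words instead of characters, in which case the longest common substring extracted from the tree would respect token boundaries. This approach would be particularly efficient if you want to find common substrings between one fixed string and a large number of other strings.

For a list of Python suffix tree implementations, see the accepted answer to "Python: library for generalized suffix trees".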
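
No particular suffix tree library is assumed here. As a rough sketch of the words-as-symbols idea, the code below (Python 3) builds a suffix automaton, which is a related but different structure from a suffix tree, over the words of one fixed sentence and then scans other sentences against it. The class name WordSuffixAutomaton and its method names are hypothetical, chosen only for this illustration.

```python
class WordSuffixAutomaton:
    """Suffix automaton whose symbols are whole words (illustrative class)."""

    def __init__(self, sentence):
        self.trans = [{}]   # trans[state][word] -> next state
        self.link = [-1]    # suffix link of each state
        self.length = [0]   # longest phrase length reaching each state
        last = 0
        for word in sentence.split():
            last = self._extend(last, word)

    def _new_state(self, trans, link, length):
        self.trans.append(trans)
        self.link.append(link)
        self.length.append(length)
        return len(self.trans) - 1

    def _extend(self, last, word):
        """Append one word to the automaton and return the new last state."""
        cur = self._new_state({}, -1, self.length[last] + 1)
        p = last
        while p != -1 and word not in self.trans[p]:
            self.trans[p][word] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
            return cur
        q = self.trans[p][word]
        if self.length[p] + 1 == self.length[q]:
            self.link[cur] = q
            return cur
        # split q: the clone takes over the shorter phrases that reached q
        clone = self._new_state(dict(self.trans[q]), self.link[q],
                                self.length[p] + 1)
        while p != -1 and self.trans[p].get(word) == q:
            self.trans[p][word] = clone
            p = self.link[p]
        self.link[q] = clone
        self.link[cur] = clone
        return cur

    def longest_common_phrase(self, other):
        """Longest run of words of `other` that also occurs in the fixed sentence."""
        words = other.split()
        state = cur_len = best_len = best_end = 0
        for i, word in enumerate(words):
            # follow suffix links until the current match can be extended
            while state != 0 and word not in self.trans[state]:
                state = self.link[state]
                cur_len = self.length[state]
            if word in self.trans[state]:
                state = self.trans[state][word]
                cur_len += 1
            else:
                cur_len = 0
            if cur_len > best_len:
                best_len, best_end = cur_len, i + 1
        return ' '.join(words[best_end - best_len:best_end])


fixed = WordSuffixAutomaton('this is a foo bar sentence .')
for other in ('what the foo bar blah blah black sheep is doing ?',
              'what a kappa foo bar black sheep ?'):
    print(fixed.longest_common_phrase(other))  # foo bar (both times)
```

The automaton is built once for the fixed sentence, and each further sentence is then scanned in a single pass over its words, which is the situation the answer describes as the most favourable one.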

answered 2014-04-16T19:16:18.593
0
from difflib import SequenceMatcher
def longest_substring(str1, str2):
    # initialize SequenceMatcher object with
    # input string
    # below logic is to make sure word does not get cut
    str1 = " " + str1.strip() + " "
    str2 = " " + str2.strip() + " "
    seq_match = SequenceMatcher(None, str1, str2)

    # find match of longest sub-string
    # output will be like Match(a=0, b=0, size=5)
    match = seq_match.find_longest_match(0, len(str1), 0, len(str2))

    # return longest substring
    if match.size != 0:
        lm = str1[match.a: match.a + match.size]
        # below logic is to make sure word does not get cut
        if not lm.startswith(" "):
            while not (lm.startswith(" ") or len(lm) == 0):
                lm = lm[1:]
        if not lm.endswith(" "):
            while not (lm.endswith(" ") or len(lm) == 0):
                lm = lm[:-1]
        return lm.strip()
    else:
        return ""
answered 2020-09-05T10:28:25.220