Given the following, I can find the longest common substring:
s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"
def longest_common_substring(s1, s2):
    # m[x][y] is the length of the common substring ending at s1[x-1] and s2[y-1]
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    # remember the length and end position of the best match so far
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]
print(longest_common_substring(s1, s2))
[out]:
foo bar
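But how do I ensure that the longest common substring respects English word boundaries and doesn't cut a word apart? For example, take the following sentences: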
s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print(longest_common_substring(s1, s2))
This outputs the following, which is NOT desired because it breaks up the word kappa from s2:
a foo bar
The desired output is still:
foo bar
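I've also tried an ngram approach to get the longest common substring that respects word boundaries, but is there another way to handle the strings without computing ngrams? (see answer)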
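
For illustration, here is a minimal sketch of one possibility that avoids ngrams: split both sentences on whitespace and run the same dynamic-programming table over lists of words instead of characters, so a match can never start or end inside a word. The function name longest_common_word_substring and the whitespace tokenization are assumptions made for this sketch, not part of the code above, and punctuation attached to a word would be treated as part of that word.

```python
def longest_common_word_substring(s1, s2):
    # Hypothetical helper: assumes words are separated by whitespace.
    w1, w2 = s1.split(), s2.split()
    # m[x][y] is the length, in words, of the common run ending at w1[x-1] and w2[y-1]
    m = [[0] * (1 + len(w2)) for _ in range(1 + len(w1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(w1)):
        for y in range(1, 1 + len(w2)):
            if w1[x - 1] == w2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            # non-matching cells keep their initial value of 0
    # Rejoin with single spaces; original spacing is not preserved.
    return " ".join(w1[x_longest - longest:x_longest])

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print(longest_common_word_substring(s1, s2))  # foo bar
```

Because whole words are compared, the "a" at the end of kappa can no longer match the standalone word "a" in s1. The table now has one cell per pair of words rather than per pair of characters, so it is also smaller than the character-level version.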