python - Finding a repeated pattern in a list of strings

Question

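I am looking for a way to strip the longest repeated pattern from a list of strings.

I have a list of about 1000 web page titles, and they all share a common suffix, which is the website name.

They follow this pattern: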

['art gallery - museum and visits | expand knowledge',
 'lasergame - entertainment | expand knowledge',
 'coffee shop - confort and food | expand knowledge',
 ...
]

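How can I automatically strip the common suffix " | expand knowledge" from all of these strings?

Thanks!

Edit: sorry, I did not make myself clear enough. I have no information about the " | expand knowledge" suffix in advance. I want to be able to strip a potential common suffix from a list of strings, even if I do not know what that suffix is.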

score 4 · Accepted Answer

Here is a solution that applies the os.path.commonprefix function to the reversed titles:

import os

titles = ['art gallery - museum and visits | expand knowledge',
 'lasergame - entertainment | expand knowledge',
 'coffee shop - confort and food | expand knowledge',
]

# Find the longest common suffix by reversing the strings and using a
# library function to find the common "prefix".
common_suffix = os.path.commonprefix([title[::-1] for title in titles])[::-1]

# Strip the common suffix from every title. Slicing with an explicit end
# index keeps the titles intact when the common suffix is empty.
stripped_titles = [title[:len(title) - len(common_suffix)] for title in titles]

Result:

['art gallery - museum and visits', 'lasergame - entertainment', 'coffee shop - confort and food']

Because it finds the common suffix on its own, it should work on any group of titles, even if you do not know the suffix beforehand.
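
The same idea can be wrapped in a small function. This is only a minimal sketch, assuming Python 3; the name strip_common_suffix is hypothetical and not part of the answer above. It slices with an explicit end index so that titles with no shared suffix are returned unchanged.

```python
import os


def strip_common_suffix(titles):
    """Return the titles with their longest common suffix removed.

    Hypothetical helper for illustration only.
    """
    if not titles:
        return []
    # commonprefix compares plain strings character by character, so
    # feeding it reversed titles yields the reversed common suffix.
    reversed_titles = [title[::-1] for title in titles]
    common_suffix = os.path.commonprefix(reversed_titles)[::-1]
    # An explicit end index avoids title[:-0], which would return ''.
    return [title[:len(title) - len(common_suffix)] for title in titles]


titles = [
    'art gallery - museum and visits | expand knowledge',
    'lasergame - entertainment | expand knowledge',
    'coffee shop - confort and food | expand knowledge',
]
print(strip_common_suffix(titles))
```

Note that with a single title the whole title counts as the common suffix, so the result is an empty string. The approach also removes any trailing characters the titles happen to share beyond the site name.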

score 1 · Accepted Answer

If you do know the suffix you want to remove, you can simply do the following:

suffix = " | expand knowledge"

your_list = ['art gallery - museum and visits | expand knowledge',
 'lasergame - entertainment | expand knowledge',
 'coffee shop - confort and food | expand knowledge',
...]

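# str.removesuffix (Python 3.9+) removes the literal suffix; str.rstrip
# would instead strip any trailing characters that occur in the suffix.
new_list = [name.removesuffix(suffix) for name in your_list]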
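
Note that str.rstrip treats its argument as a set of characters, not as a literal suffix, so it can strip more than intended. The sketch below illustrates the difference, assuming Python 3; remove_suffix is a hypothetical helper name that behaves like str.removesuffix on Python 3.9 and later.

```python
def remove_suffix(text, suffix):
    """Remove suffix from text only if text ends with it (hypothetical helper)."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


suffix = " | expand knowledge"
title = 'coffee shop - confort and food | expand knowledge'

# rstrip keeps removing trailing characters that appear anywhere in
# `suffix`, so it also eats the "ood" of "food".
print(title.rstrip(suffix))          # coffee shop - confort and f
print(remove_suffix(title, suffix))  # coffee shop - confort and food
```

Strings that do not end with the suffix are returned unchanged by the helper.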

score 0 · Accepted Answer

If you are sure that all the strings share the common suffix, this will do the trick:

strings = [
  'art gallery - museum and visits | expand knowledge',
  'lasergame - entertainment | expand knowledge']
suffixlen = len(" | expand knowledge")
print([s[:-suffixlen] for s in strings])

Output:

['art gallery - museum and visits', 'lasergame - entertainment']
