python - How to clean strings using regular expressions from a list

Question

I have file names that may take any of the following forms:

String1_Todelete_restofstring.txt
String2_Alsotoremove_restofstring.txt
String3_2013_restofstring.txt
String4_2011_restofstring.txt
String5_restofstring_tosuppress.txt

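I would like to define a function using re.sub that removes all keywords defined in a list (or any dictionary), including:

"Todelete", 2013, 2011, "Alsotoremove", "tosuppress"

so that the examples above (possibly including different dates) become: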

String1_restofstring.txt
String2_restofstring.txt
String3_restofstring.txt
String4_restofstring.txt
String5_restofstring.txt

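Please advise.

Edit

Thanks for the helpful answers. I find Cobabunga's implementation compact, and it can be done in a single function. Regarding the question in the comments, my intention was to keep the question as general as possible to allow a variety of solutions, even taking dates into account, which I think could also be handled in the regular expression.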

score 2 · Accepted Answer

You can build a single regular expression that contains all the words you want to remove, like this:

import re

to_remove = ["Todelete", "2013", "2011", "Alsotoremove", "tosuppress"]
pattern = "|".join("_?" + re.escape(x) for x in to_remove)

names = ["String1_Todelete_restofstring.txt",
         "String2_Alsotoremove_restofstring.txt",
         "String3_2013_restofstring.txt",
         "String4_2011_restofstring.txt",
         "String5_restofstring_tosuppress.txt"]

names_replaced = [re.sub(pattern, "", x) for x in names]
print(names_replaced)

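Note that I included an optional underscore ('_') before each word to be replaced, because if you only replace Todelete in the first example, you end up with String1__restofstring.txt instead of String1_restofstring.txt.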
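To make the underscore point concrete, here is a minimal sketch comparing the two patterns on the first example name. The variable names are only illustrative.

```python
import re

name = "String1_Todelete_restofstring.txt"

# Removing only the keyword leaves two adjacent underscores behind.
without_underscore = re.sub(re.escape("Todelete"), "", name)

# Consuming an optional leading underscore together with the keyword avoids that.
with_underscore = re.sub("_?" + re.escape("Todelete"), "", name)

print(without_underscore)  # String1__restofstring.txt
print(with_underscore)     # String1_restofstring.txt
```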

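For your specific examples, re.escape is not necessary, but if your words contain any characters that have a special meaning in regular expressions, you would get unexpected results without it.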
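As a rough illustration of why the escaping matters, the sketch below wraps the same pattern in a function (the name remove_keywords is just a placeholder) and uses a made-up keyword, "v1.0", whose dot would otherwise match any character.

```python
import re

def remove_keywords(name, keywords):
    """Remove each keyword (plus an optional leading underscore) from name."""
    pattern = "|".join("_?" + re.escape(k) for k in keywords)
    return re.sub(pattern, "", name)

# "v1.0" is a made-up keyword; "." is a regex metacharacter.
names = ["report_v1.0_final.txt", "report_v1x0_final.txt"]

escaped = [remove_keywords(n, ["v1.0"]) for n in names]
unescaped = [re.sub("_?v1.0", "", n) for n in names]

print(escaped)    # ['report_final.txt', 'report_v1x0_final.txt']
print(unescaped)  # ['report_final.txt', 'report_final.txt']
```

Without re.escape, the second name loses "_v1x0" even though it does not contain the keyword.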

score 1 · Accepted Answer

This works:

import re

st='''\
String1_Todelete_restofstring.txt
String2_Alsotoremove_restofstring.txt
String3_2013_restofstring.txt
String4_2011_restofstring.txt
String5_restofstring_tosuppress.txt'''

deletions=["Todelete", '2013','2011', "Alsotoremove","tosuppress"]

for line in st.splitlines():
    for deletion in deletions:
        if re.search('_'+deletion,line):
            line=re.sub('_'+deletion,'',line)
    print(line)

Edit

As pointed out in the comments, the re.search is redundant.
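A minimal sketch of the loop without the check, relying on the fact that re.sub returns the string unchanged when the pattern does not match. The function name clean is only for illustration.

```python
import re

deletions = ["Todelete", "2013", "2011", "Alsotoremove", "tosuppress"]

def clean(line):
    # No re.search guard needed: re.sub is a no-op when nothing matches.
    for deletion in deletions:
        line = re.sub("_" + deletion, "", line)
    return line

print(clean("String5_restofstring_tosuppress.txt"))  # String5_restofstring.txt
print(clean("String6_restofstring.txt"))             # String6_restofstring.txt
```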

Also, in this particular case, str.replace is faster:

import re
import timeit 

st='''\
String1_Todelete_restofstring.txt
String2_Alsotoremove_restofstring.txt
String3_2013_restofstring.txt
String4_2011_restofstring.txt
String5_restofstring_tosuppress.txt'''

deletions=["Todelete", '2013','2011', "Alsotoremove","tosuppress"]


def rep():
    for line in st.splitlines():
        for deletion in deletions:
            line=line.replace('_'+deletion,'')


def reg():
    for line in st.splitlines():
        for deletion in deletions:
            line=re.sub('_'+deletion,'',line)            


print(timeit.timeit('reg()', setup='from __main__ import reg', number=10000))
print(timeit.timeit('rep()', setup='from __main__ import rep', number=10000))

On my machine, str.replace() is about 5 times faster.

score 1 · Accepted Answer

This is probably more efficient than scanning each string as many times as you have keywords.

import re

strings = """String1_Todelete_restofstring.txt
String2_Alsotoremove_restofstring.txt
String3_2013_restofstring.txt
String4_2011_restofstring.txt
String5_restofstring_tosuppress.txt""".split()

keywords = set(("Todelete", "2013","2011", "Alsotoremove","tosuppress"))

for s in strings:
    print(re.sub("_[^_.]+", lambda m: "" if m.group(0)[1:] in keywords else m.group(0), s))
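One side effect worth noting: this version looks up each whole "_segment" in the set, so it only removes exact segments, whereas an alternation of keywords also removes keywords embedded in longer segments. The sketch below compares the two on a made-up file name; the function names are placeholders.

```python
import re

keywords = {"Todelete", "2013", "2011", "Alsotoremove", "tosuppress"}

def drop_segments(name):
    # Each "_segment" is checked as a whole against the keyword set.
    return re.sub(r"_[^_.]+",
                  lambda m: "" if m.group(0)[1:] in keywords else m.group(0),
                  name)

def drop_substrings(name):
    # Alternation of keywords: matches anywhere, even inside a longer segment.
    pattern = "|".join("_?" + re.escape(k) for k in sorted(keywords))
    return re.sub(pattern, "", name)

name = "String6_20130_restofstring.txt"  # hypothetical name, not from the question
print(drop_segments(name))    # String6_20130_restofstring.txt
print(drop_substrings(name))  # String60_restofstring.txt
```

Which behavior is preferable depends on whether keywords can legitimately appear inside longer segments.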

score 0 · Accepted Answer

Just to give you an idea (keeping it short because I'm on my phone):

/(.*?)_.*?_(.*?)\.(\w{2,})/

group(1) + '_' + group(2) + '.' + group(3)
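A possible Python rendering of that idea, as a sketch only (the function name is made up). It always drops the second underscore-delimited segment and does not consult a keyword list, so it handles the first four examples but not the fifth, where the unwanted word comes last.

```python
import re

# Same pattern as above, written as a Python raw string.
pattern = re.compile(r"(.*?)_.*?_(.*?)\.(\w{2,})")

def drop_second_segment(name):
    m = pattern.match(name)
    if m is None:
        return name  # fewer than two underscores: leave the name alone
    return m.group(1) + "_" + m.group(2) + "." + m.group(3)

print(drop_second_segment("String1_Todelete_restofstring.txt"))    # String1_restofstring.txt
print(drop_second_segment("String5_restofstring_tosuppress.txt"))  # String5_tosuppress.txt
```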
