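Here's one way to solve the problem (it doesn't use any regular expressions, although there's one place where it could). We split the problem into two functions: one that splits the string into comma-separated pieces and handles each piece (parseTags), and one that takes a string and processes it into a valid tag (sanitizeTag). The annotated code is as follows: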

    # This function takes a string with commas separating raw user input, and
    # returns a list of valid tags made by sanitizing the strings between the
    # commas.
    def parseTags(str):
        # First, we split the string on commas.
        rawTags = str.split(',')
        # Then, we sanitize each of the tags. If sanitizing gives us back None,
        # then the tag was invalid, so we leave those cases out of our final
        # list of tags. We can use None as the predicate because sanitizeTag
        # will never return '', which is the only falsy string. The list()
        # call is harmless on Python 2 and required on Python 3, where filter
        # returns a lazy iterator rather than a list.
        return list(filter(None, map(sanitizeTag, rawTags)))

    # This function takes a single proto-tag---the string in between the commas
    # that will be turned into a valid tag---and sanitizes it. It either
    # returns an alphanumeric string (if the argument can be made into a valid
    # tag) or None (if the argument cannot be made into a valid tag; i.e., if
    # the argument contains only whitespace and/or punctuation).
    def sanitizeTag(str):
        # First, we turn non-alphanumeric characters into whitespace. You could
        # also use a regular expression here; see below.
        str = ''.join(c if c.isalnum() else ' ' for c in str)
        # Next, we split the string on spaces, ignoring leading and trailing
        # whitespace.
        words = str.split()
        # There are now three possibilities: there are no words, there was one
        # word, or there were multiple words.
        numWords = len(words)
        if numWords == 0:
            # If there were no words, the string contained only spaces (and/or
            # punctuation). This can't be made into a valid tag, so we return
            # None.
            return None
        elif numWords == 1:
            # If there was only one word, that word is the tag, no
            # post-processing required.
            return words[0]
        else:
            # Finally, if there were multiple words, we camel-case the string:
            # we lowercase the first word, capitalize the first letter of all
            # the other words and lowercase the rest, and finally stick all
            # these words together without spaces.
            return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

And indeed, if we run this code, we get:

    >>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
    ['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']

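There are two points in this code that are worth clarifying. The first is the use of str.split() in sanitizeTag. This turns ' a b c ' into ['a','b','c'], whereas str.split(' ') would produce ['','a','b','c','']. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$. The $ gets turned into a space and is then stripped out by the split, so only one word remains and the result is tAG instead of tag (the trailing empty string would otherwise count as a second word and send the input down the multi-word branch, which lowercases the first word). This is probably what you want, but if it isn't, you have to be careful. What I would do is change that line to words = re.split(r'\s+', str), which splits the string on whitespace but keeps the leading and trailing empty strings; however, I would also change parseTags to use rawTags = re.split(r'\s*,\s*', str). (Both need import re.) You must make both of these changes: 'a , b , c'.split(',') becomes ['a ', ' b ', ' c'], which isn't the behavior you want, whereas r'\s*,\s*' deletes the space around the commas too. If you ignore leading and trailing whitespace, the difference is immaterial; but if you don't, then you need to be careful.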
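
To make the splitting differences described above concrete, here is a small sketch that just evaluates each call side by side. The variable names are mine and purely illustrative; the calls themselves are the ones mentioned in the text.

```python
import re

padded = ' a b c '

# No argument: split on runs of whitespace, dropping empty strings at the ends.
plain = padded.split()                       # ['a', 'b', 'c']
# Explicit ' ': every single space is a separator, so empty strings survive.
on_space = padded.split(' ')                 # ['', 'a', 'b', 'c', '']
# Regex: runs of whitespace collapse, but the empty strings at the ends stay.
on_regex = re.split(r'\s+', padded)          # ['', 'a', 'b', 'c', '']

# The corner case: 'tAG$' after its '$' has been turned into a space.
one_word = 'tAG '.split()                    # ['tAG']      -> single-word branch
two_words = re.split(r'\s+', 'tAG ')         # ['tAG', '']  -> multi-word branch

# Splitting on commas, with and without the surrounding whitespace.
keep_ws = 'a , b , c'.split(',')             # ['a ', ' b ', ' c']
drop_ws = re.split(r'\s*,\s*', 'a , b , c')  # ['a', 'b', 'c']

for value in (plain, on_space, on_regex, one_word, two_words, keep_ws, drop_ws):
    print(value)
```

Note that r'\s*,\s*' only removes whitespace next to a comma; whitespace at the very start or end of the whole input is left alone.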
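Finally, there's the lack of regular expressions; instead, the code uses str = ''.join(c if c.isalnum() else ' ' for c in str). You can replace this with a regular expression if you want. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with

    str = re.sub(r'[^A-Za-z0-9]', ' ', str)
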
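This uses [^...] to match everything except the listed characters: ASCII letters and digits. However, it's better to support Unicode, and it's easy, too. The simplest such approach is

    str = re.sub(r'\W', ' ', str, flags=re.UNICODE)
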
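Here, \W matches non-word characters; a word character is a letter, a digit, or the underscore. flags=re.UNICODE specifies that letters and digits are any appropriate Unicode characters; without it, they're just ASCII. (The flags argument is not available before Python 2.7; you can instead use r'(?u)\W', which works in earlier versions as well as in 2.7.) If you don't want underscores, you can add |_ to the regex so that it matches underscores too and replaces them with spaces as well:

    str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)

This last one, I believe, matches the behavior of my non-regex code exactly.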
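
One way to gain some confidence in that claim is to run both approaches on a few inputs and compare. The sketch below does that under Python 3, where str patterns are Unicode-aware by default, so the flag is kept only to mirror the line above. The helper names blank_with_join and blank_with_regex and the sample strings are my own, and a handful of samples is a spot check, not a proof.

```python
import re

def blank_with_join(s):
    # Non-regex version: every non-alphanumeric character becomes a space.
    return ''.join(c if c.isalnum() else ' ' for c in s)

def blank_with_regex(s):
    # Regex version: non-word characters and underscores become spaces.
    return re.sub(r'\W|_', ' ', s, flags=re.UNICODE)

samples = ['no!punc$$', 'snake_case tag', 'caf\u00e9 au lait', '\t\n!&#^ ']

# Collect any sample on which the two versions disagree.
mismatches = [s for s in samples if blank_with_join(s) != blank_with_regex(s)]
print(mismatches)
```

If the two versions agree on every sample, the printed list is empty.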
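Also, here's how I'd write the same code without those comments; this also lets me eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.
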
    def parseTags(str):
        # list() keeps the result a list on Python 3 as well as Python 2.
        return list(filter(None, map(sanitizeTag, str.split(','))))

    def sanitizeTag(str):
        words = ''.join(c if c.isalnum() else ' ' for c in str).split()
        numWords = len(words)
        if numWords == 0:
            return None
        elif numWords == 1:
            return words[0]
        else:
            return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

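To handle the newly desired behavior, we have to do two things. First, we need a way to fix the capitalization of the first word: lowercase the whole word if its first letter is lowercase, and lowercase everything but the first letter if its first letter is uppercase. That's easy: we can just check directly. Second, we want to treat punctuation as completely invisible: it shouldn't cause the following word to be capitalized. Again, that's easy; I even discussed how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us
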
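    def parseTags(str):
        # list() keeps the result a list on Python 3 as well as Python 2.
        return list(filter(None, map(sanitizeTag, str.split(','))))

    def sanitizeTag(str):
        # Drop punctuation outright instead of turning it into spaces. The
        # join over a generator works on both Python 2 and 3, unlike calling
        # filter() on a string, which only returns a string on Python 2.
        words = ''.join(c for c in str if c.isalnum() or c.isspace()).split()
        numWords = len(words)
        if numWords == 0:
            return None
        elif numWords == 1:
            return words[0]
        else:
            # Keep the case of the first letter and lowercase the rest.
            words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
            return words0 + ''.join(w.capitalize() for w in words[1:])
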
Running this code gives us the following output:

    >>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se@%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
    ['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']
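
To see what the two changes actually do, it can help to run the earlier and the revised sanitizer on the same raw pieces. The following is only a sketch for Python 3: sanitize_v1 and sanitize_v2 are hypothetical names I've given to compact restatements of the two sanitizeTag versions above, not new behavior.

```python
def sanitize_v1(s):
    # Earlier behavior: punctuation becomes a space, so it separates words.
    words = ''.join(c if c.isalnum() else ' ' for c in s).split()
    if not words:
        return None
    if len(words) == 1:
        return words[0]
    return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

def sanitize_v2(s):
    # Revised behavior: punctuation is dropped, so it never starts a new word.
    words = ''.join(c for c in s if c.isalnum() or c.isspace()).split()
    if not words:
        return None
    if len(words) == 1:
        return words[0]
    # The first word keeps the case of its first letter.
    first = words[0].lower() if words[0][0].islower() else words[0].capitalize()
    return first + ''.join(w.capitalize() for w in words[1:])

for raw in [' No!pUnc$$', ' AnD tHIs', ' se@%condcomment$ ']:
    print(repr(raw), sanitize_v1(raw), sanitize_v2(raw))
```

With these inputs the first version yields noPunc, andThis and seCondcomment, while the revised one yields NopUnc, AndThis and secondcomment, which matches the output shown above.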