python - Python regular expression to remove spaces and capitalize the letters where the spaces were?

Question

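I want to create a list of tags from a single user-supplied input box, separated by commas, and I'm looking for some expression(s) that can help automate this.

What I want is to take the input field and:

remove all double+ whitespace, tabs, and newlines (leaving just single spaces)
remove ALL quotation marks (single and double) and other punctuation, except for commas, of which there can be only one in a row
between each pair of commas, I want Something Like Title Case, but excluding the first word and not applied at all to single words, so that when the last spaces are removed, the tag comes out as 'somethingLikeTitleCase' or just 'something' or 'twoWords'
and finally, remove all remaining spaces

Here's what I have gathered around SO so far: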

import re

def no_whitespace(s):
    """Remove all whitespace & newlines."""
    return re.sub(r"(?m)\s+", "", s)


# remove spaces, newlines, all whitespace
# http://stackoverflow.com/a/42597/523051

tag_list = ''.join(no_whitespace(tags_input))

# split into a list at commas

tag_list = tag_list.split(',')

# remove any empty strings (since I currently don't know how to remove double commas)
# http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings
# (list() is needed on Python 3, where filter returns an iterator)

tag_list = list(filter(None, tag_list))

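But when it comes to modifying that regular expression to remove all punctuation except commas, I'm lost, and I don't even know where to begin with the capitalization.

Any thoughts to get me headed in the right direction?

As suggested, here are some sample inputs = desired_outputs

form: 'tHiS is a tag, 'whitespace' !&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps' should come out as ['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']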

score 3 · Accepted Answer

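Here's one approach to the problem (it doesn't use any regular expressions, although there's one place where it could). We split the problem into two functions: one that splits the string into comma-separated pieces and handles each piece (parseTags), and one that takes a string and processes it into a valid tag (sanitizeTag). The annotated code is as follows: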

# This function takes a string with commas separating raw user input, and
# returns a list of valid tags made by sanitizing the strings between the
# commas.
def parseTags(str):
    # First, we split the string on commas.
    rawTags = str.split(',')

    # Then, we sanitize each of the tags.  If sanitizing gives us back None,
    # then the tag was invalid, so we leave those cases out of our final
    # list of tags.  We can use None as the predicate because sanitizeTag
    # will never return '', which is the only falsy string.
    # (On Python 3, filter returns an iterator, so we wrap it in list().)
    return list(filter(None, map(sanitizeTag, rawTags)))

# This function takes a single proto-tag---the string in between the commas
# that will be turned into a valid tag---and sanitizes it.  It either
# returns an alphanumeric string (if the argument can be made into a valid
# tag) or None (if the argument cannot be made into a valid tag; i.e., if
# the argument contains only whitespace and/or punctuation).
def sanitizeTag(str):
    # First, we turn non-alphanumeric characters into whitespace.  You could
    # also use a regular expression here; see below.
    str = ''.join(c if c.isalnum() else ' ' for c in str)

    # Next, we split the string on spaces, ignoring leading and trailing
    # whitespace.
    words = str.split()

    # There are now three possibilities: there are no words, there was one
    # word, or there were multiple words.
    numWords = len(words)
    if numWords == 0:
        # If there were no words, the string contained only spaces (and/or
        # punctuation).  This can't be made into a valid tag, so we return
        # None.
        return None
    elif numWords == 1:
        # If there was only one word, that word is the tag, no
        # post-processing required.
        return words[0]
    else:
        # Finally, if there were multiple words, we camel-case the string:
        # we lowercase the first word, capitalize the first letter of all
        # the other words and lowercase the rest, and finally stick all
        # these words together without spaces.
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

And indeed, if we run this code, we get:

>>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']

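There are two things worth clarifying in this code. First is the use of str.split() in sanitizeTag. This turns ' a b c ' into ['a','b','c'], whereas str.split(' ') would produce ['','a','b','c','']. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$. The $ gets turned into a space, which is then stripped out by the split; thus, this becomes tAG instead of tag. This is probably what you want, but if it isn't, you have to be careful. What I would do is change that line to words = re.split(r'\s+', str), which splits the string on whitespace but keeps the leading and trailing empty strings; however, I would then also change parseTags to use rawTags = re.split(r'\s*,\s*', str). You must make both of these changes; 'a , b , c'.split(',') becomes ['a ', ' b ', ' c'], which isn't the behavior you want, whereas r'\s*,\s*' removes the whitespace around the commas too. If you ignore leading and trailing whitespace, the difference is immaterial; but if you don't, then you need to be careful.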
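The splitting behaviors described above can be checked directly. This is only a small sketch on Python 3; the variable names (padded, plain, and so on) are made up for illustration and are not part of the answer's code.

```python
import re

padded = ' a b c '

# No argument: split on runs of whitespace and drop leading/trailing empties.
plain = padded.split()               # ['a', 'b', 'c']

# Explicit ' ' separator: every single space is a boundary, so the
# leading and trailing spaces produce empty strings.
on_space = padded.split(' ')         # ['', 'a', 'b', 'c', '']

# re.split on \s+ also keeps the leading and trailing empty strings.
on_regex = re.split(r'\s+', padded)  # ['', 'a', 'b', 'c', '']

# The corner case from the text: 'tAG$' once '$' has become a space.
one_word = 'tAG '.split()             # ['tAG']
two_words = re.split(r'\s+', 'tAG ')  # ['tAG', '']

# Splitting on commas with and without the surrounding whitespace.
raw = 'a , b , c'.split(',')                 # ['a ', ' b ', ' c']
trimmed = re.split(r'\s*,\s*', 'a , b , c')  # ['a', 'b', 'c']

print(plain, on_space, on_regex, one_word, two_words, raw, trimmed)
```

With the re.split variant, 'tAG ' counts as two words (the second one empty), so it takes the multi-word branch and comes out as tag rather than tAG.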

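Second, there's the lack of regular expressions, and instead the use of str = ''.join(c if c.isalnum() else ' ' for c in str). You can replace this with a regular expression if you want. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with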

str = re.sub(r'[^A-Za-z0-9]', ' ', str)

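This uses [^...] to match everything except the listed characters: ASCII letters and digits. However, it's better to support Unicode, and it's easy, too. The simplest such approach is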

str = re.sub(r'\W', ' ', str, flags=re.UNICODE)

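Here, \W matches non-word characters; a word character is a letter, a digit, or the underscore. Specifying flags=re.UNICODE (the flags argument is not available before Python 2.7; you can use r'(?u)\W' instead, which works in earlier versions as well as in 2.7) makes letters and digits mean any appropriate Unicode characters; without it, they are ASCII only. If you don't want underscores, you can add |_ to the regex so that underscores are matched too and likewise replaced with spaces: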

str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)

This last one, I believe, matches the behavior of my non-regex code exactly.
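To make the differences between these variants concrete, here is a rough side-by-side sketch. The helper names (by_isalnum and so on) and the sample string are invented for this comparison. It assumes Python 3, where str patterns are already Unicode-aware, so flags=re.UNICODE is redundant there but harmless.

```python
import re

def by_isalnum(s):
    # The non-regex version from the answer.
    return ''.join(c if c.isalnum() else ' ' for c in s)

def by_ascii_class(s):
    # ASCII letters and digits only; everything else becomes a space.
    return re.sub(r'[^A-Za-z0-9]', ' ', s)

def by_nonword(s):
    # Unicode-aware, but the underscore counts as a word character.
    return re.sub(r'\W', ' ', s, flags=re.UNICODE)

def by_nonword_or_underscore(s):
    # Unicode-aware, and underscores are replaced as well.
    return re.sub(r'\W|_', ' ', s, flags=re.UNICODE)

sample = 'na\u00efve_tag!'  # 'naïve_tag!'

print(repr(by_isalnum(sample)))                # 'naïve tag '
print(repr(by_ascii_class(sample)))            # 'na ve tag '
print(repr(by_nonword(sample)))                # 'naïve_tag '
print(repr(by_nonword_or_underscore(sample)))  # 'naïve tag '
```

On this sample, only the \W|_ variant agrees with the isalnum version, which is consistent with the claim above.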

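Also, here's how I'd write the same code without those comments; this also lets me eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.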

def parseTags(str):
    return list(filter(None, map(sanitizeTag, str.split(','))))

def sanitizeTag(str):
    words    = ''.join(c if c.isalnum() else ' ' for c in str).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

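To handle the newly desired behavior, there are two things we have to do. First, we need a way to fix the capitalization of the first word: lowercase the whole word if its first letter is lowercase, and lowercase everything but the first letter if its first letter is uppercase. That's easy: we can just check directly. Second, we want to treat punctuation as completely invisible: it shouldn't cause the following word to be capitalized. Again, that's easy; I even discussed how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us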

def parseTags(str):
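    return list(filter(None, map(sanitizeTag, str.split(','))))

def sanitizeTag(str):
    # Drop punctuation entirely (join is needed on Python 3, where filter
    # on a string no longer returns a string).
    words    = ''.join(c for c in str if c.isalnum() or c.isspace()).split()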
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
        return words0 + ''.join(w.capitalize() for w in words[1:])

Running this code gives us the following output

>>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se@%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']

score 1 · Accepted Answer

You could use a whitelist of the characters allowed to appear in a word; everything else is ignored:

import re

def camelCase(tag_str):
    words = re.findall(r'\w+', tag_str)
    nwords = len(words)
    if nwords == 1:
        return words[0] # leave unchanged
    elif nwords > 1: # make it camelCaseTag
        return words[0].lower() + ''.join(map(str.title, words[1:]))
    return '' # no word characters

This example uses \w word characters.

Example

tags_str = """ 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$, 
ifNOSPACESthenPRESERVEcaps' """
print("\n".join(filter(None, map(camelCase, tags_str.split(',')))))

Output

thisIsATag
whitespace
secondcomment
noPunc
ifNOSPACESthenPRESERVEcaps
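One subtlety that may matter with this approach: \w also matches digits and the underscore, and str.title treats them as word boundaries, whereas str.capitalize does not. The short sketch below only illustrates that difference; the sample words are made up.

```python
import re

# Underscores and digits stay inside a single \w+ match.
found = re.findall(r'\w+', 'foo_bar a1b')  # ['foo_bar', 'a1b']

words = ['a1b', 'foo_bar', 'tAg']

# title() uppercases every letter that follows a non-letter.
titled = [w.title() for w in words]            # ['A1B', 'Foo_Bar', 'Tag']

# capitalize() uppercases only the first character and lowercases the rest.
capitalized = [w.capitalize() for w in words]  # ['A1b', 'Foo_bar', 'Tag']

print(found, titled, capitalized)
```

For purely alphabetic words the two methods agree, so the choice only matters if tags may contain digits or underscores.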

score 0 · Accepted Answer

I think this should work

import re

def toCamelCase(s):
  # remove all punctuation
  # modify to include other characters you may want to keep
  s = re.sub(r"[^a-zA-Z0-9\s]", "", s)

  # remove leading spaces
  s = re.sub(r"^\s+", "", s)

  # camel case: uppercase each lowercase letter that follows whitespace
  s = re.sub(r"\s[a-z]", lambda m: m.group(0)[1].upper(), s)

  # remove all remaining punctuation and spaces
  s = re.sub(r"[^a-zA-Z0-9]", "", s)
  return s

tag_list = [s for s in (toCamelCase(s.lower()) for s in tag_list.split(',')) if s]

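The key here is to use re.sub to make the replacements you want.

EDIT: Doesn't preserve caps, but does handle uppercase strings with spaces

EDIT: Moved "if s" after the toCamelCase call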
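As a quick check of what this approach produces, the sketch below repeats the function (with raw-string patterns) so that it runs on its own, and applies it to the sample input used in the first answer. The names tags_input and result are placeholders for illustration.

```python
import re

def toCamelCase(s):
    s = re.sub(r"[^a-zA-Z0-9\s]", "", s)   # drop punctuation
    s = re.sub(r"^\s+", "", s)             # drop leading whitespace
    # uppercase each lowercase letter that follows whitespace
    s = re.sub(r"\s[a-z]", lambda m: m.group(0)[1].upper(), s)
    return re.sub(r"[^a-zA-Z0-9]", "", s)  # drop whatever is left over

tags_input = ("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, "
              "ifNOSPACESthenPRESERVEcaps")

result = [s for s in (toCamelCase(t.lower()) for t in tags_input.split(',')) if s]
print(result)
# ['thisIsATag', 'secondcomment', 'nopunc', 'ifnospacesthenpreservecaps']
```

This matches the caveat in the edit note: capitalization inside a single word is lost, and because punctuation is deleted rather than treated as a separator, no!punc becomes nopunc instead of noPunc.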
