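I want to create a list of tags from a single user-supplied input box, separated by commas, and I'm looking for some expressions that can help automate this.

What I want is to take the input field and:

  • remove all double+ whitespace, tabs, and newlines (leaving just single spaces)
  • remove all quotation marks (single and double), and all other punctuation except commas, of which there can be only one in a row
  • in between each comma, I want Something Like Title Case, but excluding the first word and not at all for single words, so that when the last spaces are removed, the tag comes out as "somethingLikeTitleCase" or just "something" or "twoWords"
  • and finally, remove all remaining spaces

Here's what I have gathered around SO so far: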

import re

def no_whitespace(s):
    """Remove all whitespace & newlines."""
    return re.sub(r"(?m)\s+", "", s)


# remove spaces, newlines, all whitespace
# http://stackoverflow.com/a/42597/523051

tag_list = ''.join(no_whitespace(tags_input))

# split into a list at commas

tag_list = tag_list.split(',')

# remove any empty strings (since I currently don't know how to remove double commas)
# http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings

tag_list = list(filter(None, tag_list))

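However, I got lost when it came to modifying that regex to remove all punctuation except commas, and I don't even know where to begin with the capitalization.

Any thoughts to get me going in the right direction?


As suggested, here are some sample inputs = desired_outputs

form: 'tHiS is a tag, 'whitespace' !&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps' should come out as ['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']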

3 Answers

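Here's one way to solve the problem (it doesn't use any regular expressions, although there is one place where it could). We split the problem into two functions: one that splits the string into comma-separated pieces and handles each piece (parseTags), and one that takes a string and processes it into a valid tag (sanitizeTag). The annotated code is as follows: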

# This function takes a string with commas separating raw user input, and
# returns a list of valid tags made by sanitizing the strings between the
# commas.
def parseTags(str):
    # First, we split the string on commas.
    rawTags = str.split(',')

    # Then, we sanitize each of the tags.  If sanitizing gives us back None,
    # then the tag was invalid, so we leave those cases out of our final
    # list of tags.  We can use None as the predicate because sanitizeTag
    # will never return '', which is the only falsy string.
    return list(filter(None, map(sanitizeTag, rawTags)))

# This function takes a single proto-tag---the string in between the commas
# that will be turned into a valid tag---and sanitizes it.  It either
# returns an alphanumeric string (if the argument can be made into a valid
# tag) or None (if the argument cannot be made into a valid tag; i.e., if
# the argument contains only whitespace and/or punctuation).
def sanitizeTag(str):
    # First, we turn non-alphanumeric characters into whitespace.  You could
    # also use a regular expression here; see below.
    str = ''.join(c if c.isalnum() else ' ' for c in str)

    # Next, we split the string on spaces, ignoring leading and trailing
    # whitespace.
    words = str.split()

    # There are now three possibilities: there are no words, there was one
    # word, or there were multiple words.
    numWords = len(words)
    if numWords == 0:
        # If there were no words, the string contained only spaces (and/or
        # punctuation).  This can't be made into a valid tag, so we return
        # None.
        return None
    elif numWords == 1:
        # If there was only one word, that word is the tag, no
        # post-processing required.
        return words[0]
    else:
        # Finally, if there were multiple words, we camel-case the string:
        # we lowercase the first word, capitalize the first letter of all
        # the other words and lowercase the rest, and finally stick all
        # these words together without spaces.
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

Indeed, if we run this code, we get:

>>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']

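There are a couple of things in this code that are worth clarifying. First is the use of str.split() in sanitizeTag. This turns ' a b c ' into ['a','b','c'], whereas str.split(' ') would produce ['','a','b','c','']. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$. The $ gets turned into a space, which is then stripped out by the split; thus, this becomes tAG instead of tag. This is probably what you want, but if it isn't, you have to be careful.

What I would do is change that line to words = re.split(r'\s+', str), which splits the string on whitespace but leaves in the leading and trailing empty strings; however, I would also change parseTags to use rawTags = re.split(r'\s*,\s*', str). You must make both of these changes: 'a , b , c'.split(',') becomes ['a ', ' b ', ' c'], which is not the behavior you want, whereas r'\s*,\s*' deletes the whitespace around the commas too. If you ignore leading and trailing whitespace, the difference is immaterial; but if you don't, then you need to be careful.

Finally, there's the lack of regular expressions, and instead the use of str = ''.join(c if c.isalnum() else ' ' for c in str). You can, if you want, replace this with a regular expression. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with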
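
To make the difference between these splitting behaviors concrete, here is a minimal sketch. The names sanitize_tag_strict and parse_tags_strict are hypothetical (they do not appear in the code above), and stripping the whole input before splitting is an added assumption so that whitespace at the very start of the input is not treated as a word boundary.

```python
import re

# How the different ways of splitting treat leading and trailing whitespace.
padded = ' a b c '
print(padded.split())                      # ['a', 'b', 'c']
print(padded.split(' '))                   # ['', 'a', 'b', 'c', '']
print(re.split(r'\s+', padded))            # ['', 'a', 'b', 'c', '']
print(re.split(r'\s*,\s*', 'a , b , c'))   # ['a', 'b', 'c']


def sanitize_tag_strict(raw):
    """Hypothetical variant of sanitizeTag: punctuation counts as a word boundary."""
    spaced = ''.join(c if c.isalnum() else ' ' for c in raw)
    # re.split keeps empty strings at the edges, so 'tAG$' yields ['tAG', ''].
    words = re.split(r'\s+', spaced)
    if not any(words):
        return None            # only whitespace and/or punctuation
    if len(words) == 1:
        return words[0]        # a lone word is left untouched
    return words[0].lower() + ''.join(w.capitalize() for w in words[1:])


def parse_tags_strict(text):
    """Hypothetical variant of parseTags that also trims whitespace around commas."""
    # Assumption: strip the whole input first so its outer whitespace is ignored.
    raw_tags = re.split(r'\s*,\s*', text.strip())
    return [tag for tag in map(sanitize_tag_strict, raw_tags) if tag]


print(sanitize_tag_strict('tAG$'))   # tag
print(sanitize_tag_strict('tAG'))    # tAG
```

With this variant, a trailing $ makes tAG$ count as more than one word, so it is lowercased to tag, while a bare tAG is still returned unchanged.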

str = re.sub(r'[^A-Za-z0-9]', ' ', str)

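This uses [^...] to match everything except the listed characters: ASCII letters and digits. However, it's better to support Unicode, and it's easy, too. The simplest such approach is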

str = re.sub(r'\W', ' ', str, flags=re.UNICODE)

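Here, \W matches non-word characters; a word character is a letter, a digit, or the underscore. flags=re.UNICODE (the flags argument is not available before Python 2.7; for earlier versions you can use r'(?u)\W' instead) specifies that letters and digits are any appropriate Unicode characters; without it, in Python 2 they're just ASCII (in Python 3, str patterns are Unicode-aware by default). And if you don't want underscores, you can add |_ to the regex to match underscores as well, replacing them with spaces too: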

str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)

This last one, I believe, matches the behavior of my non-regex code exactly.
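
As a quick sanity check of that claim, the following sketch compares the two approaches on a few sample strings under Python 3 (where flags=re.UNICODE is redundant for str patterns but harmless). The helper names and the sample strings are made up for illustration, and agreement on these samples is not a proof of equivalence.

```python
import re

def spaces_via_isalnum(text):
    # The non-regex approach: every non-alphanumeric character becomes a space.
    return ''.join(c if c.isalnum() else ' ' for c in text)

def spaces_via_regex(text):
    # The regex approach: non-word characters and underscores become spaces.
    return re.sub(r'\W|_', ' ', text, flags=re.UNICODE)

samples = ['no!punc$$', 'snake_case tag', 'café-au_lait', ' \t\n!&#^ ']
for sample in samples:
    assert spaces_via_isalnum(sample) == spaces_via_regex(sample)
    print(repr(spaces_via_regex(sample)))
```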


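Also, here's how I'd write the same code without those comments; this also allows me to eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.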

def parseTags(str):
    return list(filter(None, map(sanitizeTag, str.split(','))))

def sanitizeTag(str):
    words    = ''.join(c if c.isalnum() else ' ' for c in str).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

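To handle the newly desired behavior, there are two things we have to do. First, we need a way to fix the capitalization of the first word: lowercase the whole word if its first letter is lowercase, and lowercase everything but the first letter if its first letter is uppercase. That's easy: we can just check directly. Second, we want to treat punctuation as completely invisible: it shouldn't cause the following word to be capitalized. Again, that's easy; I even discussed how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us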

def parseTags(str):
    return list(filter(None, map(sanitizeTag, str.split(','))))

def sanitizeTag(str):
    words    = ''.join(c for c in str if c.isalnum() or c.isspace()).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
        return words0 + ''.join(w.capitalize() for w in words[1:])

Running this code gives us the following output

>>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se@%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']
answered 2012-08-22T23:46:22.017

You could use a whitelist of characters that are allowed to appear in a word; everything else is ignored:

import re

def camelCase(tag_str):
    words = re.findall(r'\w+', tag_str)
    nwords = len(words)
    if nwords == 1:
        return words[0] # leave unchanged
    elif nwords > 1: # make it camelCaseTag
        return words[0].lower() + ''.join(map(str.title, words[1:]))
    return '' # no word characters

This example uses \w (word characters) as the whitelist.
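
To see what that whitelist keeps, here is a small illustrative sketch. The helper name word_runs and the sample strings are made up for this example. Note that \w also matches the underscore, so underscores stay inside a word.

```python
import re

def word_runs(text):
    # Return every maximal run of word characters (letters, digits, underscore).
    return re.findall(r'\w+', text)

print(word_runs(" 'whitespace' !&#^ "))  # ['whitespace']
print(word_runs('no!punc$$'))            # ['no', 'punc']
print(word_runs('snake_case tag'))       # ['snake_case', 'tag']
print(word_runs(' !&#^ '))               # []
```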

Example

tags_str = """ 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$, 
ifNOSPACESthenPRESERVEcaps' """
print("\n".join(filter(None, map(camelCase, tags_str.split(',')))))

Output

thisIsATag
whitespace
secondcomment
noPunc
ifNOSPACESthenPRESERVEcaps
answered 2012-08-22T22:44:09.983

I think this should work

import re

def toCamelCase(s):
  # remove all punctuation
  # modify to include other characters you may want to keep
  s = re.sub(r"[^a-zA-Z0-9\s]", "", s)

  # remove leading spaces
  s = re.sub(r"^\s+", "", s)

  # camel case: replace a whitespace character plus a lowercase letter
  # with that letter in upper case
  s = re.sub(r"\s[a-z]", lambda m: m.group(0)[1].upper(), s)

  # remove any remaining spaces
  s = re.sub(r"[^a-zA-Z0-9]", "", s)
  return s

# tags_input is the raw comma-separated string from the form
tag_list = [s for s in (toCamelCase(s.lower()) for s in tags_input.split(',')) if s]

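The key here is to make use of re.sub to do the replacements you want.

EDIT: Doesn't preserve caps, but does handle uppercase strings with spaces

EDIT: Moved "if s" after the toCamelCase call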
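
As a minimal illustration of that idea, re.sub accepts a function as the replacement; it is called with each match object and returns the replacement text. The helper name upper_after_space is made up for this sketch.

```python
import re

def upper_after_space(match):
    # match.group(0) is one whitespace character followed by a lowercase letter;
    # drop the whitespace and return the letter in upper case.
    return match.group(0)[1].upper()

print(re.sub(r'\s[a-z]', upper_after_space, 'this is a tag'))  # thisIsATag
print(re.sub(r'\s[a-z]', upper_after_space, 'two  words'))     # two Words
```

The second call shows why a final cleanup pass is still needed: with a run of several spaces, only the last one is consumed by the match.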

answered 2012-08-22T22:03:12.193