29

Is there a way to split a string without splitting on escaped characters? For example, I have a string that I want to split on ':' but not on '\:'

http\://www.example.url:ftp\://www.example.url

The result should look like this:

['http\://www.example.url' , 'ftp\://www.example.url']
4

10 Answers

48

There is a much simpler way, using a regular expression with a negative lookbehind assertion:

re.split(r'(?<!\\):', str)
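As a minimal sketch of how that one-liner could be generalized to any delimiter: the helper name `split_unescaped` is made up for illustration, and `re.escape` is used so that delimiters with a special meaning in regular expressions (such as `|`) are matched literally.

```python
import re

def split_unescaped(s, delim):
    # Hypothetical helper: split on delim unless a backslash comes right before it.
    # re.escape keeps regex metacharacters in delim from being interpreted.
    pattern = r'(?<!\\)' + re.escape(delim)
    return re.split(pattern, s)

s = r'http\://www.example.url:ftp\://www.example.url'
parts = split_unescaped(s, ':')
print(parts)  # ['http\\://www.example.url', 'ftp\\://www.example.url']
```

The escape characters stay in the output; removing them is a separate step if needed.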
answered 2014-01-14T07:18:56.313
10

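As Ignacio says, yes, but not trivially in one go. The issue is that you need to look back to determine whether you are at an escaped delimiter, and the basic str.split does not provide that functionality.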

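If this is not inside a tight loop, so performance is not a significant concern, you can do it by first splitting on the escaped delimiter, then splitting on the plain delimiter, and then merging the pieces. Ugly demo code follows: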

# Bear in mind this is not rigorously tested!
def escaped_split(s, delim):
    # split by escaped, then by not-escaped
    escaped_delim = '\\' + delim
    sections = [p.split(delim) for p in s.split(escaped_delim)]
    ret = []
    current = None  # the piece still being assembled
    for parts in sections:  # for each list of "real" splits
        if current is None:
            current = parts[0]
        else:
            # Re-join across the escaped delimiter that was split away
            current = escaped_delim.join([current, parts[0]])
        if len(parts) > 1:
            # A real delimiter ends the current piece
            ret.append(current)
            # Add all the items in the middle
            ret.extend(parts[1:-1])
            current = parts[-1]
    # The last piece is never followed by a delimiter, so add it here
    ret.append(current)
    return ret

s = r'http\://www.example.url:ftp\://www.example.url'
print (escaped_split(s, ':')) 
# >>> ['http\\://www.example.url', 'ftp\\://www.example.url']

Alternatively, the logic may be easier to follow if you just split the string by hand.

def escaped_split(s, delim):
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == '\\':
            try:
                # skip the next character; it has been escaped!
                current.append('\\')
                current.append(next(itr))
            except StopIteration:
                pass
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret

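Note that this second version behaves slightly differently when it encounters a double escape followed by a delimiter: this function allows the escape character itself to be escaped, so escaped_split(r'a\\:b', ':') returns ['a\\\\', 'b'], because the first \ escapes the second one, leaving the : to be interpreted as a real delimiter. So that is something to watch out for.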
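To make the difference concrete, here is a small sketch that runs the by-hand approach and a plain negative-lookbehind split on the same input. The name `split_by_hand` is made up for this comparison, and its body is a condensed restatement of the loop above rather than a new algorithm.

```python
import re

def split_by_hand(s, delim):
    # Hypothetical name; condensed version of the character-by-character loop.
    ret, current = [], []
    itr = iter(s)
    for ch in itr:
        if ch == '\\':
            current.append(ch)
            # Copy the escaped character verbatim (nothing if the string ends here).
            current.append(next(itr, ''))
        elif ch == delim:
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret

s = r'a\\:b'  # backslash, backslash, colon, b
by_hand = split_by_hand(s, ':')
by_lookbehind = re.split(r'(?<!\\):', s)
print(by_hand)        # ['a\\\\', 'b']
print(by_lookbehind)  # ['a\\\\:b']
```

The lookbehind only inspects the single character before the delimiter, so it cannot tell an escaped backslash from an escaping one.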

answered 2013-08-07T00:07:37.507
5

An edited version of Henry's answer with Python 3 compatibility, tests, and fixes for a few issues:

def split_unescape(s, delim, escape='\\', unescape=True):
    """
    >>> split_unescape('foo,bar', ',')
    ['foo', 'bar']
    >>> split_unescape('foo$,bar', ',', '$')
    ['foo,bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=True)
    ['foo$', 'bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=False)
    ['foo$$', 'bar']
    >>> split_unescape('foo$', ',', '$', unescape=True)
    ['foo$']
    """
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == escape:
            try:
                # skip the next character; it has been escaped!
                if not unescape:
                    current.append(escape)
                current.append(next(itr))
            except StopIteration:
                if unescape:
                    current.append(escape)
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret
answered 2014-02-19T13:57:57.823
4

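Here is an efficient solution that handles double escapes correctly, i.e. a double escape does not escape a delimiter that follows it. It ignores an invalid single escape as the last character of the string.

It is efficient because it iterates over the input string exactly once, manipulating indices instead of copying strings around. Instead of building a list, it returns a generator.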

def split_esc(string, delimiter):
    if len(delimiter) != 1:
        raise ValueError('Invalid delimiter: ' + delimiter)
    ln = len(string)
    i = 0
    j = 0
    while j < ln:
        if string[j] == '\\':
            if j + 1 >= ln:
                yield string[i:j]
                return
            j += 1
        elif string[j] == delimiter:
            yield string[i:j]
            i = j + 1
        j += 1
    yield string[i:j]

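To allow delimiters longer than a single character, advance i and j by the length of the delimiter in the "elif" case. This assumes that a single escape character escapes the entire delimiter rather than just one character.

Tested with Python 3.5.1.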
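A rough sketch of that multi-character variant follows. The name `split_esc_multi` is made up, and using `str.startswith` with a start offset to test for the delimiter is one possible choice, not part of the function above. As stated, one escape character is assumed to protect the whole delimiter.

```python
def split_esc_multi(string, delimiter):
    # Hypothetical multi-character variant of split_esc.
    if not delimiter:
        raise ValueError('Delimiter must not be empty')
    ln = len(string)
    step = len(delimiter)
    i = 0  # start of the current piece
    j = 0  # scan position
    while j < ln:
        if string[j] == '\\':
            if j + 1 >= ln:
                # Dangling escape at the end of the input: drop it.
                yield string[i:j]
                return
            if string.startswith(delimiter, j + 1):
                j += 1 + step  # the escape covers the entire delimiter
            else:
                j += 2  # the escape covers a single character
        elif string.startswith(delimiter, j):
            yield string[i:j]
            j += step
            i = j
        else:
            j += 1
    yield string[i:]

print(list(split_esc_multi(r'a::b\::c::d', '::')))  # ['a', 'b\\::c', 'd']
```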

answered 2016-02-18T22:32:10.067
4

Building on @user629923's suggestion, but much simpler than the other answers:

import re
DBL_ESC = "!double escape!"

s = r"Hello:World\:Goodbye\\:Cruel\\\:World"

# list() is needed on Python 3, where map returns a lazy iterator
list(map(lambda x: x.replace(DBL_ESC, r'\\'), re.split(r'(?<!\\):', s.replace(r'\\', DBL_ESC))))
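The same idea can be wrapped in a function, sketched below. The name `split_with_placeholder` and the guard clause are additions for illustration. The approach relies on the assumption that the placeholder text never appears in the input and does not contain the delimiter.

```python
import re

DBL_ESC = "!double escape!"  # assumed never to occur in real input

def split_with_placeholder(s, delim=':'):
    # Hypothetical wrapper around the placeholder trick.
    if DBL_ESC in s or delim in DBL_ESC:
        raise ValueError('placeholder collides with the input or the delimiter')
    hidden = s.replace('\\\\', DBL_ESC)  # hide escaped backslashes
    pieces = re.split(r'(?<!\\)' + re.escape(delim), hidden)
    return [p.replace(DBL_ESC, '\\\\') for p in pieces]  # restore them

s = r"Hello:World\:Goodbye\\:Cruel\\\:World"
print(split_with_placeholder(s))
# ['Hello', 'World\\:Goodbye\\\\', 'Cruel\\\\\\:World']
```

Hiding the double escapes first is what lets the single-character lookbehind treat the colon after `Goodbye\\` as a real delimiter.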
answered 2017-10-13T07:37:38.447
1

There is no built-in function for this. Here is an efficient, general, and tested function that even supports delimiters of any length:

def escape_split(s, delim):
    i, res, buf = 0, [], ''
    while True:
        j, e = s.find(delim, i), 0
        if j < 0:  # end reached
            return res + [buf + s[i:]]  # add remainder
        while j - e and s[j - e - 1] == '\\':
            e += 1  # number of escapes
        d = e // 2  # number of double escapes
        if e != d * 2:  # odd number of escapes
            buf += s[i:j - d - 1] + s[j]  # add the escaped char
            i = j + 1  # and skip it
            continue  # add more to buf
        res.append(buf + s[i:j - d])
        i, buf = j + len(delim), ''  # start after delim
answered 2015-03-17T19:02:19.157
1

I think a simple C-style parse would be simpler and more robust.

def escaped_split(text, ch):
    if len(ch) > 1:
        raise ValueError('Expected split character. Found string!')
    out = []
    part = ''
    escape = False  # True when the previous character was an unescaped backslash
    for c in text:
        if not escape and c == ch:
            out.append(part)
            part = ''
        else:
            part += c
            escape = not escape and c == '\\'
    if len(part):
        out.append(part)
    return out
answered 2017-03-31T10:01:57.227
0

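I created this method, which is inspired by Henry Keiter's answer but has the following advantages:

  • Configurable escape character and delimiter
  • The escape character is not removed if it does not actually escape anything

Here is the code: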

def _split_string(self, string: str, delimiter: str, escape: str) -> [str]:
    result = []
    current_element = []
    iterator = iter(string)
    for character in iterator:
        if character == escape:
            try:
                next_character = next(iterator)
                if next_character != delimiter and next_character != escape:
                    # Do not copy the escape character if it is intended to escape either the delimiter or the
                    # escape character itself. Copy the escape character if it is not in use to escape one of these
                    # characters.
                    current_element.append(escape)
                current_element.append(next_character)
            except StopIteration:
                current_element.append(escape)
        elif character == delimiter:
            # split! (add current to the list and reset it)
            result.append(''.join(current_element))
            current_element = []
        else:
            current_element.append(character)
    result.append(''.join(current_element))
    return result

Here is test code that shows the behavior:

def test_split_string(self):
    # Verify normal behavior
    self.assertListEqual(['A', 'B'], list(self.sut._split_string('A+B', '+', '?')))

    # Verify that escape character escapes the delimiter
    self.assertListEqual(['A+B'], list(self.sut._split_string('A?+B', '+', '?')))

    # Verify that the escape character escapes the escape character
    self.assertListEqual(['A?', 'B'], list(self.sut._split_string('A??+B', '+', '?')))

    # Verify that the escape character is just copied if it doesn't escape the delimiter or escape character
    self.assertListEqual(['A?+B'], list(self.sut._split_string('A?+B', '\'', '?')))
answered 2017-10-03T15:08:13.370
0

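I know this is an old question, but I recently needed a function like this and did not find one that met my requirements.

Rules:

  • The escape char only takes effect when followed by the escape char or the delimiter. Ex.: if the delimiter is / and the escape is \ then \a\b\c/abc becomes ['\a\b\c', 'abc']
  • Multiple escape chars are unescaped (\\ becomes \)

So, for the record, in case someone is looking for something similar, here is my proposed function: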

def str_escape_split(str_to_escape, delimiter=',', escape='\\'):
    """Split a string on a delimiter, honoring an escape char

    Args:
        str_to_escape (str): The text to be split
        delimiter (str, optional): Delimiter used. Defaults to ','.
        escape (str, optional): The escape char. Defaults to a backslash.

    Yields:
        str: the next token, with escapes resolved
    """
    if len(delimiter) > 1 or len(escape) > 1:
        raise ValueError("Both delimiter and escape must be single characters")
    token = ''
    escaped = False
    for c in str_to_escape:
        if c == escape:
            if escaped:
                token += escape
                escaped = False
            else:
                escaped = True
            continue
        if c == delimiter:
            if not escaped:
                yield token
                token = ''
            else:
                token += c
                escaped = False
        else:
            if escaped:
                token += escape
                escaped = False
            token += c
    yield token

For sanity's sake, I ran some tests:

# The structure is:
# 'string_be_split_escaped', [list_with_result_expected]
tests_slash_escape = [
    ('r/casa\\/teste/g', ['r', 'casa/teste', 'g']),
    ('r/\\/teste/g', ['r', '/teste', 'g']),
    ('r/(([0-9])\\s+-\\s+([0-9]))/\\g<2>\\g<3>/g',
     ['r', '(([0-9])\\s+-\\s+([0-9]))', '\\g<2>\\g<3>', 'g']),
    ('r/\\s+/ /g', ['r', '\\s+', ' ', 'g']),
    ('r/\\.$//g', ['r', '\\.$', '', 'g']),
    ('u///g', ['u', '', '', 'g']),
    ('s/(/[/g', ['s', '(', '[', 'g']),
    ('s/)/]/g', ['s', ')', ']', 'g']),
    ('r/(\\.)\\1+/\\1/g', ['r', '(\\.)\\1+', '\\1', 'g']),
    ('r/(?<=\\d) +(?=\\d)/./', ['r', '(?<=\\d) +(?=\\d)', '.', '']),
    ('r/\\\\/\\\\\\/teste/g', ['r', '\\', '\\/teste', 'g'])
]

tests_bar_escape = [
    ('r/||/|||/teste/g', ['r', '|', '|/teste', 'g'])
]

def test(test_array, escape):
    """From input data, test escape functions

    Args:
        test_array (list): (input string, expected result) pairs
        escape (str): the escape char to test with
    """
    for t in test_array:
        resg = str_escape_split(t[0], '/', escape)
        res = list(resg)
        if res == t[1]:
            print(f"Test {t[0]}: {res} - Pass!")
        else:
            print(f"Test {t[0]}: {t[1]} != {res} - Failed! ")


def test_all():
    test(tests_slash_escape, '\\')
    test(tests_bar_escape, '|')


if __name__ == "__main__":
    test_all()
answered 2020-07-26T13:44:02.240
-4

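Note that : does not appear to be a character that needs escaping.

The simplest way I can think of to accomplish this is to split on the character, and then add it back in wherever it was escaped.

Sample code (in much need of some tidying up):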

def splitNoEscapes(string, char):
    sections = string.split(char)
    # endswith() also copes with empty sections, where indexing [-1] would raise IndexError
    sections = [i + (char if i.endswith("\\") else "") for i in sections]
    result = ["" for i in sections]
    j = 0
    for s in sections:
        result[j] += s
        j += (0 if s.endswith(char) else 1)
    return [i for i in result if i != ""]
answered 2013-08-07T00:01:08.383