python - How should I remove one string from the start of another, given that I know the longer string matches case-insensitively?

Question

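Suppose I have a workflow that involves examining the start of a long string (LS, say) to see whether it begins with a shorter string SS. If it does, I chop off the matching part of LS and do something with the remainder. Otherwise, I do something else. (The specific case that prompted this question was a parsing library.)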

def do_thing(LS, SS):
    if (LS.startswith(SS)):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()

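That's simple enough. Now suppose I want to do the same thing, but this time I want the strings to be matched case-insensitively. It is possible to test whether "LS.startswith(SS), but case-insensitively". But how should I determine how much of LS to "chop off" when I pass it to action_on_match()? Using len(SS) as before isn't sufficient, because if I am uppercasing, lowercasing or casefolding things, the length of the matching prefix of LS might not be what I expect: changing the case of a string can change its length. It is important that the part of LS passed to action_on_match() be exactly what the program received as input (after the cutoff point, of course).

Answerers have suggested using lower() and continuing to use len(SS), but this doesn't work: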

Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:15:05) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def action_on_match (s): return "Match: %s" % s
...
>>> def action_on_no_match (): return "No match"
...
>>> def do_thing (LS, SS):
...     if LS.lower().startswith(SS.lower()):
...         return action_on_match(LS[len(SS):])
...     else:
...         return action_on_no_match()
...
>>> do_thing('i\u0307asdf', '\u0130')
'Match: \u0307asdf'
>>>

Here we would like to see 'Match: asdf', but there is an extra character.
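To make the length problem concrete, here is a small sketch that reports how the length of a string changes under each case operation. The helper name `case_lengths` is made up for this illustration; only the built-in `str` methods are used.

```python
def case_lengths(s):
    """Return the length of s and of its lower/upper/casefold forms."""
    return {
        "original": len(s),
        "lower": len(s.lower()),
        "upper": len(s.upper()),
        "casefold": len(s.casefold()),
    }

# U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) grows when lowercased,
# and U+00DF (LATIN SMALL LETTER SHARP S) grows when uppercased or casefolded.
for sample in ("\u0130", "\u00df"):
    print(ascii(sample), case_lengths(sample))
```

Because the one-character SS '\u0130' lowercases to two characters, len(SS) undercounts the matched prefix of LS by one, which is where the stray '\u0307' in the output above comes from.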

score 3 · Accepted Answer

It's simple:

def do_thing(LS, SS):
    if LS.lower().startswith(SS.lower()):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()

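All I'm doing is lowercasing both LS and SS and then comparing them. For very long strings this will be much slower than the regex solution, because it first has to convert the entire string to lowercase.

The regex solution looks like this: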

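import re

def do_thing(LS, SS):
    # re.escape() makes SS match literally even if it contains regex
    # metacharacters; re.match() already anchors at the start of LS.
    if re.match(re.escape(SS), LS, re.I):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()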
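One possible variation on the regex approach, not part of the answer as written, is to take the cut position from the match object rather than from len(SS). The end of the match is measured in the original LS, so the remainder is sliced from the input exactly as received. The function name `remainder_after_match` is hypothetical, and this sketch makes no claim about how re.I treats case mappings that span more than one character.

```python
import re

def remainder_after_match(ls, ss):
    """Return ls with a case-insensitive prefix ss removed, or None if no match."""
    # re.escape() keeps any regex metacharacters in ss literal.
    m = re.match(re.escape(ss), ls, re.IGNORECASE)
    if m is None:
        return None
    # m.end() is an index into the original ls, not into a lowered copy.
    return ls[m.end():]

print(remainder_after_match("FOObar", "foo"))
```

The caller can then branch on None to choose between action_on_match() and action_on_no_match().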

Performance

For a short string (len(LL) == 8 characters) over 1000000 iterations:

lower() method: 0.86s (winner)
re method: 1.91s

For a long string (len(LL) == 600 characters) over 1000000 iterations:

lower() method: 2.54s
re method: 1.96s (winner)
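The benchmark code is not shown, so the following is only a rough sketch of how such a comparison could be set up with timeit. The function names, the test strings and the reduced iteration count are all assumptions; absolute timings will differ by machine and Python version.

```python
import re
import timeit

def lower_match(ls, ss):
    # Lowercases the whole of ls before comparing.
    return ls.lower().startswith(ss.lower())

def re_match(ls, ss):
    # Only inspects the start of ls.
    return re.match(re.escape(ss), ls, re.I) is not None

def time_both(ls, ss, number=10000):
    """Return (seconds for lower(), seconds for re) over `number` calls each."""
    t_lower = timeit.timeit(lambda: lower_match(ls, ss), number=number)
    t_re = timeit.timeit(lambda: re_match(ls, ss), number=number)
    return t_lower, t_re

short_ls = "Foobar12"               # 8 characters
long_ls = "Foobar" + "x" * 594      # 600 characters

for label, ls in (("short", short_ls), ("long", long_ls)):
    t_lower, t_re = time_both(ls, "foo")
    print("%s: lower() %.4fs, re %.4fs" % (label, t_lower, t_re))
```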

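Unicode combining characters

For Unicode combining characters, you need to normalize the data first. This means converting any precomposed characters into their component parts. For example, you will find: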

>>> from unicodedata import normalize
>>> '\u0130' == 'I\u0307'
False
>>> normalize("NFD", '\u0130') == normalize("NFD", 'I\u0307')
True

You will need to perform this normalization on your inputs:

SS = normalize("NFD", SS)
LS = normalize("NFD", LS)
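Normalizing LS changes the text that is later passed on, which may matter if the remainder has to be exactly what was received. As an alternative that is not taken from this answer, the sketch below walks the original LS one character at a time, casefolds each character, and stops once the casefolded SS has been consumed. It assumes that str.casefold() of a string equals the concatenation of the casefolds of its characters, and it does no normalization at all. The function name `remainder_after_prefix` is hypothetical.

```python
def remainder_after_prefix(ls, ss):
    """Return the part of ls after a caseless prefix ss, or None if ss is not a prefix.

    The cut point is always a character boundary of the original ls.
    """
    target = ss.casefold()
    pos = 0  # number of characters of target matched so far
    for i, ch in enumerate(ls):
        if pos == len(target):
            return ls[i:]
        folded = ch.casefold()  # may be longer than one character
        if not target.startswith(folded, pos):
            return None
        pos += len(folded)
    # ls ran out: it is a match only if target was consumed exactly.
    return "" if pos == len(target) else None

print(remainder_after_prefix("i\u0307asdf", "\u0130"))
```

With the inputs from the question this yields 'asdf' with no leftover combining character, and it returns None when the prefix would end in the middle of a character's casefolded form.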

score 0 · Accepted Answer

Just use str.lower; the length of "FOO" will be the same as that of "foo".lower():

LS.lower().startswith(SS.lower())



def do_thing(ls, ss):
    if ls.startswith(ss):
        action_on_match(ls[len(ss):])
    else:
        action_on_no_match()
