python - Removing consecutive characters from the start of a string

Question

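What is the best way to strip the letters that sometimes appear at the start of Wikipedia references?

For example, from

a b c d Star Wars Episode III: Revenge of the Sith (DVD). 20th Century Fox. 2005.

to

Star Wars Episode III: Revenge of the Sith (DVD). 20th Century Fox. 2005.

I have put together a working solution, but it feels clunky. My version uses a regular expression of the form '^(?:a (?:b (?:c )?)?)?'. What is the correct, fast way to do this?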

import re

# Build '^(?:a (?:b (?:c ... )?)?)?' for the whole alphabet.
a = list('abcdefghijklmnopqrstuvwxyz')
reg = "^%s%s" % ("".join(["(?:%s " % b for b in a]), ")?" * len(a))
re.sub(reg, "", "a b c d Wikipedia Reference")

score 1 · Accepted Answer

How about using a character class in the regular expression, i.e.:

re.sub('^([a-z] )*', '', ...)

This should remove any number of leading occurrences of a single lowercase letter followed by a single space.
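As a minimal sketch of the character-class idea above, the pattern can be wrapped in a small helper. The names `LEADING_LETTERS` and `strip_leading_letters` are hypothetical and do not come from the question; the pattern uses a non-capturing group, which behaves the same here.

```python
import re

# Zero or more repetitions of "one lowercase ASCII letter + one space",
# anchored at the start of the string.
LEADING_LETTERS = re.compile(r'^(?:[a-z] )*')

def strip_leading_letters(text):
    """Remove leading single-letter markers such as 'a b c d '."""
    return LEADING_LETTERS.sub('', text)

print(strip_leading_letters("a b c d Wikipedia Reference"))
```

One caveat: a title that itself begins with a one-letter lowercase word (for example "a new hope") would lose that word too, since the pattern cannot tell it apart from a marker.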

score 1 · Accepted Answer

I would probably do something like this:

title = re.sub(r'^([a-z]\s)*', '', 'a b c d Wikipedia Reference')

For this input it gives the same result as your version. But as @joran-beasley pointed out, more complicated cases may need a smarter approach.

score 1 · Accepted Answer

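If you are copying and pasting web page text rather than processing the HTML, some of the problems mentioned in the question are unavoidable. But if you process the HTML with htmllib (the relevant line is shown below), you can remove items such as the c link as single units. [Edit: I now see that htmllib is deprecated; I don't know the correct replacement, but I believe it is HTMLParser.]

The displayed line looks something like

^ ^a ^b ^c ^d ^e Star Wars: Episode III Revenge of the Sith DVD commentary featuring George Lucas, Rick McCallum, Rob Coleman, John Knoll and Roger Guyett, [2005]

The HTML source for that line is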

<li id="cite_note-DVDcom-13">^ <a href="#cite_ref-DVDcom_13-0">a</a> <a href="#cite_ref-DVDcom_13-1">b</a> <a href="#cite_ref-DVDcom_13-2">c</a> <a href="#cite_ref-DVDcom_13-3">d</a> <a href="#cite_ref-DVDcom_13-4">e</a> Star Wars: Episode III Revenge of the Sith DVD commentary featuring George Lucas, Rick McCallum, Rob Coleman, John Knoll and Roger Guyett, [2005]</li>
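A rough sketch of this approach, using `html.parser.HTMLParser` from the Python 3 standard library (htmllib was removed in Python 3). It assumes the back-links always have an `href` starting with `#cite_ref`, as in the line above; the class name `CitationText` and the function `citation_text` are made up for illustration.

```python
from html.parser import HTMLParser

class CitationText(HTMLParser):
    """Collect the text of a citation, skipping text inside back-links."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a cite_ref back-link

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            # Assumption: back-links point at "#cite_ref..." anchors.
            if self.skip_depth or href.startswith('#cite_ref'):
                self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def citation_text(html):
    parser = CitationText()
    parser.feed(html)
    parser.close()
    # Drop the leading caret and the whitespace left between removed links.
    return ''.join(parser.parts).lstrip('^ \t\n')
```

Because each `<a>` element is dropped as a whole, the link text is never confused with the first letters of the title, which a regular expression on the pasted text cannot guarantee.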

score 0 · Accepted Answer

Do they always follow this pattern, i.e. the title is preceded by four extra letters separated by spaces? If so, you could do this:

s = "a b c d Star Wars Episode III: Revenge of the Sith (DVD). 20th Century Fox. 2005."
# If the first four tokens are all single letters, skip them ("a b c d " is 8 characters).
if all(len(x) == 1 and x.isalpha() for x in s.split()[0:4]):
    print(s[8:])
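If the number of leading letters varies, the same token-based check might be generalized along these lines. This is only a sketch, not the answerer's code; `drop_leading_single_letters` is a hypothetical name, and the function collapses runs of whitespace into single spaces as a side effect of `split()`.

```python
from itertools import dropwhile

def drop_leading_single_letters(text):
    """Drop every leading token that is exactly one alphabetic character."""
    tokens = text.split()
    kept = dropwhile(lambda t: len(t) == 1 and t.isalpha(), tokens)
    return ' '.join(kept)

s = "a b c d Star Wars Episode III: Revenge of the Sith (DVD). 20th Century Fox. 2005."
print(drop_leading_single_letters(s))
```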
