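In the regular-expression world, "matching" is usually easier than "splitting". When you match, you tell the regex engine directly what kind of substrings you are looking for, instead of concentrating on the separator characters. The requirements in your question are a bit unclear, but let's assume:
- "lastname" is everything before the first comma
- "name" is everything before the "office"
- "office" consists of non-space characters at the end of the string
This translates into regular-expression language like this: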
rr = r"""
^ # begin
([^,]+) # match everything but a comma
(.+?) # match everything, until next match occurs
(\S+) # non-space characters
$ # end
"""
Test:
import re
rr = re.compile(rr, re.VERBOSE)
print(rr.findall("de Batz de Castelmore d'Artagnan, Charles Ogier W.12.345"))
# [("de Batz de Castelmore d'Artagnan", ', Charles Ogier ', 'W.12.345')]
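Note that the second group in this result still carries the leading comma and the surrounding spaces. A minimal sketch of one way to tidy that up after matching, assuming Python 3; the helper name `split_record` is hypothetical and not part of the original answer:

```python
import re

# Same three-part pattern as above, compiled in verbose mode.
PATTERN = re.compile(r"""
    ^           # begin
    ([^,]+)     # last name: everything but a comma
    (.+?)       # first names, still including the leading comma and space
    (\S+)       # office: non-space characters
    $           # end
""", re.VERBOSE)

def split_record(line):
    """Hypothetical helper: return (lastname, name, office), or None if no match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    lastname, name, office = m.groups()
    # Strip the comma and whitespace that the lazy middle group picked up.
    return lastname.strip(), name.strip(", \t"), office

print(split_record("de Batz de Castelmore d'Artagnan, Charles Ogier W.12.345"))
```

Using `match` instead of `findall` makes the "no match" case explicit, since it returns `None` rather than an empty list.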
Update:
rr = r"""
^ # begin
([^,]+) # match everything but a comma
[,\s]+ # a comma and spaces
(.+?) # match everything until the next match
\s* # spaces
([A-Z]) # an uppercase letter
\. # a dot
(\d+) # some digits
\. # a dot
(\d+) # some digits
\s* # maybe some spaces or newlines
$ # end
"""
import re
rr = re.compile(rr, re.VERBOSE)
s = 'Wegner, Sven Ake G.15.10\n'
print(rr.findall(s))
# [('Wegner', 'Sven Ake', 'G', '15', '10')]
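With five groups, positional tuples become hard to read. As a sketch only, the same pattern can be written with named groups so the result comes back as a dictionary. The group names `letter`, `first_number` and `second_number` and the function `parse_office_line` are assumptions made for illustration; the question does not say what the parts of the office code mean.

```python
import re

# Same structure as the updated pattern, with named groups (names are assumed).
OFFICE_RE = re.compile(r"""
    ^                        # begin
    (?P<lastname>[^,]+)      # everything but a comma
    [,\s]+                   # a comma and spaces
    (?P<name>.+?)            # everything until the next match
    \s*                      # spaces
    (?P<letter>[A-Z])        # an uppercase letter
    \.                       # a dot
    (?P<first_number>\d+)    # some digits
    \.                       # a dot
    (?P<second_number>\d+)   # some digits
    \s*                      # maybe some spaces or newlines
    $                        # end
""", re.VERBOSE)

def parse_office_line(line):
    """Hypothetical helper: return a dict of the named parts, or None if no match."""
    m = OFFICE_RE.match(line)
    return m.groupdict() if m else None

print(parse_office_line('Wegner, Sven Ake G.15.10\n'))
```

Lines that do not fit the expected shape yield `None`, which is easier to check for than an empty `findall` result.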