Try this code:
import re

text = u"BBC \xe2 abc - Here is the text"
m = re.search(ur'^(.*? [-\xe2] )?(.*)', text, re.UNICODE)
# or equivalently (re.match always anchors at the start of the string):
# m = re.match(ur'(.*? [-\xe2] )?(.*)', text, re.UNICODE)
# You don't really need re.UNICODE here, but if you want to use unicode
# characters it's better to consider â to be a letter :-) , so re.UNICODE
# group(1) contains the part before the hyphen (including the " - " or
# " â " separator itself)
if m.group(1) is not None:
    print m.group(1)
# group(2) contains the part after the hyphen, or the whole string
# if there is no hyphen
print m.group(2)
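The snippet above is Python 2 (`ur'...'` literals and the `print` statement). As a minimal sketch, assuming Python 3, the same pattern could be wrapped in a small helper; the name `split_prefix` is hypothetical and only used for illustration:

```python
import re

# Same pattern as above. In Python 3 every str is Unicode, so no u/ur prefix is needed;
# the re module itself interprets the \xe2 escape inside the raw string.
PATTERN = re.compile(r'^(.*? [-\xe2] )?(.*)')

def split_prefix(text):
    """Return (prefix, rest); prefix is None when no ' - ' or ' â ' separator is found."""
    m = PATTERN.match(text)
    return m.group(1), m.group(2)

print(split_prefix("BBC \xe2 abc - Here is the text"))
print(split_prefix("No separator here"))
```

Because the first group is optional and the second can match an empty string, the pattern matches any input, so the helper does not check for a failed match.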
Explanation of the regular expression:
^ is the beginning of the string (the match method always anchors at the
beginning of the string, so it does not need the ^)
(...) creates a capturing group (something that will go in group(...))
(...)? is an optional group
[-\xe2] one character, either - or \xe2 (you can put any number of characters
in the [], e.g. [abc] means a or b or c)
.*? [-\xe2] (there is a space after the ]) any characters followed by a space, a hyphen (or \xe2) and a space
the *? means that the * is "lazy", so it will try to match only the
minimum possible number of characters. For ABC - DEF - GHI,
.* - would match ABC - DEF -, while .*? - would match ABC -
so
(.*? [-\xe2] )? the string may start with any characters followed by a hyphen;
if it does, that part goes in group(1), if not, group(1) will be None
(.*) and it will be followed by any characters. You don't need the
$ (that is the end of the string, the opposite of ^) because * will
always eat all the characters it can eat (it's a greedy operator)
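
The difference between the greedy `*` and the lazy `*?`, and the meaning of the `[-\xe2]` character class, can be checked directly. This is a small sketch assuming Python 3; the variable names are only illustrative:

```python
import re

s = "ABC - DEF - GHI"

# Greedy: .* grabs as much as possible, then backtracks to the LAST " - "
greedy = re.match(r'.* - ', s).group(0)
# Lazy: .*? grabs as little as possible, stopping at the FIRST " - "
lazy = re.match(r'.*? - ', s).group(0)

print(repr(greedy))  # 'ABC - DEF - '
print(repr(lazy))    # 'ABC - '

# Character class: [-\xe2] matches a single '-' or a single 'â'
separators = re.findall(r'[-\xe2]', "a - b \xe2 c")
print(separators)    # ['-', 'â']
```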