python - Identifying retweeters in a tweet containing Chinese characters with a Python regular expression

Question

Given a tweet from Sina Weibo:

  tweet = "//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户，摆脱屌丝男！！！//@MarkGreene: 转发微博"

Note that there is a space between // and @诺什.

I want to get the list of retweeters, like this:

  result = ['lilei', 'Bob', 'Girl', '魏武', 'MarkGreene']

I was thinking of using the following script:

import re

RTpattern = r'''//?@(\w+)'''
rt = re.findall(RTpattern, tweet)

However, it fails to capture the Chinese name "魏武".

score 2 · Accepted Answer

Use the re.UNICODE flag:

re.UNICODE
Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character 
properties database.

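# -*- coding: utf-8 -*-
import re

tweet = u"//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户，摆脱屌丝男！！！//@MarkGreene: 转发微博"
RTpattern = r'''//?@(\w+)'''
# re.UNICODE lets \w match non-ASCII word characters such as Chinese
for word in re.findall(RTpattern, tweet, re.UNICODE):
    print(word)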

# lilei
# Bob
# Girl
# 魏武
# MarkGreene
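
The answer above targets Python 2 (note the u"" literal). On Python 3, str patterns are Unicode-aware by default, so re.UNICODE is redundant there. The following is a minimal sketch under that assumption; the helper name find_retweeters is hypothetical and not part of the original answer. It also uses re.ASCII to show, by analogy, what ASCII-only matching does to the Chinese name:

```python
import re

# Same pattern as in the answer; compiled once for reuse.
RT_PATTERN = re.compile(r'//?@(\w+)')


def find_retweeters(tweet):
    """Return the names that directly follow '//@' (or '/@').

    Hypothetical helper for illustration only.
    """
    return RT_PATTERN.findall(tweet)


tweet = "//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户，摆脱屌丝男！！！//@MarkGreene: 转发微博"

# Python 3 default: \w matches Unicode word characters, including Chinese.
retweeters = find_retweeters(tweet)
print(retweeters)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the Chinese name is skipped.
ascii_only = re.findall(r'//?@(\w+)', tweet, re.ASCII)
print(ascii_only)
```

Because the pattern requires @ to follow the slashes immediately, "// @诺什" (with a space) is not matched, which agrees with the expected result in the question.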
