regex - String separation in a required format, the Pythonic way? (with or without regex)

Question

I have a string of the format:

t='@abc @def Hello this part is text'

I want to get this:

l=["abc", "def"] 
s='Hello this part is text'

I did this:

a=t[:t.find(' ',t.rfind('@'))].strip()
s=t[t.find(' ',t.rfind('@')):].strip()
b=a.split('@')
l=[i.strip() for i in b][1:]

It works in most cases, but it fails when the text part contains an '@'. For example, when:

t='@abc @def My email is red@hjk.com'

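it fails. The @names are at the beginning, and there can be text after the @names, which may itself contain '@'.

Clearly, I could append a space at the start and find the first word that has no '@'. But that doesn't seem like an elegant solution.

What is a pythonic way of solving this?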

score 13 · Accepted Answer

Building on MrTopf's effort:

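import re
# raw string, so the backslash in \w reaches the regex engine unchanged
rx = re.compile(r"((?:@\w+ +)+)(.*)")
t='@abc   @def  @xyz Hello this part is text and my email is foo@ba.r'
a,s = rx.match(t).groups()
# splitting on runs of '@' and ' ' leaves an empty string at each end; drop both
l = re.split('[@ ]+',a)[1:-1]
print(l)
print(s)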

prints:

['abc', 'def', 'xyz']
Hello this part is text and my email is foo@ba.r

As requested by hasen j, let me clarify how this works:

/@\w+ +/

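matches a single tag: an @ followed by at least one alphanumeric character or _, followed by at least one space character. The + is greedy, so if there is more than one space, it grabs them all.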

To match any number of these tags, we need to add a plus (one or more of the thing) to the tag pattern, so we need to group it with parentheses:

/(@\w+ +)+/

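which matches one or more tags and, being greedy, matches all of them. However, those parentheses now interfere with our capture groups, so we undo that by making them into an anonymous (non-capturing) group: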

/(?:@\w+ +)+/

Finally, we wrap that in a capture group and add a second one to sweep up the rest:

/((?:@\w+ +)+)(.*)/

A last breakdown to sum up:

((?:@\w+ +)+)(.*)
 (?:@\w+ +)+
 (  @\w+ +)
    @\w+ +

Note that while reviewing this, I improved it: \w didn't need to be in a set, and it now allows multiple spaces between tags. Thanks, hasen-j!
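The pattern above returns no match when the string has no leading tags, so `rx.match(t).groups()` would raise. A minimal sketch of one way around that, assuming it is acceptable to relax the outer `+` to `*`; the helper name `split_tags` is hypothetical and not part of the original answer:

```python
import re

# Same pattern as in the answer, except the outer quantifier is `*` instead
# of `+`, so a string without leading tags still matches (assumption).
TAGS_THEN_TEXT = re.compile(r"((?:@\w+ +)*)(.*)")

def split_tags(t):
    """Return (list of tag names, remaining text). Hypothetical helper."""
    tag_part, text = TAGS_THEN_TEXT.match(t).groups()
    # Pull the names out of the tag prefix without the leading '@'.
    tags = re.findall(r"@(\w+)", tag_part)
    return tags, text

tags, text = split_tags('@abc   @def  @xyz Hello, my email is foo@ba.r')
print(tags)  # ['abc', 'def', 'xyz']
print(text)  # Hello, my email is foo@ba.r
```

Because each tag must still be followed by at least one space, a string that consists of a single tag with nothing after it (such as '@abc') is treated as text rather than as a tag.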

score 7 · Answer

t='@abc @def Hello this part is text'

words = t.split(' ')

names = []
while words:
    w = words.pop(0)
    if w.startswith('@'):
        names.append(w[1:])
    else:
        break

text = ' '.join(words)

print(names)
print(text)

score 5 · Answer

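How about this:

1. Split by space.
2. For each word, check:

2.1. If the word starts with @, push it onto the first list.

2.2. Otherwise, stop and join the remaining words with spaces.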
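The steps above can be sketched with `itertools.takewhile`, which stops at the first word that does not start with '@'. This is only an illustration: the function name `split_leading_tags` is hypothetical, and the sketch assumes the tags are separated by single spaces.

```python
from itertools import takewhile

def split_leading_tags(t):
    """Return (tag names, remaining text). Hypothetical helper."""
    words = t.split(' ')                                   # step 1
    # step 2.1: collect words only while they start with '@'
    tags = [w[1:] for w in takewhile(lambda w: w.startswith('@'), words)]
    # step 2.2: everything after the tags is the text, spaces restored
    text = ' '.join(words[len(tags):])
    return tags, text

print(split_leading_tags('@abc @def My email is red@hjk.com'))
# (['abc', 'def'], 'My email is red@hjk.com')
```

Splitting on a single space keeps the spacing inside the text intact, but a double space between two tags produces an empty word that ends the tag run early.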

score 3 · Answer

[i.strip('@') for i in t.split(' ', 2)[:2]]     # for a fixed number of @def
a = [i.strip('@') for i in t.split(' ') if i.startswith('@')]
s = ' '.join(i for i in t.split(' ') if not i.startswith('@'))

score 3 · Answer

You might also use regular expressions:

import re
rx = re.compile(r"@([\w]+) @([\w]+) (.*)")
t='@abc @def Hello this part is text and my email is foo@ba.r'
a,b,s = rx.match(t).groups()

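But it all depends on what your data looks like, so you might need to adjust it. What it basically does is create groups via () and check what is allowed inside them.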
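To show why the pattern may need adjusting, here is a small sketch of how it behaves when the number of tags is not exactly two. The sample strings are made up for illustration.

```python
import re

rx = re.compile(r"@([\w]+) @([\w]+) (.*)")

# Exactly two tags: each () becomes one entry in groups().
two = rx.match('@abc @def Hello, mail me at foo@ba.r')
print(two.groups())    # ('abc', 'def', 'Hello, mail me at foo@ba.r')

# Only one tag: the second "@..." is missing, so there is no match at all.
one = rx.match('@abc Hello')
print(one)             # None

# Three tags: the third tag is swallowed by the final (.*) group.
three = rx.match('@a @b @c rest')
print(three.groups())  # ('a', 'b', '@c rest')
```

Since `match` returns `None` on failure, checking the result before calling `.groups()` avoids an `AttributeError`.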

score 3 · Answer

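[edit: this implements what Osama suggested above]

This builds L from the @ variables at the beginning of the string and then, once a non-@ word is found, just grabs the rest of the string.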

t = '@one @two @three some text   afterward with @ symbols@ meow@meow'

words = t.split(' ')         # split into list of words based on spaces
L = []
s = ''
for i in range(len(words)):  # go through each word
    word = words[i]
    if word.startswith('@'): # grab @'s from beginning of string (safe on empty words)
        L.append(word[1:])
        continue
    s = ' '.join(words[i:])  # put spaces back in
    break                    # you can ignore the rest of the words

You could refactor this into less code, but I'm trying to make what's going on obvious.
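One possible shorter refactoring of the loop above, as a sketch only: `enumerate` gives the index of the first non-@ word, and slicing does the rest. The function name `split_at_first_plain_word` is hypothetical.

```python
def split_at_first_plain_word(t):
    """Return (tag names, remaining text). Hypothetical helper."""
    words = t.split(' ')
    for i, word in enumerate(words):
        if not word.startswith('@'):
            # Everything before index i is a tag; the rest is the text.
            return [w[1:] for w in words[:i]], ' '.join(words[i:])
    # Every word was a tag, so there is no text left.
    return [w[1:] for w in words], ''

L, s = split_at_first_plain_word(
    '@one @two @three some text   afterward with @ symbols@ meow@meow')
print(L)  # ['one', 'two', 'three']
print(s)  # some text   afterward with @ symbols@ meow@meow
```

Because the string is split on a single space and joined with a single space, runs of spaces inside the text survive unchanged.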

score 1 · Answer

Here's just another variation that uses split() and no regular expressions:

t='@abc @def My email is red@hjk.com'
tags = []
words = iter(t.split())

# iterate over words until first non-tag word
for w in words:
  if not w.startswith("@"):
    # join this word and all the following
    s = " ".join([w] + list(words))  # no trailing space when w is the last word
    break
  tags.append(w[1:])
else:
  s = "" # handle string with only tags

print(tags, s)

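Here's a shorter but perhaps slightly cryptic version that uses a regex to find the first whitespace character followed by a non-@ character: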

import re
t = '@abc @def My email is red@hjk.com @extra bye'
m = re.search(r"\s([^@].*)$", t)
tags = [tag[1:] for tag in t[:m.start()].split()]
s = m.group(1)
print(tags, s) # ['abc', 'def'] My email is red@hjk.com @extra bye

This doesn't work properly if there are no tags or no text. The format is underspecified. You'll need to provide more test cases to validate.
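For the cases with no tags or no text, one option is to anchor the match at the start of the string rather than search for a space. This is a different technique from the `re.search` version above. The sketch assumes a tag is an '@' followed by at least one non-space character, and the name `split_tags_anchored` is hypothetical.

```python
import re

# Group 1: zero or more "@token" items, each followed by whitespace or the
# end of the string. Group 2: whatever is left. re.S lets '.' span newlines.
PATTERN = re.compile(r"\s*((?:@\S+(?:\s+|$))*)(.*)$", re.S)

def split_tags_anchored(t):
    """Return (tag names, remaining text). Hypothetical helper."""
    m = PATTERN.match(t)  # always matches, since both groups may be empty
    tags = [tag[1:] for tag in m.group(1).split()]
    return tags, m.group(2)

print(split_tags_anchored('@abc @def My email is red@hjk.com @extra bye'))
# (['abc', 'def'], 'My email is red@hjk.com @extra bye')
print(split_tags_anchored('@abc @def'))   # (['abc', 'def'], '')
print(split_tags_anchored('just text'))   # ([], 'just text')
```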
