python - Grouping elements by an identifier in a regular expression

Question

I have a long string that looks like this:

s = 'label("id1","A") label("id1","B") label("id2", "C") label("id2","A") label("id2","D") label("id3","A")'

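I want to use a regular expression to build lists of labels grouped by id.

To be clearer, from the string s in the example, I want to end up with a list of results like this: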

[("id1", ["A","B"]),
 ("id2", ["C","A","D"]),
 ("id3", ["A"])]

Using a regular expression, I managed to get the ids and the elements:

import re
regex = re.compile(r'label\((\S*),(\S*)\)')
results = re.findall(regex,s)

With this code, results looks like this:

[('"id1"', '"A"'),
 ('"id1"', '"B"'),
 ('"id2"', '"A"'),
 ('"id2"', '"D"'),
 ('"id3"', '"A"')]

Is there a simple way to get the data already properly grouped from the regular expression?

score 1 · Accepted Answer

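You can loop over the findall() result and collect the pairs into a collections.defaultdict object. Do adjust your regular expression so that it does not capture the quotes, and add some whitespace tolerance, though: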

from collections import defaultdict
import re

regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')
results = defaultdict(list)

for id_, tag in regex.findall(s):
    results[id_].append(tag)

print(list(results.items()))
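To see what the whitespace tolerance buys, here is a small sketch that runs the question's pattern and the adjusted pattern over the same string and reports which pair the first one drops. The names strict, tolerant, and missing are made up for this illustration.

```python
import re

s = ('label("id1","A") label("id1","B") label("id2", "C") '
     'label("id2","A") label("id2","D") label("id3","A")')

# Pattern from the question: \S* cannot cross the space after the comma.
strict = re.compile(r'label\((\S*),(\S*)\)')
# Adjusted pattern: quotes are matched outside the groups, \s* allows spaces.
tolerant = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')

strict_pairs = strict.findall(s)
tolerant_pairs = tolerant.findall(s)

# Strip the captured quotes so both results can be compared directly.
stripped = [(a.strip('"'), b.strip('"')) for a, b in strict_pairs]
missing = [pair for pair in tolerant_pairs if pair not in stripped]

print(len(strict_pairs), len(tolerant_pairs))
print(missing)
```

The only pair the original pattern misses is the one written with a space after the comma, label("id2", "C").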

If all you want is unique values, you can replace list with set and append() with add().
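A minimal sketch of that set-based variant, assuming the same pattern as above. The sample string here contains a deliberately repeated label("id1","A"), which is made up for illustration and is not in the question's data.

```python
from collections import defaultdict
import re

# Illustrative input; the second label("id1","A") is an invented duplicate.
s = 'label("id1","A") label("id1","A") label("id1","B") label("id2", "C")'

regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')
unique = defaultdict(set)

for id_, tag in regex.findall(s):
    unique[id_].add(tag)  # adding an existing tag to a set is a no-op

# Sets are unordered, so sort each one to get a stable display.
unique_sorted = {id_: sorted(tags) for id_, tags in unique.items()}
print(unique_sorted)
```

Note that a set drops the original order of the tags, so this only fits when order does not matter.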

Demo:

>>> from collections import defaultdict
>>> import re
>>> s = 'label("id1","A") label("id1","B") label("id2", "C") label("id2","A") label("id2","D") label("id3","A")'
>>> regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')
>>> results = defaultdict(list)
>>> for id_, tag in regex.findall(s):
...     results[id_].append(tag)
... 
>>> list(results.items())
[('id1', ['A', 'B']), ('id2', ['C', 'A', 'D']), ('id3', ['A'])]

If needed, you can sort that result as well.
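Sorting could look like the sketch below, assuming you want the groups ordered by id. The input is shuffled on purpose so that the sort has something to do; that shuffled string is an illustration, not the question's data.

```python
from collections import defaultdict
import re

# Illustrative input with the ids deliberately out of order.
s = 'label("id2", "C") label("id1","A") label("id3","A") label("id1","B")'

regex = re.compile(r'label\("([^"]*)",\s*"([^"]*)"\)')
results = defaultdict(list)

for id_, tag in regex.findall(s):
    results[id_].append(tag)

# sorted() orders the (id, tags) tuples by id, since the ids are unique.
ordered = sorted(results.items())
print(ordered)
```

The ids are compared as strings here, so 'id10' would sort before 'id2'; pass a key function to sorted() if you need numeric ordering.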

score 0 · Accepted Answer

Would it be acceptable to post-process the results?

If so,

import re
# edited your regex to get rid of the extra quotes, and to allow for the possible space that occurs in label("id2", "C")
regex = re.compile(r'label\(\"(\S*)\",\ ?\"(\S*)\"\)')
results = re.findall(regex,s)
resultDict = {}
# id_ avoids shadowing the built-in id()
for id_, val in results:
    if id_ in resultDict:
        resultDict[id_].append(val)
    else:
        resultDict[id_] = [val]

# if you really want a list of tuples rather than a dictionary:
resultList = list(resultDict.items())
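The if/else branch in that loop can also be folded into one call to dict.setdefault, a standard dict method. The sketch below is an equivalent spelling of the same post-processing, run on the string from the question; the names result_dict and result_list are chosen for this illustration.

```python
import re

s = ('label("id1","A") label("id1","B") label("id2", "C") '
     'label("id2","A") label("id2","D") label("id3","A")')

# Same idea as the pattern above: no captured quotes, one optional space.
regex = re.compile(r'label\("(\S*)", ?"(\S*)"\)')

result_dict = {}
for id_, val in regex.findall(s):
    # setdefault returns the existing list, or stores and returns a new one.
    result_dict.setdefault(id_, []).append(val)

# On Python 3.7+ a dict keeps insertion order, so ids appear as first seen.
result_list = list(result_dict.items())
print(result_list)
```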
