OK, this may well overfit your sample, but it does fit your sample:
(Python 3)
sample= '''Header_A [HDA]
    data
    data
    data
    data
Header_B [HDB]
    data
    data
    data
    data
    Header_C [HDC]
        data
        data
        data
        data'''
lines=[{'raw':x} for x in sample.split('\n')]
largestIndent=0
# Record each line's indent level (4 spaces per level) and its stripped content.
for line in lines:
    line['indent']= (len(line['raw'])-len(line['raw'].lstrip()))//4
    line['content']= line['raw'].lstrip()
    if line['indent']>largestIndent:
        largestIndent=line['indent']
# Sentinel root at indent -1 so the top-level items have a parent to collect them.
lines=[{'indent':-1, 'content':'', 'raw':''}] + lines
# Work from the deepest level up, folding each run of children into its parent.
for depth in range(largestIndent,-1,-1):
    print ('depth={}'.format(depth))
    #print ('lines before ={}'.format(lines))
    children=[]
    for line in lines[::-1]:
        if line['indent']==depth:
            children=[line['content']]+children
        elif line['indent']==depth-1:
            if children !=[]:
                line['content']=[line['content']] + children
                children=[]
        else:
            pass
    #print ('lines after ={}'.format(lines))
# Drop the sentinel's empty content; what remains is the nested list.
outList=lines[0]['content'][1:]
print(outList)
Output (the last line printed, after the depth=... progress lines):
[['Header_A [HDA]', 'data', 'data', 'data', 'data'], ['Header_B [HDB]', 'data', 'data', 'data', 'data', ['Header_C [HDC]', 'data', 'data', 'data', 'data']]]
No regex!
As far as I know, it is not possible to make a regular expression intelligently parse arbitrarily nested text.
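For what it's worth, the same idea can also be done in a single pass with a stack instead of one pass per depth. This is only a rough sketch, not the code above: the names `parse_indented` and `indent_width` are made up for illustration, and it assumes indentation is spaces in multiples of 4 and that blank lines can be skipped. A line becomes a nested list only if the next line is indented deeper, which is how it copes with any nesting depth without regex.

```python
def parse_indented(text, indent_width=4):
    """Turn indentation-nested text into nested lists (hypothetical helper)."""
    # First pass: (indent level, stripped content) for every non-blank line.
    rows = []
    for raw in text.split('\n'):
        if raw.strip():
            content = raw.lstrip()
            rows.append(((len(raw) - len(content)) // indent_width, content))

    root = []
    # Stack of (level of the owning line, list collecting that line's children).
    # The root sits at level -1 so it is never popped.
    stack = [(-1, root)]
    for i, (level, content) in enumerate(rows):
        # Leave every open block whose owner is at this level or deeper.
        while stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1]
        # A line owns a block only if the following line is indented deeper.
        has_children = i + 1 < len(rows) and rows[i + 1][0] > level
        if has_children:
            node = [content]
            parent.append(node)
            stack.append((level, node))
        else:
            parent.append(content)
    return root


sample = (
    "Header_A [HDA]\n"
    "    data\n"
    "Header_B [HDB]\n"
    "    data\n"
    "    Header_C [HDC]\n"
    "        data\n"
)
print(parse_indented(sample))
# [['Header_A [HDA]', 'data'], ['Header_B [HDB]', 'data', ['Header_C [HDC]', 'data']]]
```

The stack holds the chain of currently open headers, so each line is looked at once no matter how deep the nesting goes.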