python - Regex to capture a multi-line body of text

Question

So I have some text documents that look like this:

1a  Title
        Subtitle
            Description
1b  Title
        Subtitle A
            Description
        Subtitle B
            Description
2   Title
        Subtitle A
            Description
        Subtitle B
            Description
        Subtitle C
            Description

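I'm trying to use a regex to capture the "Description" lines, which are indented by 3 tabs. The problem is that a description sometimes wraps onto the next line, which is again indented by 3 tabs. Here is an example: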

1   Demo
        Example
            This is the description text body that I am
            trying to capture with regex.

I would like to capture this text in a single group and end up with:

This is the description text body that I am trying to capture with regex.

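Once I can do that, I'd also like to "flatten" the document so that each section sits on one line, delimited by characters rather than by newlines and tabs. So my example would become: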

1->Demo->->Example->->->This is the description text...

I'll be implementing this in Python, but any regex guidance would be greatly appreciated!

UPDATE
I've changed the delimiters in the flattened text to show the original nesting level, i.e. 1 tab becomes ->, 2 tabs ->->, 3 tabs ->->->, and so on.

Also, here is what the flattened text would look like when a title (section) has multiple subtitles (subsections):

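1a->Title->->Subtitle->->->Description
1b->Title->->Subtitle A->->->Description
1b->Title->->Subtitle B->->->Description
2->Title->->Subtitle A->->->Description
2->Title->->Subtitle B->->->Description
2->Title->->Subtitle C->->->Description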

Basically, the parent (number/title) is just "reused" for each child (subtitle).

score 2 · Accepted Answer

You can do this without a regex:

txt='''\
1\tDemo
\t\tExample
\t\t\tThis is the description text body that I am
\t\t\ttrying to capture with regex.
\t\tSep
\t\t\tAnd Another Section
\t\t\tOn two lines
'''

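cap = []  # completed description blocks
buf = []  # lines of the block currently being collected
for line in txt.splitlines():
    if line.startswith('\t\t\t'):
        buf.append(line.strip())
        continue
    # A line with fewer than three tabs ends the current block.
    if buf:
        cap.append(' '.join(buf))
        buf = []
# Flush the last block if the text ends on a description line.
if buf:
    cap.append(' '.join(buf))

print(cap)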

Prints:

['This is the description text body that I am trying to capture with regex.', 
 'And Another Section On two lines']

The advantage is that separate blocks, each indented by 3 tabs, remain separable.
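As an aside, the same grouping can be sketched with itertools.groupby from the standard library. This is an alternative to the loop above, not part of the original answer; the function name description_blocks is made up for illustration, and it assumes descriptions are exactly the lines that start with three tabs.

```python
from itertools import groupby


def description_blocks(text):
    """Join each run of consecutive 3-tab lines into one string."""
    blocks = []
    # groupby yields one group per run of lines sharing the same key,
    # so adjacent description lines end up in the same group.
    for is_desc, lines in groupby(text.splitlines(),
                                  key=lambda l: l.startswith('\t\t\t')):
        if is_desc:
            blocks.append(' '.join(l.strip() for l in lines))
    return blocks


txt = (
    "1\tDemo\n"
    "\t\tExample\n"
    "\t\t\tThis is the description text body that I am\n"
    "\t\t\ttrying to capture with regex.\n"
    "\t\tSep\n"
    "\t\t\tAnd Another Section\n"
    "\t\t\tOn two lines\n"
)
print(description_blocks(txt))
```

Because a new group starts whenever the key changes, the "Sep" line splits the two descriptions just as the explicit buffer does.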

OK: here is a complete solution using regex:

txt='''\
1\tDemo
\t\tExample
\t\t\tThis is the description text body that I am
\t\t\ttrying to capture with regex.
2\tSecond Demo
\t\tAnother Section
\t\t\tAnd Another 3rd level Section
\t\t\tOn two lines
3\tNo section below
4\tOnly one level below
\t\tThis is that one level
'''

import re

# Each section runs from a line starting with a digit up to the next such line.
for ms in re.finditer(r'^(\d+.*?)(?=^\d|\Z)', txt, re.S | re.M):
    section = ms.group(1)
    # Deepest indentation (in tabs) found in this section; 0 if none.
    tm = [len(t) for t in re.findall(r'^\t+', section, re.M)]
    subsections = max(tm) if tm else 0
    sec = [re.search(r'(^\d+.*)', section).group(1)]
    for i in range(2, subsections + 1):
        # Collect the lines indented by exactly i tabs.
        lt = r'^{}([^\t]+)$'.format(r'\t' * i)
        level = re.findall(lt, section, re.M)
        sec.append(' '.join(s.strip() for s in level))

    print('->'.join(sec))

Prints:

1   Demo->Example->This is the description text body that I am trying to capture with regex.
2   Second Demo->Another Section->And Another 3rd level Section On two lines
3   No section below
4   Only one level below->This is that one level

Limitations:

1) This is limited to the format you described.
2) It will not handle reverse levels properly:
    1 Section 
         Second Level
             Third Level
         Second Level Again       <== This would be jammed in with 'second level'
    How would you handle multiple levels?

3) Won't handle multiline section headers:

    3    Like
         This

Running it on your full example:

1a  Title->Subtitle->Description Second Line of Description
1b  Title->Subtitle A Subtitle B->Description Description
2   Title->Subtitle A Subtitle B Subtitle C->Description Description Description

You can see that the second and third levels get joined together, but I don't know how you want to handle that format.
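For the flattened format from the question's update, where the number and title are repeated for every subtitle, a line-by-line pass may be simpler than a regex. The following is only a sketch under stated assumptions: the function name flatten is hypothetical, indentation uses real tab characters, the title shares a line with the number, and only rows that end in a description are emitted.

```python
def flatten(text, sep="->"):
    """Return one flattened row per description, repeating its parents."""
    rows = []
    number = title = subtitle = ""
    desc = []  # lines of the description currently being collected

    def flush():
        # Emit the pending description, if any, with its full parent chain.
        if desc:
            rows.append(number + sep + title + sep * 2 + subtitle
                        + sep * 3 + " ".join(desc))
            desc.clear()

    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        depth = len(line) - len(line.lstrip("\t"))  # count leading tabs
        if depth >= 3:
            desc.append(line.strip())  # description or its wrapped continuation
            continue
        flush()  # any shallower line ends the current description
        if depth == 0:
            # "1a<TAB>Title": a new section resets the subtitle.
            number, _, title = line.partition("\t")
            title = title.strip()
            subtitle = ""
        else:
            subtitle = line.strip()  # assumption: 1 or 2 tabs means a subtitle
    flush()
    return rows


sample = (
    "1a\tTitle\n"
    "\t\tSubtitle\n"
    "\t\t\tDescription\n"
    "1b\tTitle\n"
    "\t\tSubtitle A\n"
    "\t\t\tDescription\n"
    "\t\tSubtitle B\n"
    "\t\t\tDescription that wraps\n"
    "\t\t\tonto a second line\n"
)
rows = flatten(sample)
print("\n".join(rows))
```

Tracking the current number, title and subtitle as state is what lets a second "Subtitle Again" line start a fresh row instead of being merged into the first one.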

score 0 · Accepted Answer

How about this?

re.findall(r'(?m)((?:^\t{3}.*?\n)+)', doc)

It will also capture the tabs and newlines, but those can be removed afterwards.
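One possible way to do that clean-up, shown here as a sketch rather than as part of the original answer, is to split each captured block on whitespace and rejoin it with single spaces. The doc string below is just the example from the question with real tab characters.

```python
import re

doc = (
    "1\tDemo\n"
    "\t\tExample\n"
    "\t\t\tThis is the description text body that I am\n"
    "\t\t\ttrying to capture with regex.\n"
)

blocks = re.findall(r'(?m)((?:^\t{3}.*?\n)+)', doc)
# str.split() with no arguments splits on any run of whitespace,
# tabs and newlines included, so rejoining flattens each block.
cleaned = [' '.join(block.split()) for block in blocks]
print(cleaned)
```

Two caveats: this also collapses runs of spaces inside the text, and because the pattern requires a "\n" after every line, a final description line without a trailing newline would not be matched.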

score 0 · Accepted Answer

~~Using re, Python 2:~~

text = "yourtexthere"
lines = re.findall("\t{3}.+", text)

Without the tabs ("\t"):

text = "yourtexthere"
lines = [i[3:] for i in re.findall("\t{3}.+", text)]

To get the final output:

...
"\n".join(lines)

Fix:

It's still not great, but I'm working on it:

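import re
text = "..."
# Normalise 4-space indents to tabs, then keep lines indented by 2 or 3 tabs.
out = re.findall("\t{2,3}.+", text.replace("    ", "\t"))
fixed = []
sub = []
for i in out:
    if not i.startswith("\t" * 3):
        # A 2-tab line (subtitle) closes the current group of description lines.
        if sub:
            fixed.append(tuple(sub))
            sub = []
    else:
        sub.append(i)
if sub:
    fixed.append(tuple(sub))
print(fixed)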
