python - Split lines into paragraphs

Question

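Input: a list of lines

Output: a list of lists of lines, i.e. the input list split at (sequences of one or more) empty lines.

This is the least ugly solution I have come up with so far: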

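def split_at_empty(lines):
    paragraphs = []
    p = []
    def flush():
        # Append a copy, then clear p in place; rebinding p here
        # would make it local to flush() and raise UnboundLocalError
        if p:
            paragraphs.append(p[:])
        del p[:]
    for l in lines:
        if l:
            p.append(l)
        else:
            flush()
    flush()
    return paragraphs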

There must be a better solution (possibly even a functional one)! Anyone?

Sample input list:

['','2','3','','5','6','7','8','','','11']

Output:

[['2','3'],['5','6','7','8'],['11']]

score 2 · Accepted Answer

import re

ss =  '''Princess Maria Amelia of Brazil (1831–1853)


was the daughter of Dom Pedro I,
founder of Brazil's independence and its first emperor,

and Amelie of Leuchtenberg.



The only child from her father's second marriage,
Maria Amelia was born in France
following Pedro I's 1831 abdication in favor of his son Dom Pedro II.

Before Maria Amelia was a month old, Pedro I left for Portugal
to restore its crown to his eldest daughter Dona Maria II.
He defeated his brother Miguel I (who had usurped Maria II's throne),
only to die a few months later of tuberculosis.


'''

def select_lines(input,regx = re.compile('((?:^.+\n)+)',re.MULTILINE)):
    return [x.splitlines() for x in regx.findall(input)]

for sl in  select_lines(ss):
    print sl
    print

Result

['Princess Maria Amelia of Brazil (1831\x961853)']

['was the daughter of Dom Pedro I,', "founder of Brazil's independence and its first emperor,"]

['and Amelie of Leuchtenberg.']

["The only child from her father's second marriage,", 'Maria Amelia was born in France', "following Pedro I's 1831 abdication in favor of his son Dom Pedro II."]

['Before Maria Amelia was a month old, Pedro I left for Portugal', 'to restore its crown to his eldest daughter Dona Maria II.', "He defeated his brother Miguel I (who had usurped Maria II's throne),", 'only to die a few months later of tuberculosis.']

[['2', '3'], ['5', '6', '7', '8'], ['11']]
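
One caveat worth noting about the regex version: the pattern requires every line to end with `\n`, so a final line without a trailing newline appears to be silently dropped. Below is a minimal sketch of the problem and one possible workaround (appending a newline before matching). The `text` parameter name and the sample string are my own choices, not part of the answer.

```python
import re

# Same pattern as above, written as a raw string
regx = re.compile(r'((?:^.+\n)+)', re.MULTILINE)

def select_lines(text):
    # Each match is a run of consecutive non-empty lines
    return [x.splitlines() for x in regx.findall(text)]

text = 'a\nb\n\nc'                  # no newline after the last line
dropped = select_lines(text)         # [['a', 'b']], the 'c' paragraph is lost
kept = select_lines(text + '\n')     # [['a', 'b'], ['c']]
```

If the input comes from `'\n'.join(lines)`, appending the extra newline is probably the simplest way to keep the last paragraph.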

Another way, operating on the list directly:

li = [ '', '2', '3', '', '5', '6', '7', '8', '', '', '11']

lo = ['5055','','','2','54','87','','1','2','5','8','','']

lu = ['AAAAA','BB','','HU','JU','GU']

def selines(L):
    ye = []
    for x in L:
        if x:
            ye.append(x)
        elif ye:
            yield ye
            ye = []
    if ye:
        yield ye



for lx in (li,lo,lu):
    print lx
    print list(selines(lx))
    print

Result

['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
[['2', '3'], ['5', '6', '7', '8'], ['11']]

['5055', '', '', '2', '54', '87', '', '1', '2', '5', '8', '', '']
[['5055'], ['2', '54', '87'], ['1', '2', '5', '8']]

['AAAAA', 'BB', '', 'HU', 'JU', 'GU']
[['AAAAA', 'BB'], ['HU', 'JU', 'GU']]

score 2 · Accepted Answer

Slightly uglier than the original:

def split_at_empty(lines):
    r = [[]]
    for l in lines:
        if l:
            r[-1].append(l)
        else:
            r.append([])
    return [l for l in r if l]

(The last line removes the empty lists that would otherwise be added.)
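
To see why that final filter is needed, here is a small sketch that exposes the intermediate list for the sample input from the question. The helper name `split_keep_empty` is made up for illustration only.

```python
def split_keep_empty(lines):
    # Same loop as split_at_empty, but without the final filtering step
    r = [[]]
    for l in lines:
        if l:
            r[-1].append(l)
        else:
            r.append([])
    return r

sample = ['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
raw = split_keep_empty(sample)
# raw == [[], ['2', '3'], ['5', '6', '7', '8'], [], ['11']]
cleaned = [l for l in raw if l]
# cleaned == [['2', '3'], ['5', '6', '7', '8'], ['11']]
```

A leading empty line or a run of several empty lines each leaves an empty list behind, which is what the list comprehension discards.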

score 1 · Accepted Answer

For the list-comprehension obsessed...

def split_at_empty(L):
    return [L[start:end+1] for start, end in zip(
        [n for n in xrange(len(L)) if L[n] and (n == 0 or not L[n-1])],
        [n for n in xrange(len(L)) if L[n] and (n+1 == len(L) or not L[n+1])]
        )]

Or better:

def split_at_empty(lines):
    L = [i for i, a in enumerate(lines) if not a]
    return [lines[s + 1:e] for s, e in zip([-1] + L, L + [len(lines)]) 
            if e > s + 1]
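
The second version is easier to follow when the intermediate values are spelled out. This is just a step-by-step trace of the same idea on the question's sample input; the names `pairs` and `kept` are mine.

```python
lines = ['', '2', '3', '', '5', '6', '7', '8', '', '', '11']

# Indices of the empty lines
L = [i for i, a in enumerate(lines) if not a]
# L == [0, 3, 8, 9]

# Pair each separator with the next one, using -1 and len(lines) as sentinels
pairs = list(zip([-1] + L, L + [len(lines)]))
# pairs == [(-1, 0), (0, 3), (3, 8), (8, 9), (9, 11)]

# Keep only pairs with at least one line between them
kept = [(s, e) for s, e in pairs if e > s + 1]
# kept == [(0, 3), (3, 8), (9, 11)]

result = [lines[s + 1:e] for s, e in kept]
# result == [['2', '3'], ['5', '6', '7', '8'], ['11']]
```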

score 0 · Accepted Answer

You can join the list into a single string and then split it again:

>>> a = ['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
>>> [x.strip().split(' ') for x in ' '.join(a).split('  ')]
[['2', '3'], ['5', '6', '7', '8'], ['11']]

You should probably use a regular expression to catch any number of spaces (I added another empty entry before '11' here):

>>> import re
>>> pat = re.compile(r'\s{2,}')
>>> a = ['', '2', '3', '', '5', '6', '7', '8', '', '', '', '11']
>>> [x.strip().split(' ') for x in pat.split(' '.join(a))]
[['2', '3'], ['5', '6', '7', '8'], ['11']]
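
One limitation to keep in mind, as far as I can tell: because the list is joined on a single space, any line that itself contains a space gets broken into pieces. A short sketch of that behaviour follows; the wrapper name `split_via_join` and the second sample list are assumptions for illustration.

```python
import re

pat = re.compile(r'\s{2,}')

def split_via_join(lines):
    # Join on one space, then split wherever two or more spaces meet
    return [x.strip().split(' ') for x in pat.split(' '.join(lines))]

simple = ['', '2', '3', '', '5', '6', '7', '8', '', '', '', '11']
ok = split_via_join(simple)
# ok == [['2', '3'], ['5', '6', '7', '8'], ['11']]

with_spaces = ['hello world', 'foo', '', 'bar']
broken = split_via_join(with_spaces)
# broken == [['hello', 'world', 'foo'], ['bar']]
# 'hello world' was one line in the input but comes back as two items
```

So this trick seems safe only when the lines are known to be free of whitespace.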

score 0 · Accepted Answer

Here is a generator-based solution:

def split_at_empty(lines):
    # Start from -1 (not 0) so a non-empty first line is not skipped
    sep = [-1] + [i for (i, l) in enumerate(lines) if not l] + [len(lines)]
    for start, end in zip(sep[:-1], sep[1:]):
        if start + 1 < end:
            yield lines[start+1:end]

For your input:

l = ['' , '2' , '3' , '' , '5' , '6' , '7' , '8' , '' , '' , '11']
for para in split_at_empty(l):
   print para

it produces

['2', '3']
['5', '6', '7', '8']
['11']
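
Since the question asks for something more functional, one more option worth sketching is `itertools.groupby` with `bool` as the key. This is not taken from any of the answers above; it is a minimal sketch, and the function name `split_with_groupby` is my own.

```python
from itertools import groupby

def split_with_groupby(lines):
    # groupby clusters consecutive items that share the same key;
    # bool(line) is False for '' and True for any non-empty string,
    # so each True group is one paragraph
    return [list(group) for nonempty, group in groupby(lines, key=bool) if nonempty]

sample = ['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
paragraphs = split_with_groupby(sample)
# paragraphs == [['2', '3'], ['5', '6', '7', '8'], ['11']]

no_leading_blank = split_with_groupby(['a', 'b', '', 'c'])
# no_leading_blank == [['a', 'b'], ['c']]
```

Runs of several empty lines collapse into a single False group, so no empty paragraphs need to be filtered out afterwards.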
