

def process(document, pagewidth, margins, formats):
    res = []
    onlypw = []
    pwmarg = []
    count = 0
    marg = 0

    for segment in margins:

        for i in range(count, segment[0]):
            pass  # loop body not shown in the original post
        text = ''

        foundmargin = -1
        for i in range(segment[0], segment[1] + 1):
            marg = segment[2]
            text = text + '\n' + document[i].strip(' ')

        words = text.split()

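Note: segment[0] marks the beginning of the document and segment[1] simply marks the end of the document, in case you are wondering about the range. My problem is that when I split the text into words (at words = text.split()), my blank lines are not preserved. The output I should get is: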

      This is my substitute for pistol and ball. With a
      philosophical flourish Cato throws himself upon his sword; I
      quietly take to the ship. There is nothing surprising in
      this. If they but knew it, almost all men in their degree,
      some time or other, cherish very nearly the same feelings
      towards the ocean with me.

      There now is your insular city of the Manhattoes, belted
      round by wharves as Indian isles by coral reefs--commerce
      surrounds it with her surf.

But what I actually get is:


      This is my substitute for pistol and ball. With a
      philosophical flourish Cato throws himself upon his sword; I
      quietly take to the ship. There is nothing surprising in
      this. If they but knew it, almost all men in their degree,
      some time or other, cherish very nearly the same feelings
      towards the ocean with me. There now is your insular city of
      the Manhattoes, belted round by wharves as Indian isles by
      coral reefs--commerce surrounds it with her surf. 



2 Answers


Split on two or more newlines first, then split each paragraph into words:

import re

paragraphs = re.split(r'\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]


I used re.split() to handle paragraphs separated by more than 2 newlines; if paragraphs are always separated by exactly 2 newlines, a plain text.split('\n\n') will do.
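
As a minimal sketch of how the per-paragraph word lists could be turned back into text with the blank lines intact: the helper name `reflow` and its `width` parameter are hypothetical, and the standard-library `textwrap.fill` only stands in for whatever wrapping logic `process` actually applies.

```python
import re
import textwrap


def reflow(text, width=60):
    """Re-wrap each paragraph on its own so blank lines survive."""
    # Two or more consecutive newlines mark a paragraph boundary.
    paragraphs = re.split(r'\n\n+', text.strip('\n'))
    wrapped = []
    for paragraph in paragraphs:
        words = paragraph.split()
        if not words:  # skip paragraphs that hold only whitespace
            continue
        wrapped.append(textwrap.fill(' '.join(words), width=width))
    # Rejoin with exactly one blank line between paragraphs.
    return '\n\n'.join(wrapped)


sample = "one two\nthree\n\n\nfour five\nsix"
print(reflow(sample, width=20))
```

Because the words are split and joined per paragraph, the paragraph boundary is never passed through `split()`, which is where the blank lines were being lost.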

answered 2013-03-14T20:26:34.833


m = re.compile(r'(\S+|\n\n)')
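
One way this pattern might be used, sketched under assumptions: with `findall`, each word comes back as a token and each pair of newlines comes back as a `'\n\n'` token, so paragraph breaks can be detected while iterating. The names `token_re` and `split_paragraphs` and the grouping loop are illustrative and not part of the answer.

```python
import re

token_re = re.compile(r'(\S+|\n\n)')


def split_paragraphs(text):
    """Group word tokens into paragraphs, treating blank-line tokens as breaks."""
    paragraphs = [[]]
    for token in token_re.findall(text):
        if token == '\n\n':
            # Start a new paragraph, ignoring repeated break tokens.
            if paragraphs[-1]:
                paragraphs.append([])
        else:
            paragraphs[-1].append(token)
    # Drop the empty paragraph left behind by a trailing blank line.
    if not paragraphs[-1]:
        paragraphs.pop()
    return paragraphs


print(split_paragraphs("a b\n\nc d"))
```

Note that the pattern only matches newlines that are directly adjacent, so a "blank" line containing spaces would not be seen as a paragraph break unless the lines are stripped first.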
answered 2013-03-14T20:34:07.743