python - Regex to match lines that are continued by lines starting with a space?

Question

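I'm trying to use a regular expression to extract "entries" from a text file. Each line of the file is a separate entry, unless the line starts with a space, in which case it is a continuation of the previous line.

Example: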

import re

INPUT = """\
This is entry 1.
This
 is
  entry 2.
And this is entry 3.
This
 is
 entry
 4."""

OUTPUT = ["This is entry 1.",
          "This\n is\n  entry 2.",
          "And this is entry 3.",
          "This\n is\n entry\n 4."]

# What should the pattern be?
PATTERN = re.compile("(.+)(?=\n|$)")

assert PATTERN.findall(INPUT) == OUTPUT

What should PATTERN be in order to match all the entries?

score 0 · Accepted Answer

# Joins each continuation line onto the previous line with a single space.
OUTPUT = re.sub(r"[^\S\n]*\n[^\S\n]+", " ", INPUT).split("\n")

See this demo.
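
Note that the substitution above flattens each entry onto one line, which is not quite the OUTPUT shown in the question. If the embedded line breaks need to be kept, one possible alternative (a sketch, not part of this answer) is to split only on newlines that are not followed by a space or tab. The names `flattened` and `entries` are made up for illustration.

```python
import re

INPUT = "This is entry 1.\nThis\n is\n  entry 2.\nAnd this is entry 3.\nThis\n is\n entry\n 4."

# The answer's approach: continuation lines are joined with a single space.
flattened = re.sub(r"[^\S\n]*\n[^\S\n]+", " ", INPUT).split("\n")

# Alternative sketch: split on newlines NOT followed by a space or tab,
# so the line breaks inside each entry are preserved.
entries = re.split(r"\n(?![ \t])", INPUT)

print(flattened)
print(entries)
```

The negative lookahead `(?![ \t])` consumes nothing, so the leading whitespace of continuation lines stays in the result.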

answered 2012-11-26T20:44:37.217

score 0 · Accepted Answer

A regex that I tested in Java:

^\S[.\s\w\r\n]*?(?=\n\S|\Z)
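
Since the question is about Python, here is a minimal sketch of how the same pattern might be used there. It assumes multiline mode (`re.MULTILINE`), so that `^` matches at the start of every line; the answer does not say which flags were used in Java. The names `PATTERN` and `entries` are chosen for illustration.

```python
import re

INPUT = "This is entry 1.\nThis\n is\n  entry 2.\nAnd this is entry 3.\nThis\n is\n entry\n 4."

# Assumption: multiline mode, so ^ anchors at each line start.
# Inside the character class, "." is a literal dot, so an entry may only
# contain dots, whitespace and word characters.
PATTERN = re.compile(r"^\S[.\s\w\r\n]*?(?=\n\S|\Z)", re.MULTILINE)

entries = PATTERN.findall(INPUT)
print(entries)
```

Because the character class is so narrow, an entry containing other punctuation (a comma, for example) would not be matched as a whole.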

answered 2012-11-26T19:51:22.650

score 0 · Accepted Answer

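If we can rely on each sentence starting with a capital letter (and, as the pattern itself requires, ending with a period), I think a good way to solve this is the following regex: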

re.findall(r'\w[\w\s]+?\.', INPUT)

In practice, using your value of INPUT:

>>> re.findall(r'\w[\w\s]+?\.', INPUT)
['This is entry 1.', 'This\n is\n  entry 2.', 'And this is entry 3.', 'This\n is\n entry\n 4.']

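The regex I wrote has a \w right before the [\w\s]+? to make sure that each match starts at the beginning of a sentence rather than at the whitespace before it.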
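
To make the assumption concrete, here is a small sketch with a made-up input (`SAMPLE` is hypothetical, not from the question). It suggests that the pattern only works when an entry consists of word characters and whitespace and ends with a period; other punctuation inside an entry cuts the match short.

```python
import re

PATTERN = re.compile(r'\w[\w\s]+?\.')

# Hypothetical input: the first entry contains a comma.
SAMPLE = "Hello, world.\nThis\n is fine."

matches = PATTERN.findall(SAMPLE)
print(matches)  # ['world.', 'This\n is fine.']
```

The comma is not in `[\w\s]`, so "Hello" is dropped and the first match starts at "world".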

score 0 · Accepted Answer

In [92]: re.findall(r'(.+(?:\n\s.*)*)\n?', INPUT)
Out[92]: 
['This is entry 1.',
 'This\n is\n  entry 2.',
 'And this is entry 3.',
 'This\n is\n entry\n 4.']

In [93]: OUTPUT == re.findall(r'(.+(?:\n\s.*)*)\n?', INPUT)
Out[93]: True
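
For readability, the same pattern can be spelled out piece by piece. This is only a sketch using `re.VERBOSE`; the name `ENTRY` is made up here, and the pattern itself is unchanged from the one above.

```python
import re

INPUT = "This is entry 1.\nThis\n is\n  entry 2.\nAnd this is entry 3.\nThis\n is\n entry\n 4."

ENTRY = re.compile(r"""
    (                 # capture one whole entry
      .+              # the first line (no newline)
      (?: \n \s .* )* # any continuation lines: newline, whitespace, rest of line
    )
    \n?               # consume the newline that ends the entry, if any
""", re.VERBOSE)

entries = ENTRY.findall(INPUT)
print(entries)
```

With a single capturing group, `findall` returns the group contents, so the trailing newline matched by `\n?` is not part of each entry.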

score -1 · Accepted Answer

I figured it out.

The trick is ". (which does not match a newline) or a newline followed by whitespace".

PATTERN = re.compile(r"(?:.|\n\s)+")
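
A quick check of this pattern against the question's data, as a self-contained sketch. The `STRICT` variant is an assumption added here for the case where only spaces and tabs should count as continuation markers, since `\s` also matches other whitespace such as a second newline.

```python
import re

INPUT = "This is entry 1.\nThis\n is\n  entry 2.\nAnd this is entry 3.\nThis\n is\n entry\n 4."
OUTPUT = ["This is entry 1.",
          "This\n is\n  entry 2.",
          "And this is entry 3.",
          "This\n is\n entry\n 4."]

# Pattern from the answer: any non-newline character,
# or a newline that is followed by whitespace.
PATTERN = re.compile(r"(?:.|\n\s)+")

# Hypothetical stricter variant: only a space or tab continues an entry.
STRICT = re.compile(r"(?:.|\n[ \t])+")

result = PATTERN.findall(INPUT)
strict_result = STRICT.findall(INPUT)
print(result == OUTPUT, strict_result == OUTPUT)
```

The group is non-capturing, so `findall` returns the full text of each match.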
