4

给定以下文本:

text = "Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character. In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2] She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5] It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"

我需要:

["Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.",
 "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2]",
 "She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5]",
 "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"]

我试过这个但它不起作用:

new_line = re.split('(?<=\.) |(([.?!](\[\d+\])+))\s', text)
print(new_line)

我得到的结果是这样的:

['Van der Weyden was preoccupied by commissioned\xa0portraiture\xa0towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.', None, None, None, "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers", '.[2]', '.[2]', '[2]', 'She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress', '.[3][4][5]', '.[3][4][5]', '[5]', "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"]
4

2 回答 2

2

您可以使用

re.findall(r'(?s)(.*?(?:\.|[.?!](?:\[\d+\])+))(?:\s+|\s*\Z)', text)

请参阅正则表达式演示详情

  • (?s)re.S- 与or相同re.DOTALL,跨行.匹配
  • (.*?(?:\.|[.?!](?:\[\d+\])+))- 第 1 组:
    • .*?- 尽可能少的零个或多个字符
    • (?:\.|[.?!](?:\[\d+\])+)- 一个点或一个./ ?/!以及[+ digit(s) + ]substring的一个或多个出现
  • (?:\s+|\s*\Z)- 一个或多个空格或零个或多个空格,后跟字符串结尾。

请参阅Python 演示

import re
text = "Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character. In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2] She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5] It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"
print( re.findall(r'(.*?(?:\.|[.?!](?:\[\d+\])+))(?:\s+|\s*\Z)', text, re.DOTALL) )

输出:

[
  'Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.',
  "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2]",
  'She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5]',
  "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"
]
于 2022-01-18T21:53:04.870 回答
0

您需要使用非捕获组 ( (?:...)) 或re.split将捕获的部分包含在输出中:

import re
new_line = re.split(r'(?<=\.) |(?:[.?!](?:\[\d+\])+)\s', text)
print(new_line)
于 2022-01-18T21:53:19.567 回答