python - First character is removed (regex)

Question

I have this regular expression: (?<=[.!?])\s[A-Z] and I am running it on this text:

The engineering plant, weapon and electronic systems, galley, and multitudinous other
equipment required to transform the new hull into an operating and habitable warship are
installed and tested. The prospective commanding officer, ship's officers, the petty
officers, and seamen who will form the crew report for training and intensive
familiarization with their new ship.

It produces:

he engineering plant, weapon and electronic systems, galley, and multitudinous other
equipment required to transform the new hull into an operating and habitable warship are
installed and tested.
he prospective commanding officer, ship's officers, the petty officers, and seamen who
will form the crew report for training and intensive familiarization with their new ship.

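As you can see, it removes the first letter of each sentence. This is not because the letters are uppercase (I tested that).

How can I fix it so that it does not remove the first letter of each sentence?

(I am using Python 3)

I used re.split() and then printed the resulting list, separating each value with a newline.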

score 2 · Accepted Answer

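Your regex matches a whitespace character followed by an uppercase ASCII letter, but only if they are preceded by a period, exclamation mark, or question mark.

When you use it to split the text, the uppercase letter becomes part of the delimiter used for splitting, and is therefore removed.

Change the regex to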

(?<=[.!?])\s(?=[A-Z])

and the letter will no longer be part of the match.
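The difference can be checked with a minimal sketch like the one below. The sample text is a shortened stand-in made up for illustration, not the exact text from the question, and the variable names are arbitrary.

```python
import re

# Shortened, made-up sample text (an assumption, not the question's exact input).
text = ("The engineering plant is installed and tested. "
        "The prospective commanding officer reports for training. "
        "Is the crew ready? Yes!")

# Original pattern: the capital letter is part of the match, so split() drops it.
consuming = re.compile(r"(?<=[.!?])\s[A-Z]")
# Suggested pattern: the capital letter is only asserted by the lookahead.
non_consuming = re.compile(r"(?<=[.!?])\s(?=[A-Z])")

broken = consuming.split(text)
fixed = non_consuming.split(text)

print("\n".join(broken))  # later sentences lose their first letter
print()
print("\n".join(fixed))   # every sentence keeps its first letter
```

With the second pattern only the whitespace is used as the delimiter, so nothing but the space between sentences is discarded.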

However, note two things:

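This only works if the new sentence starts with an uppercase ASCII letter. For most English sentences you will probably be fine, but certainly not for other languages.
If your text contains abbreviations, you may get some wrong splits: Mr. Smith and Dr. Jones would each be split into two parts.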
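Both caveats are easy to reproduce. The snippet below is only an illustrative sketch; the two sample strings are invented for the demonstration.

```python
import re

pattern = re.compile(r"(?<=[.!?])\s(?=[A-Z])")

# Abbreviations followed by a capitalised name are split as if they ended a sentence.
abbrev = pattern.split("Mr. Smith met Dr. Jones. They talked.")
print(abbrev)

# A sentence starting with a non-ASCII capital (here "É") is not in [A-Z],
# so no split happens at that boundary.
non_ascii = pattern.split("Il est parti. Écoute bien.")
print(non_ascii)
```

Handling either case reliably usually needs more than a single regex, for example a list of known abbreviations or a dedicated sentence tokenizer.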

score 1 · Accepted Answer

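The problem is with your regex. Oddly enough, you use a "non-consuming marker" (i.e., a positive lookbehind) for the punctuation mark ((?<=[.!?])), but not for detecting the first letter of each sentence ([A-Z]).

As a result, the regex you pass to split() consumes the first uppercase letter of every match. You probably meant not to consume it (i.e., to use only the space in between), in which case you want a positive lookahead, which does not consume text: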

(?<=[.!?])\s(?=[A-Z])

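Lookaheads and lookbehinds are, in general, anchors, and anchors do not consume any text from the input. The most commonly used anchors are of course ^ and $. They only match a position in the input text, which is what you want here.

A lookbehind matches a position where the text preceding that position must match (or not match) a given regex, while a lookahead matches a position where the text following that position must match (or not match) a given regex. After the matched whitespace, what you want is a position that is followed by an uppercase letter, hence a positive lookahead ((?=<re>), where <re> is a regular expression) that matches an uppercase letter (<re> being [A-Z]).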
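One way to see that the lookarounds consume nothing is to compare what each pattern actually matches. This is a small sketch on a made-up string, not taken from the question.

```python
import re

text = "It works. Now test it."

# Without the lookahead, the match covers the space and the capital letter.
consuming = re.search(r"(?<=[.!?])\s[A-Z]", text)
# With the lookahead, the match covers only the space; "." and "N" are merely checked.
zero_width = re.search(r"(?<=[.!?])\s(?=[A-Z])", text)

print(repr(consuming.group()), consuming.span())    # ' N' (9, 11)
print(repr(zero_width.group()), zero_width.span())  # ' ' (9, 10)
```

Since re.split() removes exactly the matched span, the one-character match leaves the "N" in place at the start of the next piece.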
