python - Why doesn't this regex exclude the line when the string has a number at the end of the line

Question

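I'm trying to scan documents and determine where the sections of each document begin and end. Sometimes a document has a table of contents (TOC) that lists page numbers. I don't want to capture the TOC, because it doesn't identify a section of the document. I've been messing with this for a while and am stuck on one thing: I can't seem to avoid capturing the TOC lines that end in a number.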

Here is the regex

verbose_item_pattern_3 = re.compile(r"""
  ^            # begin match at newline
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blank space
  I            # a capital I
  [tT][eE][mM] # one character from each of the three sets this allows for unknown case
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blankspace
  \d{1,2}      # 1-or-2 digits
  [.]?         # 0-or-1 literal .
  \(?          # 0-or-1 literal open paren
  [a-e]?       # 0-or-1 letter in the range a-e
  \)?          # 0-or-1 closing paren
  .*           # any number of unknown characters so we can have words and punctuation
  [^0-9]       # anything but [0-9]
  $           # 1 newline character
  """, re.VERBOSE|re.MULTILINE)

Here is an example of a line I do not want to capture

test_string='\nItem 6.       TITLE ITEM 6..................................................25\n'

Here is an example of a line I do want to capture

test_string='\nItem 6.       TITLE ITEM 6 maybe other words here who knows  \n'

But when I run

re.findall(verbose_item_pattern_3,test_string)

the result is

['Item 6.       TITLE ITEM 6..................................................25\n']

What is interesting to me is that if my test string is this

test_string='PART I\nItem 1.       TITLE ITEM 1...................................................1\nItem 2.       TITLE ITEM 2..................................................21\n'

and I run it with re.findall(verbose_item_pattern_3,test_string)

the result is closer to what I want, but still not correct

['Item 2.       TITLE ITEM 2..................................................21\n']

Nothing should have been captured.

score 2 · Accepted Answer

Your regex matches because of three things.

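Most of it is optional, so it is very unspecific.
There is a .* that eats the rest of the line, so your last condition, [^0-9], never gets tested against the line's final digit. That is because:
The newline character itself satisfies [^0-9], so [^0-9] can match successfully even when the line ends in a digit.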

The minimal change is to use a negative look-behind at the end:

verbose_item_pattern_3 = re.compile(r"""
  ^            # start-of-line
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blank space
  I            # a capital I
  [tT][eE][mM] # one character from each of the three sets this allows for unknown case
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blankspace
  \d{1,2}      # 1-or-2 digits
  [.]?         # 0-or-1 literal .
  \(?          # 0-or-1 literal open paren
  [a-e]?       # 0-or-1 letter in the range a-e
  \)?          # 0-or-1 closing paren
  .*           # any number of unknown characters so we can have words and punctuation
  $            # end-of-line
  (?<![0-9])   # NOT preceded by a decimal digit (via look-behind)
  """, re.VERBOSE|re.MULTILINE)

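Note that neither ^ nor $ actually matches a newline character. They match the position after (^) or before ($) a newline. The newline itself is never part of the match.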

For that reason, I have changed their comments to the more accurate start-of-line and end-of-line.
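To make the zero-width behaviour concrete, here is a minimal sketch. The string sample and the names starts and ends are illustrative only, not from the original post. It lists the spans that ^ and $ match under re.MULTILINE:

```python
import re

sample = "ab\ncd"  # hypothetical two-line string, no trailing newline

# Every match is empty (start == end): the anchors consume no characters.
starts = [m.span() for m in re.finditer(r"^", sample, re.MULTILINE)]
ends = [m.span() for m in re.finditer(r"$", sample, re.MULTILINE)]

print(starts)  # [(0, 0), (3, 3)]
print(ends)    # [(2, 2), (5, 5)]
```

The newline sits at index 2. $ matches just before it and ^ just after it, but neither span includes it.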

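Also note that look-arounds work even after $. Placing the look-behind there helps prevent backtracking, which makes the regex faster.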
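As a quick check, the look-behind version can be run against the two sample lines from the question. This is a sketch: the variable names toc_line and section_line are made up here, while the pattern and strings are copied from the post.

```python
import re

verbose_item_pattern_3 = re.compile(r"""
  ^            # start-of-line
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blank space
  I            # a capital I
  [tT][eE][mM] # t, e, m in either case
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blankspace
  \d{1,2}      # 1-or-2 digits
  [.]?         # 0-or-1 literal .
  \(?          # 0-or-1 literal open paren
  [a-e]?       # 0-or-1 letter in the range a-e
  \)?          # 0-or-1 closing paren
  .*           # the rest of the line
  $            # end-of-line
  (?<![0-9])   # NOT preceded by a decimal digit (via look-behind)
  """, re.VERBOSE | re.MULTILINE)

# Sample strings from the question (variable names are hypothetical).
toc_line = '\nItem 6.       TITLE ITEM 6..................................................25\n'
section_line = '\nItem 6.       TITLE ITEM 6 maybe other words here who knows  \n'

print(verbose_item_pattern_3.findall(toc_line))      # []
print(verbose_item_pattern_3.findall(section_line))
# ['Item 6.       TITLE ITEM 6 maybe other words here who knows  ']
```

The match for the section line stops before the trailing newline, because $ only marks a position.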

score 2 · Accepted Answer

If I understand correctly, you expect your example string not to match, because the last character on the line is a digit and your regex ends with [^0-9]$.

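The reason this does not behave as expected is that $ matches just before a \n, but also at the very end of the string. What ends up happening here is that .* matches through the digits, [^0-9] then matches the \n, and $ matches at the end of the string. Consider the following example, which uses capturing groups to show how this works: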

>>> re.match(r'(.*)([^0-9])$', '...12\n').groups()
('...12', '\n')
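The same mechanism appears to explain the second test string in the question, where only the last table-of-contents line came back. A minimal sketch, assuming a shortened stand-in for the question's pattern (tail_pattern and the abbreviated text are illustrative, not from the original post):

```python
import re

# Shortened stand-in for the question's pattern: same ".*[^0-9]$" tail,
# same re.MULTILINE flag. The name tail_pattern is hypothetical.
tail_pattern = re.compile(r"^(Item.*)([^0-9])$", re.MULTILINE)

# Abbreviated version of the question's second test string.
text = "PART I\nItem 1. TITLE.....1\nItem 2. TITLE....21\n"

captured = [m.groups() for m in tail_pattern.finditer(text)]
print(captured)  # [('Item 2. TITLE....21', '\n')]
```

On the Item 1 line, [^0-9] can consume the \n, but the position after it is followed by the I of the next line, so $ fails there. On the last line, the position after the \n is the end of the string, so $ succeeds and the line is captured together with its newline.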

To fix this, you can prevent [^0-9] from matching a newline by changing it to [^0-9\n]:

verbose_item_pattern_3 = re.compile(r"""
  ^            # start of line (a position, not a newline character)
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blank space
  I            # a capital I
  [tT][eE][mM] # one character from each of the three sets this allows for unknown case
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blankspace
  \d{1,2}      # 1-or-2 digits
  [.]?         # 0-or-1 literal .
  \(?          # 0-or-1 literal open paren
  [a-e]?       # 0-or-1 letter in the range a-e
  \)?          # 0-or-1 closing paren
  .*           # any number of unknown characters so we can have words and punctuation
  [^0-9\n]     # anything but [0-9] and line breaks
  $            # end of line (a position, not a newline character)
  """, re.VERBOSE|re.MULTILINE)

Example (using the regex above):

>>> verbose_item_pattern_3.findall('\nItem 6.       TITLE ITEM 6.....25\n')
[]
>>> verbose_item_pattern_3.findall('\nItem 6.       TITLE ITEM 6.....\n')
['Item 6.       TITLE ITEM 6.....']
