python - Python: regular expression to get repeated sets of numbers

Question

I am working with a file that is a Genbank entry (similar to this one).

My goal is to extract the numbers in the CDS line, for example:

    CDS             join(1200..1401,3490..4302)

But my regex should also be able to extract the numbers when they span multiple lines, like this:

     CDS            join(1200..1401,1550..1613,1900..2010,2200..2250,
                 2300..2660,2800..2999,3100..3333)

I am using this regular expression:

     import re
     match=re.compile('\w+\D+\W*(\d+)\D*')
     result=match.findall(line)
     print(result)

This gives me the correct numbers, but it also gives me numbers from the rest of the file, such as

 gene            complement(3300..4037)

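So how can I change my regex so that it only gets the CDS numbers? I am supposed to use only a regex for this.

I will use these numbers to print the coding part of the base sequence.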

score 1 · Accepted Answer

You can use Matthew Barnett's much-improved regex module (it provides the \G anchor). With it, you could come up with the following code:

import regex as re
rx = re.compile(r"""
            (?:
                CDS\s+join\(    # look for CDS, followed by whitespace and join(
                |               # OR
                (?!\A)\G        # make sure it's not the start of the string and \G 
                [.,\s]+         # followed by ., or whitespace
            )
            (\d+)               # capture these digits
                """, re.VERBOSE)

string = """
         CDS            join(1200..1401,1550..1613,1900..2010,2200..2250,
                     2300..2660,2800..2999,3100..3333)
"""

numbers = rx.findall(string)
print(numbers)
# ['1200', '1401', '1550', '1613', '1900', '2010', '2200', '2250', '2300', '2660', '2800', '2999', '3100', '3333']

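\G makes sure the regex engine looks for the next match exactly where the previous match ended.
See a demo on regex101.com (set to PHP, because the emulator does not offer the same functionality for Python [it uses the original re module]).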
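If installing the third-party module is not an option, the "continue where the last match ended" idea can be approximated with the standard re module, because a compiled pattern's match(string, pos) only matches starting at pos. The following is a minimal sketch under that assumption, not a drop-in replacement for \G; the names START, NEXT_NUMBER and cds_numbers are made up for illustration.

```python
import re

# Hypothetical helper patterns (names are illustrative only).
START = re.compile(r'CDS\s+join\(')         # opening of a CDS join(...) feature
NEXT_NUMBER = re.compile(r'[.,\s]*(\d+)')   # optional separators, then digits


def cds_numbers(text):
    """Collect the numbers inside every CDS join(...) block in text."""
    numbers = []
    for start in START.finditer(text):
        pos = start.end()
        while True:
            # match() is anchored at pos, so each number must begin
            # right where the previous one ended (similar to \G).
            m = NEXT_NUMBER.match(text, pos)
            if m is None:
                break  # reached ')' or anything that is not a separator/digit
            numbers.append(m.group(1))
            pos = m.end()
    return numbers


sample = """
     CDS            join(1200..1401,1550..1613,
                 2300..2660,2800..2999)
     gene            complement(3300..4037)
"""
print(cds_numbers(sample))
```

The loop stops at the closing parenthesis, so the numbers on the gene line are never reached.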

A much worse solution (if you are only allowed to use the re module) would be to use lookarounds:

(?<=[(.,\s])(\d+)(?=[,.)])

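(?<=) is a positive lookbehind, while (?=) is a positive lookahead. See a demo of this approach on regex101.com. Note, though, that there might be some false positives.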
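To make the false-positive caveat concrete, here is a small sketch that runs the lookaround pattern with the standard re module on a made-up snippet containing both a CDS feature and a gene feature. The sample text and variable names are assumptions for illustration.

```python
import re

# The lookaround pattern from above, compiled with the standard re module.
pattern = re.compile(r'(?<=[(.,\s])(\d+)(?=[,.)])')

# Made-up sample: one multi-line CDS feature followed by a gene feature.
text = """
     CDS            join(1200..1401,1550..1613,
                 2300..2660)
     gene            complement(3300..4037)
"""

found = pattern.findall(text)
print(found)
# ['1200', '1401', '1550', '1613', '2300', '2660', '3300', '4037']
```

The last two entries come from the gene line: the lookarounds only inspect the neighbouring characters, so they cannot tell which feature a number belongs to.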

score 0 · Accepted Answer

The following re pattern might work:

>>> match = re.compile(r'\s+CDS\s+\w+\([^\)]*\)')

However, you need to call findall on the whole body of text, not just one line at a time.

You can use parentheses to extract just the numbers:

>>> match = re.compile(r'\s+CDS\s+\w+\(([^\)]*)\)')
>>> match.findall(stuff)
['1200..1401,3490..4302']       # Numbers only
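The captured text is still a single string of ranges. One possible second step, sketched here with hypothetical names (CDS_BLOCK, RANGE, cds_ranges), splits it into integer (start, end) pairs. It assumes a single level of parentheses, so something like complement(join(...)) would not be handled.

```python
import re

# Hypothetical names; the first pattern is the one shown above.
CDS_BLOCK = re.compile(r'\s+CDS\s+\w+\(([^)]*)\)')
RANGE = re.compile(r'(\d+)\.\.(\d+)')   # one "start..end" pair


def cds_ranges(text):
    """Return (start, end) integer pairs for every CDS feature in text."""
    ranges = []
    for block in CDS_BLOCK.findall(text):
        # block is the text between the parentheses, possibly multi-line
        ranges.extend((int(a), int(b)) for a, b in RANGE.findall(block))
    return ranges


stuff = """
     CDS            join(1200..1401,1550..1613,
                 2300..2660)
     gene            complement(3300..4037)
"""
print(cds_ranges(stuff))
# [(1200, 1401), (1550, 1613), (2300, 2660)]
```

Because [^)]* also matches newlines, the multi-line case needs no special handling as long as the whole text is passed in at once.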

Let me know if this does what you want!
