python - Extracting exactly four-digit integers from strings with a regular expression

Question

list1 = ['Contact: Hamdan Z Hamdan, MBBS, Msc',
        '\r\n            ',
        '+249912468264',
        '\r\n                  ',
        'hamdanology@hotmail.com',
        '\r\n                ',
        'Contact: Maha I Mohammed, MBBS, PhD',
        '\r\n            ',
        '+249912230895',
        '\r\n                  ',
        '\r\n                ',
        'Sudan',
        'Jaber abo aliz',
        '\r\n                  ',
        'Recruiting',
        '\r\n          ',
        'Khartoum, Sudan, 1111  ',
        u'Contact: Khaled H Bakheet, MD,PhD \xa0 \xa0 +249912957764 \xa0 \xa0 ',
        'khalid2_3456@yahoo.com',
        u' \xa0 \xa0 ',
        u'Principal Investigator: Hamdan Z Hamdan, MBBS,MSc \xa0 \xa0  \xa0 \xa0  \xa0 \xa0 ',
        'Principal Investigator:',
        '\r\n      ',
        'Hamdan Z Hamdan, MBBS, MSc',
        '\r\n            ',
        'Al-Neelain University',
        '\r\n                '
    ]

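From this list of strings, I need to extract only the 4-digit integers that are not attached to any other characters.

Example: "1111" is the only desired output.

How should the regular expression be written in Python? Obviously, this does not work: *([\d]{4})*.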

score 6 · Accepted Answer

You can use \b in your regular expression to indicate a word boundary, so the following should work for you:

import re

for s in list1:
    m = re.search(r'\b\d{4}\b', s)
    if m:
        print(m.group(0))

...which outputs just 1111. The documentation for \b explains further:

\b

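Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. [...]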
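
To see what that definition means in practice, here is a small sketch. The sample strings and the helper name four_digit_words are made up for illustration (only some fragments are borrowed from list1). Since letters and the underscore count as word characters, there is no boundary between them and a digit, while punctuation and whitespace do create one.

```python
import re

# \b\d{4}\b: exactly four digits with a word boundary on each side.
pattern = re.compile(r"\b\d{4}\b")

def four_digit_words(text):
    """Return every run of exactly four digits delimited by word boundaries."""
    return pattern.findall(text)

# Hypothetical sample strings, chosen to show where \b does and does not match.
samples = [
    "Khartoum, Sudan, 1111  ",   # spaces on both sides: boundary on each side
    "khalid2_3456@yahoo.com",    # '_' is a word character: no boundary before '3'
    "+249912468264",             # twelve digits in a row: no four-digit word
    "abc1234",                   # letter directly before the digits: no boundary
    "(2020)",                    # parentheses are non-word characters: boundary
]

for text in samples:
    print(repr(text), four_digit_words(text))
```

Note that this is why 3456 in the e-mail address is skipped: the underscore in front of it is part of the same "word".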

score 3 · Accepted Answer

You can try the following:

>>> [l for l in (re.findall(r"[^\d](\d{4})[^\d]",s) for s in list1) if l]
[['1111'], ['3456']]

If you are only interested in four-digit numbers delimited by word boundaries:

>>> [l for l in (re.findall(r"\b\d{4}\b",s) for s in list1) if l]
[['1111']]
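
One caveat with the [^\d] version: it needs an actual non-digit character on both sides, so it misses a number sitting at the very start or end of a string. A possible variant, not part of the original answer, is to use lookarounds instead, which only assert that no digit is adjacent and consume nothing. A rough sketch (the helper name isolated_four_digits and the short test strings are just for illustration):

```python
import re

# Negative lookbehind/lookahead: four digits with no digit immediately
# before or after. Unlike [^\d], these assertions also succeed at the
# start and end of the string.
isolated = re.compile(r"(?<!\d)\d{4}(?!\d)")

def isolated_four_digits(text):
    """Return runs of exactly four digits that are not part of a longer number."""
    return isolated.findall(text)

# The consuming pattern finds nothing when the digits are the whole string...
print(re.findall(r"[^\d](\d{4})[^\d]", "1111"))
# ...while the lookaround version still matches.
print(isolated_four_digits("1111"))

# Like the [^\d] version, it accepts digits next to '_' or '@',
# so use \b instead if those should be excluded.
print(isolated_four_digits("khalid2_3456@yahoo.com"))
print(isolated_four_digits("+249912468264"))
```

Which one to pick depends on whether "not attached to other characters" means "not attached to other digits" (lookarounds) or "not attached to any word character" (\b).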
