python - Parsing several FQDNs from a string

Question

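Given a main domain, I am trying to extract it and its subdomains from a string.
For example, for the main domain example.co I want to:

Extract only the main domain and its subdomains - example.co, www.example.co, uat.smile.example.co
Not pick up names that extend further to the right - no www.example.com, www.example.co.nz
Treat any whitespace or punctuation that is not legal in an FQDN as a separator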

Currently I am getting unwanted items from:
example.com
example.co.nz
Also, test-me.www.example.co comes back with its trailing space included.

>>> domain = 'example\.co'
>>> line = 'example.com example.co.nz www.example.co. test-me.www.example.co bad.example-co.co'
>>> re.findall("[^\s\',]*{}[\s\'\,]*".format(domain), line)
['example.co', 'example.co', 'www.example.co', 'test-me.www.example.co ']

Should I be using a regex at all? If so, guidance on working through this would be much appreciated.
Otherwise, is there a better tool for the job?

Edit - I tried Marc Lambrichs' answer, but it fails for the following case:

import re

pattern = r"((?:[a-zA-Z][\w-]+\.)+{}(?!\w))"
domain = 'google.com'
line = 'google.com mail is handled by 20 alt1.aspmx.l.google.com.'
results = re.findall(pattern.format(re.escape(domain)), line)
print(results)
[]

Also, I would like to pass a string like 'google.com' instead of 'google\.com' and have re do the escaping, but the code with re.escape(domain) returns an empty list either way.

score 2 · Accepted Answer

You can use a regex for this, without doing any splitting.

$ cat test.py
import re

# Each key is a main domain; each value is the text to search.
tests = {'example.co': 'example.com example.co.nz www.example.co. test-me.www.example.co bad.example-co.co',
         'google.com': 'google.com mail is handled by 20 alt1.aspmx.l.google.com.'}

# Zero or more "label." prefixes, then the domain, not followed by a word character.
pattern = r"((?:[a-zA-Z][-\w]*\.)*{}(?!\w))"

for domain, line in tests.items():
    # re.escape makes the dots in the domain match literally.
    results = re.findall(pattern.format(re.escape(domain)), line)
    print(results)

Result:

$ python test.py
['example.co', 'www.example.co', 'test-me.www.example.co']
['google.com', 'alt1.aspmx.l.google.com']
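
As far as I can tell, the empty list in the question's edit comes from the two quantifiers rather than from re.escape. In that pattern `[\w-]+` requires every label to have at least two characters, so the one-letter label `l` cannot match, and the outer `+` requires at least one label in front of the domain, so a bare google.com cannot match either. A small side-by-side sketch (the variable names are only for illustration):

```python
import re

line = 'google.com mail is handled by 20 alt1.aspmx.l.google.com.'
domain = re.escape('google.com')

# Pattern from the question's edit: labels need 2+ characters ([\w-]+)
# and at least one label must precede the domain (outer +).
question_pattern = r"((?:[a-zA-Z][\w-]+\.)+{}(?!\w))"
# Pattern from this answer: both quantifiers relaxed to *.
answer_pattern = r"((?:[a-zA-Z][-\w]*\.)*{}(?!\w))"

question_results = re.findall(question_pattern.format(domain), line)
answer_results = re.findall(answer_pattern.format(domain), line)

print(question_results)  # []
print(answer_results)    # ['google.com', 'alt1.aspmx.l.google.com']
```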

Explanation of the regex

(                  # group 1 start
  (?:              # non-capture group
     [a-zA-Z]      # rfc 1034. start subdomain with a letter
     [\w-]*\.      # 0 or more word chars or '-', followed by '.'
  )*               # repeat this non-capture group 0 or more times
  example.co       # match the domain
  (?!\w)           # negative lookahead: no following word char allowed.
)                  # group 1 end
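
One caveat: the `(?!\w)` lookahead only rejects a following word character, so the bare 'example.co' in the result above appears to be the front of example.co.nz, which the question wanted to skip. The pattern can also start matching in the middle of a longer label such as notexample.co. Below is a rough sketch of a stricter variant, assuming a trailing dot followed by another label means the name continues to the right. The helper name `find_fqdns` and the extra lookarounds are my own additions, not part of the pattern explained above.

```python
import re

def find_fqdns(domain, line):
    """Return `domain` and its subdomains found in `line` (hypothetical helper)."""
    pattern = (
        r"(?<![\w.-])"                # must not start in the middle of a longer name
        r"(?:[a-zA-Z][-\w]*\.)*"      # zero or more subdomain labels, as in the answer
        + re.escape(domain) +         # the main domain, dots matched literally
        r"(?![\w-]|\.[a-zA-Z0-9])"    # no more label characters, and no further label after a dot
    )
    return re.findall(pattern, line)

line = 'example.com example.co.nz www.example.co. test-me.www.example.co bad.example-co.co'
print(find_fqdns('example.co', line))
# ['www.example.co', 'test-me.www.example.co']
```

A trailing dot on its own (as in 'www.example.co.') is still accepted, since only a dot followed by a letter or digit counts as an extension to the right.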
