
I'm trying to extract domain.zz, domain.zzz, or domain.zz.zz, along with /something when it is present.

import re
the_string = """lalalla?url=http2F%2Fdomain.zz%slgkfgs0s"""
the_string = """lalalla?url=http2F%2Fdomain.zz.zz/something%slgkfgs0sf"""
the_string = """lalalla?url=randomh564domain.zzz/something%slgkfgs0sf"""
the_string = """lalalla?url=randomeefsdlk876%domain.zz/something%slgkfgs0sf"""
the_string = """p%3A%2F%2Fdummy_test.com/ratata%2F&amp"""
the_string = """p%3A%2F%2Fdum2test.co.uk/something%2F&-kj"""

This is what I have so far:

>>> print(re.findall(r'(?:www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4})(?:/[a-z0-9]+)', the_string))
domain.zzz/something
domain.zz/something
domain.zz.zz/something

>>> print(re.findall(r'www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}', the_string))
domain.zzz
domain.zz
domain.zz.zz

I would like to capture both groups with a single pattern.

Edit: this one is almost perfect: '([a-z0-9.-]+[.][a-z]{2,4})|(?:/[a-z0-9]+)', but it also picks up some junk from the start of the string.

The real strings are far more random than in these examples. The three cases I care about are:

domain.co.uk/something
      ^  ^  ^
domain.com/something
      ^   ^
domain.com
      ^   

2 Answers


Try this. I'm not sure it does exactly what you want, but if something is off, clarify the question and the pattern can be refined further.

print(re.findall(r'=(?:[^@%/.]*(?:@|%(?:2F)?))?(?:www\.)?(?P<domain>[^%@/]*)(?:/(?P<folder>[^%]*))?(?:[%@/].*)?$', the_string, re.MULTILINE))

If you prefer, switch to re.finditer (or re.search) and read the named groups with match.group('domain') and match.group('folder').

Output:

[('domain.zz', ''), ('domain.zz.zz', 'something'), ('randomh564domain.zzz', 'something'), ('domain.zz', 'something'), ('domain.zz.zz', 'something'), ('domain.zzz', 'something')]
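
As a minimal sketch of the named-group access mentioned above: the pattern is the one from this answer, with the dot after "www" escaped. The helper name extract_parts and the sample string are assumptions chosen for illustration, and only single-line inputs are exercised here.

```python
import re

# Pattern from the answer above; the dot after "www" is escaped so it
# matches a literal dot rather than any character.
PATTERN = re.compile(
    r'=(?:[^@%/.]*(?:@|%(?:2F)?))?(?:www\.)?'
    r'(?P<domain>[^%@/]*)(?:/(?P<folder>[^%]*))?(?:[%@/].*)?$',
    re.MULTILINE,
)

def extract_parts(text):
    """Return a list of (domain, folder) tuples, one per match.

    extract_parts is a hypothetical helper name, not part of the answer.
    """
    results = []
    for match in PATTERN.finditer(text):
        # An optional group that did not take part in the match is None,
        # so fall back to an empty string like re.findall does.
        results.append((match.group('domain'), match.group('folder') or ''))
    return results

sample = "lalalla?url=http2F%2Fdomain.zz.zz/something%slgkfgs0sf"
print(extract_parts(sample))
```

Unlike re.findall, which only returns tuples of strings, re.finditer yields match objects, so the groups can be read by name and their positions are available through match.start() and match.end() if needed.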
Answered 2013-03-07T11:24:50.627

How about this:

import re
the_string = """lalalla?url=http@domain.zz%slgkfgs0sf"""
the_string = """lalalla?url=http@domain.zz.zz/something%slgkfgs0sf"""
#the_string = """lalalla?url=http@domain.zzz/something%slgkfgs0sf"""
#the_string = """lalalla?url=ht%domain.zz/something%slgkfgs0sf"""
#the_string = """lalalla?url=httpsd%domain.zz.zz/something%slgkfgs0sf"""
#the_string = """lalalla?url=www.domain.zzz/something%slgkfgs0sf"""

test = re.compile(r'(?P<base>[a-zA-Z0-9_\-\.]*?[a-zA-Z0-9_\-]+\.[z\.]+)(?P<extra>/[a-zA-Z0-9_\-]+)')

for match in test.finditer(the_string):
    print(match.group('base'))
    print(match.group('extra'))

Output:

domain.zz.zz
/something

This way you get the two pieces in 'base' and 'extra'. Concatenate them to get the full string back.
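
The recombination step could look like the sketch below. It is a hypothetical variation on the pattern above, not the pattern itself: it assumes a TLD of 2 to 4 letters with at most two levels (so .com and .co.uk qualify), makes 'extra' optional so that a bare domain still matches, and adds a lookahead so the domain cannot stop in the middle of a longer label. The names URL_PART and join_parts are made up for illustration.

```python
import re

# Hypothetical variant: one label, then one or two 2-4 letter suffixes,
# then an optional single path segment.
URL_PART = re.compile(
    r'(?P<base>[a-zA-Z0-9_-]+(?:\.[a-zA-Z]{2,4}){1,2})'
    r'(?![a-zA-Z0-9_-])'              # do not stop in the middle of a label
    r'(?P<extra>/[a-zA-Z0-9_-]+)?'    # the path segment is optional
)

def join_parts(text):
    """Return base + extra for the first match, or None if nothing matches."""
    match = URL_PART.search(text)
    if match is None:
        return None
    # 'extra' is None when the optional group did not take part in the match.
    return match.group('base') + (match.group('extra') or '')

print(join_parts("lalalla?url=http@domain.zz.zz/something%slgkfgs0sf"))
print(join_parts("lalalla?url=http@domain.zz%slgkfgs0sf"))
```

Because the label class excludes the dot, a leading "www." is skipped by the search rather than captured. Junk that is glued directly onto the domain with no separator (as in randomh564domain.zzz) cannot be told apart by a pattern like this and would still be captured.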

Edit: updated the pattern for better domain matching and changed print to Python 3 syntax.

Answered 2013-03-07T11:04:38.710