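I'm new to Python and was wondering whether there is a better way to match every form of URL that might appear in a given string. After searching on Google, I found plenty of solutions that extract the domain, replace the URL with a link, and so on, but none that remove the URLs from the string. I've included an example below for reference. Thanks!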
import re

thestring = 'this is some text that will have one form or the other url embedded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'
# Remove anything that looks like a URL. The pattern is kept on one line (a string literal cannot span lines) and the single quote inside it is escaped.
URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)
print '==' + URLless_string + '=='
Error log:
C:\Python27>python test.py
File "test.py", line 7
SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
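
For reference, here is a minimal sketch of the kind of result I'm after. It assumes a much simpler, ASCII-only pattern than the one above (so it is not equivalent to it), and `remove_urls` and `URL_PATTERN` are hypothetical names used only for illustration. The first line is the PEP 263 encoding declaration that the error message points to; Python 2 requires it only when the source file contains non-ASCII characters, which this sketch does not.

```python
# -*- coding: utf-8 -*-
# PEP 263 encoding declaration: needed in Python 2 if the file contains
# non-ASCII characters (this sketch is ASCII-only, so it is optional here).
import re

# Simplified pattern (an assumption, not the original regex): anything that
# starts with http://, https:// or www. and runs to the next whitespace.
URL_PATTERN = re.compile(r'(?i)\b(?:https?://|www\.)\S+')


def remove_urls(text):
    """Return text with URL-like substrings removed (hypothetical helper)."""
    return URL_PATTERN.sub('', text)


sample = ('for eg, http://www.google.com and http://www.google.co.uk '
          'and www.domain.co.uk and etc.')
# Single-argument print(...) behaves the same in Python 2 and Python 3.
print('==' + remove_urls(sample) + '==')
```

Because `\S+` runs to the next whitespace, this simplified pattern also swallows punctuation that directly follows a URL (for example a trailing full stop), and it leaves the surrounding spaces in place.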