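I'm new to Python and was wondering whether there is a better solution for matching all forms of URLs that might be found in a given string. After googling, there seem to be plenty of solutions that extract the domain, replace the URL with a link, and so on, but none that remove the URLs from the string. I've included an example below for reference. Thanks!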

str = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|

(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))', '', thestring)

print '==' + URLless_string + '=='

Error log:

C:\Python27>python test.py
  File "test.py", line 7
SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
2 Answers

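Your code has an error (actually two):

1. You should escape the second-to-last single quote with a backslash; otherwise it ends the string literal early: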

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

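2. You should not use str as a variable name. It is not a reserved keyword, but it is the name of Python's built-in string type, and rebinding it hides that built-in. Your re.sub call also already refers to thestring, so name the variable thestring or anything else.

For example: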

thestring = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

print URLless_string

Result:

this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, and and and etc.
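To illustrate the second point, here is a small sketch of what goes wrong once the name str is rebound. The function name demo_shadowing is hypothetical and exists only for this illustration; the rebinding is kept local to the function so the built-in stays usable elsewhere.

```python
# Illustration only: rebinding the name `str` hides the built-in type
# for as long as the new binding is in scope.
def demo_shadowing():
    str = 'some text'      # local name now shadows the built-in type
    try:
        str(42)            # tries to call a string object, not the type
    except TypeError as exc:
        return type(exc).__name__
    return None

shadow_error = demo_shadowing()
print(shadow_error)
```

Outside the function the built-in is untouched, which is why the rebinding is confined to a local scope here.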

answered 2012-12-29T11:59:37.713

Include an encoding declaration at the top of the source file (the regex string contains non-ASCII characters such as »), for example:

# -*- coding: utf-8 -*-
import re
...

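Also, surround the regex string with triple single quotes (''') or triple double quotes (""") instead of single quotes, because the string itself already contains both quote characters (' and ").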

r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
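Putting these suggestions together, a minimal sketch might look like the following. It assumes Python 3, where source files are UTF-8 by default, so the coding line is not needed and print is a function. The names URL_PATTERN and strip_urls are hypothetical, and collapsing the leftover double spaces is an extra step that is not part of the original question.

```python
import re

# Pattern copied from the answer above. Triple quotes allow the raw
# string to contain both ' and " without any escaping.
URL_PATTERN = re.compile(
    r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
)

def strip_urls(text):
    """Remove URL-like substrings, then collapse leftover runs of spaces."""
    without_urls = URL_PATTERN.sub('', text)
    # Removing a URL leaves the spaces on both sides of it behind.
    return re.sub(r' {2,}', ' ', without_urls).strip()

sample = ('for eg, http://www.google.com and http://www.google.co.uk '
          'and www.domain.co.uk and etc.')
cleaned = strip_urls(sample)
print(cleaned)
```

Compiling the pattern once keeps the long expression in a single place and avoids repeating it at every call site.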
answered 2012-12-29T11:25:11.993