python - How to parse a string containing URLs and turn them into proper links

Question

Suppose I have the following string from Twitter:

"This is my sample test blah blah http://t.co/pE6JSwG, hello all"

How can I parse this string and turn the link into <a href="link">link</a>? This is my code for parsing user tags:

    import re

    tweet = s.text  # s is defined elsewhere
    user_regex = re.compile(r'@[0-9a-zA-Z+_]*', re.IGNORECASE)

    # Replace each @mention with a link to the user's Twitter profile
    for tt in user_regex.finditer(tweet):
        url_tweet = tt.group(0).replace('@', '')
        tweet = tweet.replace(tt.group(0),
            '<a href="http://twitter.com/' +
            url_tweet + '" title="' +
            tt.group(0) + '">' +
            tt.group(0) + '</a>')

My current URL regex:

    http_regex = re.compile(r'[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]*', re.IGNORECASE)

score 1 · Accepted Answer


Maybe you can get some inspiration from the source code of the django-oembed project.

Answered 2010-12-03T15:28:33.513

score 1 · Accepted Answer

    >>> import re
    >>> test = "This is my sample test blah blah http://t.co/pE6JSwG, hello all"
    >>> re.sub('http://[^ ,]*', lambda t: "<a href='%s'>%s</a>" % (t.group(0), t.group(0)), test)
    "This is my sample test blah blah <a href='http://t.co/pE6JSwG'>http://t.co/pE6JSwG</a>, hello all"

This only works if you treat characters such as commas and spaces as valid stopping points for your URLs.
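One way to loosen that restriction is to match up to the next whitespace and then peel trailing punctuation off the match. The following is only a sketch under stated assumptions: URLs start with http:// or https://, and the characters in `TRAILING` never end a real link. The names `URL_RE`, `TRAILING`, and `linkify` are made up for illustration and do not come from any library.

```python
import re
from html import escape

# Assumption: a URL starts with http:// or https:// and runs until whitespace.
URL_RE = re.compile(r'https?://\S+')
# Assumption: these characters, when trailing, belong to the sentence, not the URL.
TRAILING = '.,;:!?)\'"'

def linkify(text):
    """Wrap each URL in text in an <a> tag, keeping trailing punctuation outside."""
    def repl(match):
        raw = match.group(0)
        url = raw.rstrip(TRAILING)      # drop sentence punctuation from the end
        tail = raw[len(url):]           # put it back after the closing tag
        safe = escape(url, quote=True)  # escape &, <, > and quotes in the URL
        return '<a href="%s">%s</a>%s' % (safe, safe, tail)
    return URL_RE.sub(repl, text)

print(linkify("This is my sample test blah blah http://t.co/pE6JSwG, hello all"))
```

Note that only the URL itself is escaped here; the surrounding tweet text is passed through unchanged, so it would need its own escaping before being placed in a page.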

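In general, you probably shouldn't use a regular expression for URL matching, because there may be no good way to know where a URL ends. If the strings are guaranteed to have the same format every time, this solution will work. It may also be that you always get URLs of the same length, in which case you can search for http and then take a substring of that length.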
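The fixed-length idea could look roughly like the sketch below. It assumes every link starts with `http://` and is exactly 19 characters long, like `http://t.co/pE6JSwG` in the sample; `PREFIX`, `URL_LENGTH`, and `linkify_fixed` are hypothetical names chosen for this example, and the approach breaks as soon as a link has a different length.

```python
PREFIX = "http://"
URL_LENGTH = 19  # assumption: every link is as long as "http://t.co/pE6JSwG"

def linkify_fixed(text):
    """Link every fixed-length URL found by a plain substring search."""
    parts = []
    pos = 0
    while True:
        start = text.find(PREFIX, pos)
        if start == -1:
            parts.append(text[pos:])   # no more links: keep the rest as-is
            break
        url = text[start:start + URL_LENGTH]
        parts.append(text[pos:start])  # text before the link
        parts.append("<a href='%s'>%s</a>" % (url, url))
        pos = start + len(url)         # continue right after the link
    return "".join(parts)

print(linkify_fixed("This is my sample test blah blah http://t.co/pE6JSwG, hello all"))
```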
