python - Python: Replace URLs in a string with their page titles

Question

I want to remove the URLs from a string and replace them with the titles of the pages they point to.

For example:

mystring = "Ah I like this site: http://www.stackoverflow.com. Also I must say I like http://www.digg.com"

sanitize(mystring) # it becomes "Ah I like this site: Stack Overflow. Also I must say I like Digg - The Latest News Headlines, Videos and Images"

To replace a URL with its title, I wrote this snippet:

# Python 2 with BeautifulSoup 3
import urllib
import BeautifulSoup

#get_title: string -> string
def get_title(url):
    """Returns the title of the input URL"""

    output = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
    return output.title.string

I somehow need to apply this function to the string, so that each URL is captured and converted to its title via get_title.

score 3 · Accepted Answer

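Here is a question with information about validating URLs in Python: How do you validate a URL with a regular expression in Python?

The urlparse module is probably your best option. You will still have to decide what constitutes a valid URL in the context of your application.

To check the string for URLs, iterate over each word in the string, check it, and replace each valid URL with its title.

Example code (you will need to write valid_url):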

def sanitize(mystring):
  for word in mystring.split(" "):
    if valid_url(word):
      mystring = mystring.replace(word, get_title(word))
  return mystring
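
The answer leaves valid_url unwritten. A minimal sketch of what it could look like follows, assuming Python 3 (where urlparse lives in urllib.parse) and assuming that "valid" means an http or https scheme plus a non-empty host. That definition is an illustration, not something the answer specifies.

```python
from urllib.parse import urlparse  # Python 3; in Python 2 this was the urlparse module


def valid_url(word):
    """Return True if word looks like an absolute http(s) URL.

    Assumption: "valid" means an http or https scheme plus a
    non-empty host. Adjust this to what your application accepts.
    """
    parts = urlparse(word)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Note that splitting on spaces keeps sentence punctuation attached to the word: in the question's example the word is "http://www.stackoverflow.com." with the trailing period, which passes this check and is replaced together with the URL. You may want to strip such punctuation before calling get_title.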

score 2 · Accepted Answer

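You can probably solve this with a regular expression and substitution (re.sub accepts a function, which is called with the Match object for each occurrence and returns the string that replaces it):

import re
url = re.compile(r"https?://\S+")  # rough pattern; tune it for your input
text = url.sub(lambda m: get_title(m.group(0)), text)  # the callback gets a Match, not a str

The hard part is writing a regular expression that matches URLs, no more and no less.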
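
As a rough illustration of the re.sub approach, the sketch below wraps it in a sanitize function. Several things here are assumptions rather than part of the answer: the loose URL pattern, the TRAILING punctuation set, and passing get_title as a parameter (done only so the example can run offline with a stand-in dictionary instead of fetching pages).

```python
import re

# Assumption: a URL is "http://" or "https://" followed by non-whitespace.
# This is deliberately loose and will not suit every input.
URL_RE = re.compile(r"https?://\S+")

# Punctuation that usually belongs to the sentence rather than the URL.
TRAILING = ".,;:!?)"


def sanitize(text, get_title):
    """Replace each URL in text with get_title(url)."""

    def replace(match):
        raw = match.group(0)
        url = raw.rstrip(TRAILING)
        # Re-attach the stripped punctuation after the title.
        return get_title(url) + raw[len(url):]

    return URL_RE.sub(replace, text)


# Stand-in for the real get_title so the example runs without network access.
fake_titles = {
    "http://www.stackoverflow.com": "Stack Overflow",
    "http://www.digg.com": "Digg",
}
print(sanitize("I like http://www.stackoverflow.com. Also http://www.digg.com",
               fake_titles.get))
# prints: I like Stack Overflow. Also Digg
```

In real use you would pass the question's get_title in place of fake_titles.get, and probably handle pages that fail to load or have no title.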
