python-2.7 - Trying to get all image links from reddit.com using Python and re

Question

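I have looked at other posts and tried to apply what they say to my code, but I am still missing something.

What I want to do is get all the image links from a website, specifically reddit.com. Once I have the links, I want to either display the images in a browser or download them and display them with Windows Image Viewer. I just want to practice and broaden my Python skills.

I am stuck on getting the links and on choosing how to display the images. What I have so far is: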

import urllib2
import re
links=urllib2.urlopen("http://www.reddit.com").read()
found=re.findall("http://imgur.com/+\w+.jpg", links)
print found #Just for testing purposes, to see what links are found

Thanks for your help.

score 3 · Accepted Answer

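The imgur.com links on reddit do not have a .jpg extension, so your regular expression will not match anything. You should look for the i.imgur.com domain instead.

Matching with re.findall("http://i.imgur.com/\w+.jpg", links) does return results: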

>>> re.findall("http://i.imgur.com/\w+.jpg", links)
['http://i.imgur.com/PMNZ2.jpg', 'http://i.imgur.com/akg4l.jpg', 'http://i.imgur.com/dAHtq.jpg', 'http://i.imgur.com/dAHtq.jpg', 'http://i.imgur.com/nT73r.jpg', 'http://i.imgur.com/nT73r.jpg', 'http://i.imgur.com/z2wIl.jpg', 'http://i.imgur.com/z2wIl.jpg']

You can extend this to other file extensions:

>>> re.findall("http://i.imgur.com/\w+.(?:jpg|gif|png)", links)
['http://i.imgur.com/PMNZ2.jpg', 'http://i.imgur.com/akg4l.jpg', 'http://i.imgur.com/dAHtq.jpg', 'http://i.imgur.com/dAHtq.jpg', 'http://i.imgur.com/rsIfN.png', 'http://i.imgur.com/rsIfN.png', 'http://i.imgur.com/nT73r.jpg', 'http://i.imgur.com/nT73r.jpg', 'http://i.imgur.com/bPs5N.gif', 'http://i.imgur.com/z2wIl.jpg', 'http://i.imgur.com/z2wIl.jpg']
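
The patterns above leave the dots unescaped, so each `.` matches any character, and the results contain repeated links. As a minimal sketch of one way to tighten this, the variant below escapes the dots, uses a raw string, and drops duplicates while keeping first-seen order. `sample_html`, `IMAGE_URL`, and `unique_image_links` are hypothetical names introduced for illustration, and the sample markup stands in for the downloaded page.

```python
import re

# Hypothetical sample markup standing in for the downloaded page contents.
sample_html = (
    '<a href="http://i.imgur.com/PMNZ2.jpg">one</a>'
    '<a href="http://i.imgur.com/PMNZ2.jpg">one again</a>'
    '<a href="http://i.imgur.com/rsIfN.png">two</a>'
    '<a href="http://imgur.com/abcde">no extension</a>'
)

# Raw string; the dots are escaped so they only match a literal ".".
IMAGE_URL = re.compile(r"http://i\.imgur\.com/\w+\.(?:jpg|gif|png)")


def unique_image_links(html):
    """Return matching links in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for url in IMAGE_URL.findall(html):
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


print(unique_image_links(sample_html))
```

With the real page, the same function could be called on the `links` string from the question instead of `sample_html`.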

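You may want to use a proper HTML parser instead of regular expressions; I can recommend BeautifulSoup and lxml. With these tools it becomes much easier to find all <img /> tags that link to i.imgur.com, including .gif and .png files, if there are any.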
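
To illustrate the parser-based approach without extra installs, here is a rough sketch that uses the standard library's HTMLParser as a stand-in for BeautifulSoup or lxml. It is not the BeautifulSoup or lxml API. It collects `src` values from `img` tags and `href` values from `a` tags whose host is i.imgur.com; including `a` tags is an assumption on my part, since links on the page may point at images without embedding them. `ImgurLinkCollector`, `collect_imgur_links`, and the sample markup are hypothetical names and data introduced for illustration.

```python
try:
    from html.parser import HTMLParser   # Python 3
    from urllib.parse import urlparse
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7
    from urlparse import urlparse

IMAGE_EXTENSIONS = (".jpg", ".gif", ".png")


class ImgurLinkCollector(HTMLParser):
    """Collect i.imgur.com image URLs from <img src> and <a href>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Pick the attribute that carries the URL for this tag, if any.
        wanted = {"img": "src", "a": "href"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name != wanted or not value:
                continue
            parsed = urlparse(value)
            if (parsed.netloc == "i.imgur.com"
                    and parsed.path.lower().endswith(IMAGE_EXTENSIONS)):
                self.links.append(value)


def collect_imgur_links(html):
    """Parse the markup and return the collected links in document order."""
    collector = ImgurLinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical sample markup standing in for the downloaded page contents.
sample_html = """
<a href="http://i.imgur.com/PMNZ2.jpg">direct link</a>
<a href="http://imgur.com/abcde">page link, no extension</a>
<img src="http://i.imgur.com/bPs5N.gif" />
"""

print(collect_imgur_links(sample_html))
```

Compared with the regular expression, this checks the host and the file extension on the parsed URL rather than on raw text, so it does not depend on how the surrounding markup is written.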
