python - How do I scrape the "right" photo from a web page?

Question

Scraping the right photo from a website: I am building a simple news app. I already have the article, but I need to pick the right photo for it.

For example, in:

http://www.politico.com/story/2013/09/government-shutdown-2013-gop-narrative-97521.html

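I want to grab the URL of the photo of the three people, but the page contains several images that could be scraped. How do I know which one is the right photo? What logic do news.google and Flipboard use to pick the "right" photo from this article, or from any article?

I have noticed that most of the time these photos are in a slideshow. How can I use Beautiful Soup to scrape the photos from these slideshows?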

score 4 · Accepted Answer

The page has a meta tag that follows the Open Graph protocol:

<meta property="og:image" content="http://images.politico.com/global/2013/09/29/mccarthy_blackburn_cruz_ap_ftn_ap_328.jpg"/>

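This gives the image that the site's creator suggests using as a preview (and it is indeed the photo of the three people).

You can get this image's address with BeautifulSoup like this: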

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.politico.com/story/2013/09/government-shutdown-2013-gop-narrative-97521.html"
# Download the page and parse it with Python's built-in HTML parser
bs = BeautifulSoup(urllib.request.urlopen(url), "html.parser")

# Look for the Open Graph preview image tag
metatag = bs.find("meta", {"property": "og:image"})
if metatag is not None:
    print(metatag["content"])
else:
    print("This page has no Open Graph meta image tag")
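
Not every page carries an og:image tag, so a fallback chain may be useful. The following is a minimal sketch, not the logic of any particular news aggregator: it uses only the standard library so it runs without dependencies, and the names PreviewImageParser and pick_preview_image are hypothetical. It assumes a preference order of og:image, then twitter:image, then the first <img> on the page; that last step is a crude heuristic chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PreviewImageParser(HTMLParser):
    """Collect candidate preview images from <meta> and <img> tags."""

    def __init__(self):
        super().__init__()
        self.meta_images = {}  # first value seen per key, e.g. "og:image"
        self.img_sources = []  # every <img src> in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph uses property="..."; Twitter Cards use name="..."
            key = attrs.get("property") or attrs.get("name")
            if key in ("og:image", "twitter:image") and attrs.get("content"):
                self.meta_images.setdefault(key, attrs["content"])
        elif tag == "img" and attrs.get("src"):
            self.img_sources.append(attrs["src"])


def pick_preview_image(html, base_url):
    """Return an absolute image URL, or None if the page has no candidates."""
    parser = PreviewImageParser()
    parser.feed(html)
    # Assumed preference order: publisher-declared tags first
    for key in ("og:image", "twitter:image"):
        if key in parser.meta_images:
            return urljoin(base_url, parser.meta_images[key])
    # Crude fallback (an assumption): the first <img> in the document
    if parser.img_sources:
        return urljoin(base_url, parser.img_sources[0])
    return None
```

Passing the page's HTML text together with the page URL lets urljoin resolve relative image paths. For slideshow photos a real application would likely need site-specific selectors, since the markup varies from site to site.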
