python - Counting the number of images on a web page with urllib

Question

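For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regular expression to try to locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code: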

import re
import sys  # needed for sys.stderr and sys.exit below
import urllib.request
img_pat = re.compile('<img.*>',re.I)

def get_img_cnt(url):
  try:
      w =  urllib.request.urlopen(url)
  except IOError:
      sys.stderr.write("Couldn't connect to %s " % url)
      sys.exit(1)
  contents =  str(w.read())
  img_num = len(img_pat.findall(contents))
  return (img_num)

print (get_img_cnt('http://www.americascup.com/en/schedules/races'))

score 10 · Accepted Answer

Never use regular expressions to parse HTML; use an HTML parser such as lxml or BeautifulSoup. Here is a working example of how to get the img tag count using BeautifulSoup and requests:

from bs4 import BeautifulSoup
import requests


def get_img_cnt(url):
    response = requests.get(url)
    # Name the parser explicitly; recent bs4 versions warn when it is omitted.
    soup = BeautifulSoup(response.content, 'html.parser')

    return len(soup.find_all('img'))


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Here is a working example using lxml and requests:

from lxml import etree
import requests


def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)

    return int(root.xpath('count(//img)'))


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Both snippets print 106.
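
If neither third-party library is installed, the same idea can be sketched with the standard library's html.parser module. This is not part of the original answer: ImgCounter and count_imgs are hypothetical names, and the sample markup is made up so the snippet runs without a network connection.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> start tags (hypothetical helper, for illustration only)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <IMG> is counted as well.
        # Self-closing tags such as <img ... /> also reach this method,
        # because the default handle_startendtag calls handle_starttag.
        if tag == "img":
            self.count += 1


def count_imgs(html_text):
    """Return the number of <img> tags found in an HTML string."""
    counter = ImgCounter()
    counter.feed(html_text)
    counter.close()
    return counter.count


# Made-up sample page with two images.
sample = '<html><body><img src="a.png"><p>text</p><IMG src="b.png"/></body></html>'
print(count_imgs(sample))
```

To use it on a live page, the decoded body of the response (for example response.text from requests) could be passed to count_imgs.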

Hope that helps.

score 2 · Accepted Answer

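Ahhh, regular expressions.

Your regex pattern <img.*> says "find me something that starts with <img, and make sure it ends with >".

Regex quantifiers are greedy by default, though. The .* will swallow everything it can, while still leaving a > character somewhere afterwards to satisfy the pattern. In this case, it runs all the way to the end, to </html>, and says "Look! I found a > right there!"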
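
A small sketch of why the count comes out as exactly one. Besides greediness, one likely contributing factor is the str(w.read()) call in the question: calling str() on a bytes object returns its repr, in which every newline becomes the two characters \ and n, so the whole page turns into a single line that .* can cross. The bytes literal below is a made-up stand-in for w.read().

```python
import re

img_pat = re.compile('<img.*>', re.I)  # the greedy pattern from the question

# Made-up stand-in for w.read(): raw bytes containing real newlines.
raw = b'<html>\n<img src="a.png">\n<p>hi</p>\n<img src="b.png">\n</html>'

# str() on bytes gives the repr ("b'<html>\\n<img ...'"), with no real newlines.
as_str = str(raw)

# Decoding keeps the real newlines, which "." does not match by default.
decoded = raw.decode('utf-8')

print(len(img_pat.findall(as_str)))   # greedy .* runs to the last ">" in the string
print(len(img_pat.findall(decoded)))  # matches cannot cross line breaks here
```

Decoding only helps in this sample because each tag sits on its own line; two img tags on the same line would still be merged into one greedy match.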

You should get the correct count by making the .* non-greedy, like this:

<img.*?>
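
A minimal comparison of the two patterns on a made-up one-line snippet (the html string is an assumption for illustration, not taken from the page in the question):

```python
import re

# Made-up markup with two images on a single line.
html = '<p><img src="a.png"> text <img src="b.png"></p>'

greedy = re.compile('<img.*>', re.I)   # pattern from the question
lazy = re.compile('<img.*?>', re.I)    # non-greedy version

greedy_matches = greedy.findall(html)
lazy_matches = lazy.findall(html)

print(len(greedy_matches), greedy_matches)
print(len(lazy_matches), lazy_matches)
```

The greedy pattern returns a single match stretching from the first <img to the final >, while the non-greedy one stops at the first > after each <img.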

score 1 · Accepted Answer

Your regex is greedy, so it matches much more than you want. I suggest using an HTML parser.

img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.

A good site for checking on the fly what your regex matches: http://www.pyregex.com/
Learn more about regular expressions: http://docs.python.org/2/library/re.html
