python - Counting HTML images with Python

Question

I need some feedback on how to count the images in an HTML page with Python 3.01 after fetching it. Maybe my regular expression is not being used properly.

Here is my code:

import re, os
import urllib.request
def get_image(url):
  url = 'http://www.google.com'
  total = 0
  try:
    f = urllib.request.urlopen(url)
    for line in f.readline():
      line = re.compile('<img.*?src="(.*?)">')
      if total > 0:
        x = line.count(total)
        total += x
        print('Images total:', total)

  except:
    pass

score 1 · Accepted Answer

Use beautifulsoup4 (an HTML parser) instead of a regular expression:

import urllib.request

import bs4  # beautifulsoup4

html = urllib.request.urlopen('http://www.imgur.com/').read()
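# Name the parser explicitly; recent bs4 versions warn when it is omitted.
soup = bs4.BeautifulSoup(html, 'html.parser')
# find_all is the current spelling of the older findAll alias.
images = soup.find_all('img')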
print(len(images))
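
If installing beautifulsoup4 is not an option, the same idea (let a real parser find the tags, then count them) can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the answer's own method; the names ImageCounter and count_images and the sample string are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class ImageCounter(HTMLParser):
    """Counts <img> start tags seen while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> is counted as well.
        # Self-closing tags such as <img ... /> also arrive here by default.
        if tag == "img":
            self.count += 1


def count_images(html_text):
    """Return the number of <img> tags in a string of HTML."""
    parser = ImageCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


# Assumed sample input; a real page would come from urlopen(...).read().decode()
sample = '<p><img src="a.png"><IMG alt="x" src="b.png"/></p>'
print(count_images(sample))  # prints 2
```

Because the counting function takes a string, it can be checked without any network access, and the download step can be handled separately.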

score 0 · Accepted Answer

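A few points about your code:

It is much easier to parse the page with a dedicated HTML parsing library (that is the Python way). I personally prefer Beautiful Soup.
You overwrite your line variable inside the loop.
total will always be 0 with your current logic, because it is only updated when it is already greater than 0.
There is no need to compile your RE, since the re module caches compiled patterns.
You are discarding your exceptions, so there is no clue about what went wrong in the code!
The <img> tag may have other attributes, so your regex is a bit basic. Also, use the re.findall() method to catch multiple instances on the same line.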
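
The last point is easy to check offline. The sketch below runs the question's pattern and two alternatives against a made-up line of HTML; the sample string and the variable names strict, loose and srcs are assumptions for illustration only.

```python
import re

# Hypothetical sample line: two <img> tags, the first with an attribute after src.
line = '<img src="a.png" width=10><img src="b.png">'

# The question's pattern expects '">' right after the src value, so the lazy
# group keeps going until the first '">' it finds, swallowing both tags.
strict = re.findall(r'<img.*?src="(.*?)">', line)

# A looser pattern that only needs "<img" followed by the next ">".
loose = re.findall(r'<img.*?>', line)

# One possible pattern for pulling out src values regardless of other attributes.
srcs = re.findall(r'<img\b[^>]*?\bsrc="([^"]*)"', line)

print(len(strict))  # prints 1
print(len(loose))   # prints 2
print(srcs)         # prints ['a.png', 'b.png']
```

Even the looser patterns are only a rough approximation of HTML, which is why a parser is the safer choice.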

Changing your code a little, I get:

import re
from urllib.request import urlopen

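def get_image(url):
    total = 0
    # readlines() returns one bytes object per line of the response
    page = urlopen(url).readlines()

    for line in page:
        # decode instead of str(line), which would give the "b'...'" repr
        text = line.decode('utf-8', errors='replace')
        # findall catches several <img> tags on the same line
        hit = re.findall('<img.*?>', text)
        total += len(hit)

    print('{0} Images total: {1}'.format(url, total))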

get_image("http://google.com")
get_image("http://flickr.com")
