python - Searching page source with urllib

Question

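I'm trying to write a script that searches a website's source code for text. So far it successfully fetches the source and prints it, which looks like: b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE html...... and so on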

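However, when I try to find the "div" tag in the source with print(page.find('div')), I get the error TypeError: Type str doesn't support the buffer API. I think this is because I'm getting a bytes literal back. How can I encode it as UTF-8 or ASCII so that I can search it as a string?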

In case it's needed, here is the simple code I'm running:

import urllib.request
from urllib.error import URLError

def get_page(url):
  #make the request
  req = urllib.request.Request(url)
  the_page = urllib.request.urlopen(req)

  #get the results of the request
  try:
    #read the page
    page = the_page.read()
    print(page)
    print(page.find('div'))

  #except error
  except URLError as e:
    #if error has a reason (thus is url error) print the reason
    if hasattr(e, 'reason'):
      print(e.reason)
    #if error has a code (thus is html error) print the code and the error
    if hasattr(e, 'code'):
      print(e.code)
      print(e.read())

score 0 · Accepted Answer

I assume you are using Python 3 (as indicated by print being used as a function rather than a statement).

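In Python 3, page is a bytes object, because read() returns the raw bytes of the response. A bytes object can only be searched with another bytes object, so the pattern must be bytes as well. Try this: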

print(page.find(b'div'))
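
Since the question also asks how to get a searchable string, here is a minimal sketch of both options side by side. The sample bytes stand in for the result of the_page.read() and are made up for illustration; the UTF-8 encoding is an assumption based on the XML declaration shown in the question, not something confirmed for the real page.

```python
# Hypothetical sample standing in for the_page.read(); not the real page.
page = (b'<?xml version="1.0" encoding="UTF-8" ?>\n'
        b'<!DOCTYPE html>\n<html><body><div>hi</div></body></html>')

# Option 1: keep the bytes and search with a bytes pattern.
bytes_index = page.find(b'div')

# Option 2: decode the bytes to str, then search with a str pattern.
# Assumes the page really is UTF-8; decode() raises UnicodeDecodeError if not.
text = page.decode('utf-8')
str_index = text.find('div')

print(bytes_index, str_index)
```

Note that the conversion from bytes to str is a decode, not an encode. With a real response, the_page.headers.get_content_charset() may report the charset declared by the server (it returns None when none is declared), which can be passed to decode() instead of a hard-coded 'utf-8'.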

Hope this helps.
