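I have a static HTML page saved on my local machine. I tried reading it both with a plain file open and with BeautifulSoup. With the plain file open, a Unicode error stops it before the whole HTML file is read, and BeautifulSoup only works for me on live websites.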
# With BeautifulSoup
from bs4 import BeautifulSoup
import urllib.request

url = "Stack Overflow.html"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read())
universities = soup.find_all('a', class_='institution')
for university in universities:
    print(university['href'] + "," + university.string)
# Simple file read
with open('Stack Overflow.html', encoding='utf-8') as f:
    for line in f:
        print(repr(line))
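For the Unicode error, one thing that might help is reading the file as raw bytes and decoding it leniently, so a few bad bytes do not abort the whole read. This is only a sketch: `read_html_text` is a hypothetical helper name, and it assumes the file is mostly UTF-8 with some stray bytes in it.

```python
from pathlib import Path


def read_html_text(path, encoding="utf-8"):
    """Return the whole file as text (hypothetical helper).

    Bytes that cannot be decoded are replaced with U+FFFD
    instead of raising UnicodeDecodeError.
    """
    raw = Path(path).read_bytes()
    return raw.decode(encoding, errors="replace")
```

If the replacement characters show up in places that matter, the file is probably saved in another encoding, and passing that encoding instead of `utf-8` would be the better fix.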
After reading the HTML, I want to extract data from its ul and li elements, which have no attributes. Any suggestions are welcome.
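To make the goal concrete, here is a rough sketch of the kind of extraction I am after, assuming BeautifulSoup can be given the HTML as a string. `list_items` and the sample markup are made up for illustration; the real page will differ.

```python
from bs4 import BeautifulSoup


def list_items(html):
    """Return the stripped text of every <li> that sits inside a <ul>.

    Hypothetical helper: the elements are matched by tag name only,
    since they carry no attributes to filter on.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True) for li in soup.select("ul li")]


# Made-up sample markup standing in for the saved page.
sample = "<ul><li>Alpha</li><li>Beta</li></ul>"
print(list_items(sample))
```

For the saved page, the string would come from something like `open("Stack Overflow.html", encoding="utf-8", errors="replace").read()` rather than from `urllib.request.urlopen`, which expects a URL and not a bare file name.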