python - Python web scraping returns an error

Question

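I'm currently learning Python and trying to learn web scraping. I've been using sample code from a few tutorials, but I ran into a problem with one of the sites I'm looking at. The following code should return the site's title: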

import urllib
import re
urls = ["http://www.libyaherald.com"]
i=0
regex='<title>(.+?)</title>'
pattern = re.compile(regex)
while i< len(urls):
    htmlfile = urllib.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles=re.findall(pattern,htmltext)
    print titles
    i+=1

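Fetching the title of the Libya Herald website returns an error. I checked the Libya Herald page source, and its doctype is <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">.

Does the doctype have anything to do with why I can't scrape it?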

score 0 · Accepted Answer

For serious web scraping in Python, I strongly recommend Scrapy.

As far as I know, regular expressions are not recommended for parsing HTML. Try BeautifulSoup (BS4), like the pizza guy said :)
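To illustrate why a regex is fragile here, below is a minimal sketch in Python 3. The sample HTML, its title text, and the `TitleParser` helper are made up for illustration, and the standard library's `html.parser` stands in for BeautifulSoup so the snippet runs without extra packages. It suggests that the doctype itself is harmless to a parser, while small variations in the markup are enough to defeat the pattern from the question.

```python
import re
from html.parser import HTMLParser

# Made-up sample page: the <title> tag is upper-case and spans several lines.
SAMPLE_HTML = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head>
<TITLE>
  Example News Site
</TITLE>
</head><body><p>Hello</p></body></html>"""

# The pattern from the question: "." does not match newlines and the
# match is case-sensitive, so it finds nothing in this sample.
regex_titles = re.findall(r'<title>(.+?)</title>', SAMPLE_HTML)


class TitleParser(HTMLParser):
    """Hypothetical helper that collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Keep only the text that appears between <title> and </title>.
        if self.in_title:
            self.parts.append(data)

    @property
    def title(self):
        return ''.join(self.parts).strip()


parser = TitleParser()
parser.feed(SAMPLE_HTML)

print(regex_titles)   # the regex misses the title
print(parser.title)   # the parser still finds it
```

Flags such as `re.DOTALL` and `re.IGNORECASE` would patch these two cases, but attributes on the tag or stray whitespace would break the pattern again, which is the usual argument for a real parser.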

score 0 · Accepted Answer

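As @Puciek said, scraping HTML with regular expressions is going to be hard. I suggest you start using a package for this; one that is very easy to install and use is BeautifulSoup.

Once it is installed, you can try this simple example: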

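from bs4 import BeautifulSoup
import requests

# Download the page and take the response body as text
html = requests.get('http://www.libyaherald.com').text
# Name the parser explicitly so bs4 does not have to guess one
bs = BeautifulSoup(html, 'html.parser')

# find() returns the first <title> tag; .text is its string content
title = bs.find('title').text
print(title)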
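The example above assumes the request succeeds. The question does not say which error was raised, so a reasonable first step is to surface it. Here is a minimal sketch for Python 3 that uses the standard library's `urllib.request` in place of `requests`, so it runs without extra packages. The `fetch_html` helper, the 10-second timeout, and the `User-Agent` value are illustrative assumptions, not something the site is known to require.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page text, or None if the fetch fails."""
    # Some sites reject the default Python user agent; this header value
    # is an illustrative assumption.
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        print('HTTP error:', err.code, err.reason)
    except urllib.error.URLError as err:
        # No usable response (DNS failure, refused connection, unknown scheme, ...).
        print('Connection problem:', err.reason)
    return None
```

A call such as `fetch_html('http://www.libyaherald.com')` would then either return the HTML, which can be passed to `BeautifulSoup` as in the example above, or print the reason the request failed.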
