python - Python urllib: skip URL on HTTP or URL error

Question

How can I modify my script to skip a URL if the connection times out or the URL is invalid or returns a 404?

Python

#!/usr/bin/python

#parser.py: Downloads Bibles and parses all data within <article> tags.

__author__      = "Cody Bouche"
__copyright__   = "Copyright 2012 Digital Bible Society"

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if (os.path.isdir(dirname) == 0):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)
print("DOWNLOADS COMPLETE!")

score 2 · Accepted Answer

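To apply a timeout to your request, add the timeout argument to your call to urlopen. From the docs:

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

See the section of this guide on how to handle exceptions with urllib2. Actually, I found the whole guide very useful.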

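The HTTP status code for a request timeout is 408. Note that 408 is sent by the server; when the timeout passed to urlopen expires on the client side, there is no HTTP code, and the error shows up as a URLError whose reason is a socket.timeout. To sum up, if you were to handle timeout exceptions, you would: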

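import socket
from urllib2 import urlopen, URLError

try:
    # timeout must be passed by keyword; the second positional argument is data
    response = urlopen(req, timeout=3)  # 3 seconds
except URLError as e:
    if hasattr(e, 'code'):
        # HTTPError: the server replied with an error status
        if e.code == 408:
            print('Timeout %s' % e.code)
        if e.code == 404:
            print('File Not Found %s' % e.code)
        # etc etc
    elif isinstance(e.reason, socket.timeout):
        # the client-side timeout expired, so there is no HTTP status code
        print('Timed out: %s' % e.reason)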
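
To make the different failure modes easier to tell apart, here is a minimal sketch of a classifier for the exceptions urlopen can raise. It is written for Python 3, where urllib2 was split into urllib.request and urllib.error; on Python 2 the same classes live in urllib2. The helper name describe_failure and the label strings are made up for illustration and are not part of any library.

```python
import socket
from urllib.error import HTTPError, URLError


def describe_failure(exc):
    """Return a short label for an exception raised while fetching a URL.

    Hypothetical helper; the label strings are arbitrary.
    """
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, HTTPError):
        return "http %d" % exc.code
    if isinstance(exc, URLError):
        # A timeout while connecting is wrapped in URLError.reason.
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        return "url error: %s" % exc.reason
    # A timeout while reading the body can surface as a bare socket.timeout.
    if isinstance(exc, socket.timeout):
        return "timeout"
    # Anything else is not a fetch failure this helper knows about.
    raise exc
```

The order of the checks matters: testing for URLError before HTTPError would swallow the 404 case and lose the status code.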

score 1 · Accepted Answer

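Try putting your urlopen line inside a try/except statement. Take a look at this:

docs.python.org/tutorial/errors.html, section 8.3

Look through the different exceptions, and when you hit one, just use the continue statement to move on to the next iteration of the loop.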
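
The try/except plus continue pattern could look roughly like the sketch below. It is written for Python 3 (urllib.request and urllib.error instead of urllib2). The function name fetch_all, the default timeout of 10 seconds, and the opener parameter are assumptions added for illustration; opener exists only so the loop can be exercised without network access.

```python
import socket
from urllib.error import URLError
from urllib.request import urlopen


def fetch_all(urls, timeout=10, opener=urlopen):
    """Fetch every URL, skipping the ones that fail.

    Returns (pages, skipped): pages maps url -> body bytes,
    skipped is a list of (url, exception) pairs.
    """
    pages = {}
    skipped = []
    for url in urls:
        try:
            response = opener(url, timeout=timeout)
            body = response.read()
        except (URLError, socket.timeout) as exc:
            # HTTPError (404 and so on) is a subclass of URLError,
            # so HTTP errors land here as well.
            skipped.append((url, exc))
            continue  # skip this URL and move on to the next one
        pages[url] = body
    return pages, skipped
```

In the script from the question, the body of the for loop would play the role of the try block here: the urlopen and read calls go inside try, and the parsing and file writing stay after it so that they only run for URLs that were fetched successfully.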
