python - Python urllib2 and "[errno 10054] An existing connection was forcibly closed by the remote host", plus a few other urllib2 problems

Question

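I wrote a crawler that uses urllib2 to fetch URLs.

Every few requests I run into some strange behavior. I tried analyzing it with Wireshark but could not work out what the problem is.

getPAGE() is responsible for fetching a URL. If the URL is fetched successfully it returns the content (response.read()); otherwise it returns None.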

import time
from urllib2 import Request, urlopen, HTTPError, URLError

def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    while attempts < 2:
        req = Request(FetchAddress, None ,headers)
        try:
            response = urlopen(req) #fetching the url
        except HTTPError, e:
            print 'The server didn\'t do the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            return response.read()
    return None
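
Two things the loop above leaves open are that response.read() runs in the else branch, outside the try, and that urlopen is called without a timeout, so a stalled connection can block for a long time. The following is only a minimal sketch of one way to cover both. It assumes Python 3's urllib.request in place of urllib2, and the names fetch_page, opener, delay and timeout are hypothetical, not part of the original code.

```python
import time
from urllib.request import Request, urlopen

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # placeholder header value (assumption)


def fetch_page(url, opener=urlopen, attempts=2, delay=4, timeout=30):
    """Return the response body as bytes, or None after `attempts` failures."""
    for attempt in range(attempts):
        try:
            response = opener(Request(url, None, HEADERS), timeout=timeout)
            # read() stays inside the try: a connection reset can surface here too
            return response.read()
        except OSError as e:
            # HTTPError, URLError, socket timeouts and connection resets
            # are all OSError subclasses in Python 3
            print('Attempt %d failed: %r  address: %s' % (attempt + 1, e, url))
            time.sleep(delay)
    return None
```

The opener parameter exists only so the retry logic can be exercised without a network; in normal use it stays at its default, urlopen.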

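This is the function that calls getPAGE() and checks whether the page I fetched is valid. The check is companyID = soup.find('span',id='lblCompanyNumber').string; if companyID is None, the page is invalid. If the page is valid, it saves the soup object to a global variable named 'curRes'.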

def isValid(ID):
    global curRes
    try:
        address = urlPath+str(ID)
        page = getPAGE(address)
        if page == None:
            saveToCsv(ID, badRequest = True)
            return False
    except Exception, e:
        print "An error occurred in the first Exception block of parseHTML : " + str(e) +' address: ' + address
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occurred in the second Exception block of parseHTML : " + str(e) +' address: ' + address
            return False
        try:
            companyID = soup.find('span',id='lblCompanyNumber').string
            if (companyID == None): #if lblCompanyNumber is None we can assume that we don't have the content we want, save in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                curRes = soup #we have the data we need, save the soup obj to a global variable
                return True
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False

The strange behavior is:

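Sometimes urllib2 performs a GET request and, without waiting for the reply, sends the next GET request (ignoring the previous one).
Sometimes I get "[errno 10054] An existing connection was forcibly closed by the remote host", after which the code just hangs for about 20 minutes waiting for the server's response. While it is stuck, I copy the URL and fetch it manually, and I get a response in less than 1 second (?).
If getPAGE() fails to fetch the URL it returns None to isValid(), yet sometimes I get this error: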

Error while parsing this page, third exception block: 'NoneType' object has no attribute 'string' id:....

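This is strange because I only create the soup object if I got a valid result from getPAGE(), yet the soup function seems to return None, and an exception is raised whenever I try to run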

companyID = soup.find('span',id='lblCompanyNumber').string

The soup object should never be None; if execution reaches that part of the code, it should have received HTML from getPAGE().
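
One possible reading of that error message: in BeautifulSoup, find() returns None when no element matches, so the None may be the result of soup.find(...) rather than the soup object itself, and the .string access then fails before the companyID == None check is reached. The sketch below illustrates the same pattern with a small stand-in built on the standard library's html.parser (not BeautifulSoup; find_span_text and _SpanFinder are hypothetical helpers), checking the lookup result before using it.

```python
from html.parser import HTMLParser


class _SpanFinder(HTMLParser):
    """Collects the text of the first <span> whose id matches span_id."""

    def __init__(self, span_id):
        super().__init__()
        self.span_id = span_id
        self.inside = False   # True while parsing inside the target span
        self.found = False    # True once the target span has been seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span' and not self.found and dict(attrs).get('id') == self.span_id:
            self.inside = True
            self.found = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def find_span_text(html, span_id):
    """Return the span's text, or None if the span is missing (like find())."""
    finder = _SpanFinder(span_id)
    finder.feed(html)
    finder.close()
    return ''.join(finder.parts) if finder.found else None


# A page that came back without the expected span, e.g. an error page.
company_id = find_span_text('<html><body>Error page</body></html>', 'lblCompanyNumber')
if company_id is None:
    print('span not found: treat the page as invalid instead of reading .string')
```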

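I checked and found that this problem is somehow connected to the first one (sending a GET and not waiting for the reply). In Wireshark I saw that every time I hit that exception, it was for a URL where urllib2 sent a GET request but did not wait for the response and moved on. getPAGE() should have returned None for that URL, but if it had returned None, isValid(ID) would not have gotten past the "if page == None:" condition. I can't figure out why this happens, and the problem is impossible to reproduce.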

I've read that time.sleep() can cause problems with urllib2 and threading, so maybe I should avoid using it?

Why doesn't urllib2 always wait for the response (it only rarely fails to wait)?

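What can I do about the "[errno 10054] An existing connection was forcibly closed by the remote host" error? By the way, the exception is not caught by getPAGE()'s try/except block; it is caught by the first try/except block in isValid(), which is also strange because getPAGE() is supposed to catch every exception it raises.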

try:
    address = urlPath+str(ID)
    page = getPAGE(address)
    if page == None:
        saveToCsv(ID, badRequest = True)
        return False
except Exception, e:
    print "An error occurred in the first Exception block of parseHTML : " + str(e) +' address: ' + address
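
Two properties of Python's try statement may be relevant to an exception escaping getPAGE(). First, code in the else branch (where response.read() is called) is not covered by the except clauses of the same try. Second, an except handler that reads an attribute the exception does not have, such as e.reason on something other than a URLError, raises AttributeError itself. A minimal Python 3 sketch of both effects, using hypothetical names (fetch, caller, handler_that_fails) rather than the original functions:

```python
def fetch(read_fails):
    try:
        data = 'response'          # stands in for urlopen(req) succeeding
    except Exception as e:
        return 'caught inside fetch: %s' % e
    else:
        # Code in the else branch is NOT protected by the except clauses above.
        if read_fails:             # stands in for response.read() failing
            raise ConnectionResetError(10054, 'connection forcibly closed')
        return data


def caller(read_fails):
    try:
        return fetch(read_fails)
    except Exception as e:
        return 'caught in caller: %s' % e


def handler_that_fails():
    try:
        raise ValueError('not a URLError')
    except Exception as e:
        # ValueError has no .reason, so this line raises AttributeError,
        # which leaves the function instead of being handled here.
        return 'Reason: ' + str(e.reason)


print(caller(False))
print(caller(True))
```

In this sketch the second call is handled by caller() rather than fetch(), which mirrors the exception surfacing in the outer try/except block.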

Thanks!
