
I have two functions that each work fine on their own, but when I run them nested together, the script seems to crash.

def scrape_all_pages(alphabet):
    pages = get_all_urls(alphabet)
    for page in pages:
        scrape_table(page)

I'm trying to scrape some search results systematically. get_all_urls() builds a list of URLs for each letter of the alphabet; sometimes there are thousands of pages, but it works fine. Then, for each page, scrape_table() scrapes just the table I'm interested in. That also works fine. I can run the whole thing and it runs well, but I'm working in ScraperWiki, and if I set it running and walk away, it always gives me a "list index out of range" error. That is definitely a problem on the ScraperWiki side, but I'd like to find a way around it by adding some try/except clauses and logging the error when I hit one. Something like:

def scrape_all_pages(alphabet):
    try:
        pages = get_all_urls(alphabet)
    except:
        pass  ## LOG THE ERROR IF THAT FAILS.
    try:
        for page in pages:
            scrape_table(page)
    except:
        pass  ## LOG THE ERROR IF THAT FAILS

I can't figure out how to log the errors generically, though. Also, the code above looks clunky, and in my experience, when something looks clunky, Python has a better way. Is there a better way?


5 Answers


You can specify the particular type of exception to catch, and a variable to hold the exception instance:

def scrape_all_pages(alphabet):
    try:
        pages = get_all_urls(alphabet)
        for page in pages:
            scrape_table(page)
    except IndexError as error:
        # Will only catch IndexError ("list index out of range")
        print error
    except Exception as error:
        # Will catch any other exception derived from Exception
        print error

Catching the Exception type will catch all errors, since they should all inherit from Exception.

This is the only way I know of to catch errors.
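
If the goal is to log the error instead of printing it, the same structure works with the standard logging module. Here's a minimal sketch, assuming a log file name of scrape_errors.log and reusing the question's functions:

import logging

# Assumption: append errors to scrape_errors.log in the working directory.
logging.basicConfig(filename="scrape_errors.log", level=logging.ERROR)

def scrape_all_pages(alphabet):
    try:
        pages = get_all_urls(alphabet)
        for page in pages:
            scrape_table(page)
    except IndexError:
        # logging.exception records the message plus the full traceback
        logging.exception("list index out of range while scraping")
    except Exception:
        # any other Exception subclass lands here
        logging.exception("unexpected error while scraping")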

Answered 2012-11-25T19:44:21.820

Wrap the logging in a context manager, like this (you can easily change the details to fit your requirements):

import traceback

# This is a context manager
class LogError(object):
    def __init__(self, logfile, message):
        self.logfile = logfile
        self.message = message
    def __enter__(self):
        return self
    def __exit__(self, type, value, tb):
        if type is None or not issubclass(type, Exception):
            # Allow KeyboardInterrupt and other non-standard exceptions to pass through
            return

        self.logfile.write("%s: %r\n" % (self.message, value))
        traceback.print_exception(type, value, tb, file=self.logfile)
        return True # "swallow" the traceback

# This is a helper class to maintain an open file object and
# a way to provide extra information to the context manager.
class ExceptionLogger(object):
    def __init__(self, filename):
        self.logfile = open(filename, "a")  # append, so earlier log entries are kept
    def __call__(self, message):
        # override function() call so that I can specify a message
        return LogError(self.logfile, message)

The key part is that __exit__ can return True, in which case the exception is ignored and the program continues. The code also needs to be a little careful, since a KeyboardInterrupt (control-C), SystemExit, or other non-standard exception might be raised, and those are cases where you really do want the program to stop.

You can use the above in your code like this:

elog = ExceptionLogger("/dev/tty")

with elog("Can I divide by 0?"):
    1/0

for i in range(-4, 4):
    with elog("Divisor is %d" % (i,)):
        print "5/%d = %d" % (i, 5/i)

That snippet gives me the output:

Can I divide by 0?: ZeroDivisionError('integer division or modulo by zero',)
Traceback (most recent call last):
  File "exception_logger.py", line 24, in <module>
    1/0
ZeroDivisionError: integer division or modulo by zero
5/-4 = -2
5/-3 = -2
5/-2 = -3
5/-1 = -5
Divisor is 0: ZeroDivisionError('integer division or modulo by zero',)
Traceback (most recent call last):
  File "exception_logger.py", line 28, in <module>
    print "5/%d = %d" % (i, 5/i)
ZeroDivisionError: integer division or modulo by zero
5/1 = 5
5/2 = 2
5/3 = 1

I think it's also easy to see how to modify the code to log only IndexError exceptions, or even to pass in the base exception type to catch.
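
For instance, here is a sketch of that variation (the catch parameter is an addition for illustration, not part of the code above): give LogError a catch argument and test against it in __exit__:

import sys
import traceback

class LogError(object):
    def __init__(self, logfile, message, catch=Exception):
        self.logfile = logfile
        self.message = message
        self.catch = catch  # only this type (and its subclasses) is logged
    def __enter__(self):
        return self
    def __exit__(self, type, value, tb):
        if type is None or not issubclass(type, self.catch):
            return  # anything else, including KeyboardInterrupt, propagates
        self.logfile.write("%s: %r\n" % (self.message, value))
        traceback.print_exception(type, value, tb, file=self.logfile)
        return True  # swallow only the chosen exception type

# Only IndexError is logged; any other exception would still propagate.
with LogError(sys.stderr, "indexing past the end", catch=IndexError):
    [][0]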

Answered 2012-11-25T20:16:07.683

Maybe log the error on each iteration, so that an error in one iteration doesn't break your loop:

for page in pages:
    try:
        scrape_table(page)
    except Exception:
        # open the error log file for appending:
        f = open("errors.txt", "a")
        # write a message specific to this iteration (page):
        f.write("Error occurred on page %s\n" % page)
        # close the error log file:
        f.close()
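
One way to fill in more detail in that per-page message is traceback.format_exc(), which returns the current traceback as a string. A sketch (the file name comes from the snippet above):

import traceback

for page in pages:
    try:
        scrape_table(page)
    except Exception:
        with open("errors.txt", "a") as f:
            # record which page failed, plus the full traceback
            f.write("Error occurred on page %s\n%s\n" % (page, traceback.format_exc()))
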
Answered 2012-11-25T19:45:56.650

This is a good approach, but you shouldn't use a bare except clause; you should specify the type of exception to catch. You can also catch the error and continue the loop.

def scrape_all_pages(alphabet):
    try:
        pages = get_all_urls(alphabet)
    except IndexError:  # IndexError is an example
        pages = []  ## LOG THE ERROR IF THAT FAILS.

    for page in pages:
        try:
            scrape_table(page)
        except IndexError:  # IndexError is an example
            pass  ## LOG THE ERROR IF THAT FAILS and continue this loop
Answered 2012-11-25T19:46:54.290

It's better to write it like this:

try:
    pages = get_all_urls(alphabet)
except IndexError:
    pages = []  ## LOG THE ERROR IF THAT FAILS.
for page in pages:
    try:
        scrape_table(page)
    except IndexError:
        ## LOG THE ERROR HERE, then...
        continue  ## this will bring you to the next item in the for loop
Answered 2015-07-29T11:51:17.287