python - Scraping web pages with Python

Question

I have a seed file containing 250 URLs for the IMDB top 250 movies.

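I need to scrape each of them and extract some information. I wrote a function that takes a movie's URL and returns the information I need. It works fine. My problem appears when I try to run this function on all 250 URLs. After successfully scraping some number of URLs (the number is not constant!), the program stops making progress. The python.exe process uses 0% CPU and its memory consumption does not change. After some debugging, I think the problem is in the parsing: it simply stops working and I don't know why (it gets stuck on a find call).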

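I use urllib2 to fetch the HTML content of each URL, then parse it as a string, and then move on to the next URL (I go over each string only once, so all checks and extractions take linear time).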
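A process that sits at 0% CPU with flat memory is often blocked waiting on the network rather than stuck in string parsing, because `urlopen` called without a `timeout` can wait indefinitely on a stalled connection. The sketch below is one way to rule that out, not a confirmed diagnosis: it downloads the whole body with a timeout before any parsing starts. `fetch_lines` and its `opener` parameter are hypothetical names introduced here; `opener` exists only so the function can be tried without network access, and UTF-8 is an assumed encoding.

```python
import io
from urllib.request import urlopen


def fetch_lines(url, timeout=10.0, opener=urlopen):
    """Download url completely and return its body as a list of text lines.

    With a timeout set, a stalled connection raises an exception
    instead of blocking forever.
    """
    with opener(url, timeout=timeout) as response:
        raw = response.read()  # read the whole body before parsing anything
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    return raw.decode("utf-8", errors="replace").splitlines()


# Offline demonstration: a fake opener stands in for the network.
def fake_opener(url, timeout):
    return io.BytesIO(b"<html>\n<body>\n</body>\n</html>\n")


demo_lines = fetch_lines("http://example.invalid/", opener=fake_opener)
print(demo_lines)
```

Because the function returns ordinary strings, the parsing code no longer needs `str(line)` on each line and never touches the socket.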

Any idea what could cause this behavior?

Edit:

I am attaching the code of one of the problematic functions (there is one more, but I'm guessing it has the same problem):

def getActors(html,actorsDictionary):

    counter = 0
    actorsLeft = 3
    actorFlag = 0
    imdbURL = "http://www.imdb.com"

    for line in html:        
        # we have 3 actors, stop
        if (actorsLeft == 0):
            break

        # current line contains actor information
        if (actorFlag == 1):
            endTag = str(line).find('/"    >')
            endTagA = str(line).find('</a>')

            if (actorsLeft == 3):
                actorList = str(line)[endTag+7:endTagA]
            else:
                actorList += "&#44; " + str(line)[endTag+7:endTagA]

            actorURL = imdbURL + str(line)[str(line).find('href=')+6:endTag]
            actorFlag = 0
            actorsLeft -= 1
            actorsDictionary[actorURL] = str(line)[endTag+7:endTagA]

        # check if next line contains actor information
        if (str(line).find('<td class="name">') > -1 ):
            actorFlag = 1

    # convert commas and clean \n
    actorList = actorList.replace(",","&#44; ")
    actorList = actorList.replace("\n","") 

    return actorList
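The function above depends on the actor link sitting on the line immediately after `<td class="name">` and on the exact spacing in `/"    >`. As a hedged alternative, the sketch below uses the standard library's `html.parser` to collect the first three links inside such cells, regardless of line breaks. `ActorParser` and the sample markup are assumptions made for illustration; the real IMDB page structure is not shown in the question and may differ.

```python
from html.parser import HTMLParser


class ActorParser(HTMLParser):
    """Collect (href, name) pairs for links inside <td class="name"> cells."""

    def __init__(self, limit=3):
        super().__init__()
        self.limit = limit
        self.actors = []            # list of (href, name) tuples
        self._in_name_cell = False  # True while inside <td class="name">
        self._href = None           # href of the link being read, if any
        self._text = []             # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and attrs.get("class") == "name":
            self._in_name_cell = True
        elif tag == "a" and self._in_name_cell and len(self.actors) < self.limit:
            self._href = attrs.get("href", "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.actors.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == "td":
            self._in_name_cell = False


# Hypothetical markup modeled on the strings the original function searches for.
SAMPLE = '''<table>
<tr><td class="name">
<a href="/name/nm0000001/"    >Actor One</a></td></tr>
<tr><td class="name">
<a href="/name/nm0000002/"    >Actor Two</a></td></tr>
</table>'''

parser = ActorParser()
parser.feed(SAMPLE)
print(parser.actors)
```

If no matching cell is found, `parser.actors` is simply an empty list, whereas the original function would reach `actorList.replace` with `actorList` never assigned.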

I call the function like this:

for url in seedFile:
    moviePage = urllib.request.urlopen(url) 
    print(getTitleAndYear(moviePage),",",movieURL,",",getPlot(moviePage),getActors(moviePage,actorsDictionary))

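Without the getActors function, this works fine.

No exception is raised here (I have removed the try and catch for now), and after some iterations it gets stuck in the for loop.

Edit 2: If I run only the getActors function, it works fine and finishes all 250 URLs in the seed file.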
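One detail that may be relevant to the calling loop: the object returned by `urlopen` is a stream, and a stream can be iterated only once. When the same `moviePage` is passed to three functions in a row, the later ones see only whatever the earlier ones left unread. The sketch below illustrates this with `io.BytesIO` standing in for the response; `count_lines` and the sample bytes are hypothetical, and this is offered as something to check rather than as the confirmed cause of the hang.

```python
import io

SAMPLE = b"<html>\n<body>\n</body>\n</html>\n"


def count_lines(lines):
    """Iterate over lines once and return how many there were."""
    return sum(1 for _ in lines)


# A stream (stand-in for the object returned by urlopen) is consumed by the first pass.
page = io.BytesIO(SAMPLE)
first_pass = count_lines(page)
second_pass = count_lines(page)  # nothing is left to read

# Reading the body once into a list lets every function see all lines.
lines = io.BytesIO(SAMPLE).read().decode("utf-8").splitlines()
list_first = count_lines(lines)
list_second = count_lines(lines)

print(first_pass, second_pass, list_first, list_second)
```

Reading the page into a list before the `print` call would let `getTitleAndYear`, `getPlot`, and `getActors` each receive the full set of lines.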
