python - Python scraping

Question

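I am trying to get the name, address and phone number of restaurants.

My code keeps getting stuck in the second function definition. The first def works fine. I don't know why, because I can't spot any error. The loop just never gets through.

If I have made an obvious mistake, I would appreciate it if someone pointed it out.

Thanks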

from urllib2 import urlopen
from csv import writer

def get_urls_of_restaurant():
    list_urls = []
    n = 0
    nn = 0
    for i in range(6):
        url = urlopen('http://www.go.co.tz/index.php/restaurants/masaki?start=' + str(nn)).readlines()  # open the URL which lists restaurants
        while n < len(url):
            if '<h2 class="contentheading">' in url[n]:
                list_urls.append(url[n+1].split('"')[1])
            n += 1
        n = 0
        nn += 3
    list_urls.reverse()
    print "Geting urls done! Get %s" %len(list_urls) + ' urls.'
    return list_urls

def open_url_and_write_data(list_urls):
    n = len(list_urls)-1
    csv_file = open('restdar_guide.csv', 'wb')
    file_writer = writer(csv_file, delimiter=';')
    file_writer.writerow(['Name'] + ['address'] + ['phone'])
    while n >= 0:
        print 'Reading %s' % str(int(len(list_urls))-n) + " element of %s" % len(list_urls) + " element's..."
        url = urlopen('http://www.go.co.tz' + list_urls[n]).readlines()
        num_str = 0
        list_write = []
        while num_str < len(url):
            if '<title>' in url[num_str]:
                list_write.append(url[num_str].split('<')[0][7:])
            if 'Location:</strong>' in url[num_str]:
                list_write.append(url[num_str].split('<')[1][9:])
            else:
                list_write.append('unknown')
            if '<li><strong>Tel:</strong>' in url[num_str]:
                list_write.append(url[num_str].split('<')[2][10:])
            else:
                list_write.append('unknown')
            file_writer.writerow([list_write[0]] + [list_write[1]] + [list_write[2]])
        n -= 1
    csv_file.close()
    print 'Done!'

list_urls = get_urls_of_restaurant()
open_url_and_write_data(list_urls)

score 4 · Accepted Answer

BeautifulSoup might make your life a little easier.

Answered 2012-08-28T08:21:37.953

score 2 · Accepted Answer

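Well, if you abort the program, all you will get is a KeyboardInterrupt error. Depending on timing, the error could show up on any line inside that while loop: whichever instruction happened to be executing when you finally gave up and aborted.

Your program goes into a non-terminating loop because of the following: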

num_str = 0
...
while num_str < len(url):

You never change the value of num_str, so this is equivalent to while True:, for any value of len(url) greater than 0. This is, btw, a great place for a for loop.
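As a rough illustration of the for loop suggestion, here is a minimal sketch. The helper name `extract_fields` and the sample lines are made up for this example, and the string splitting is an assumption about how the page lines look, not the asker's exact slicing. Each field starts as 'unknown' and is only overwritten when its marker is found, so one row per page can be written after the loop ends.

```python
def extract_fields(lines):
    """Return [name, address, phone] from the lines of one page.

    Hypothetical helper: the markers come from the question, but the
    splitting below assumes each value sits on the same line as its marker.
    """
    name = address = phone = 'unknown'
    for line in lines:  # the for loop advances by itself, so it always terminates
        if '<title>' in line:
            name = line.split('<title>')[1].split('</title>')[0].strip()
        if 'Location:</strong>' in line:
            address = line.split('Location:</strong>')[1].split('<')[0].strip()
        if 'Tel:</strong>' in line:
            phone = line.split('Tel:</strong>')[1].split('<')[0].strip()
    return [name, address, phone]


# Made-up lines standing in for the result of urlopen(...).readlines()
sample = [
    '<title>Example Grill</title>\n',
    '<li><strong>Location:</strong> Masaki</li>\n',
    '<li><strong>Tel:</strong> 0000 000 000</li>\n',
]
print(extract_fields(sample))
```

With a helper like this, the outer loop would call it once per restaurant page and pass the returned list to `file_writer.writerow`.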

That said, as others have noted, this is very much a non-optimal way to do HTML parsing / web scraping. There are a number of scraping utilities and HTML parsers available, and I think you would be better off using one of them.
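To give a feel for what a real parser buys you, here is a small sketch using the standard library's `html.parser` (Python 3; in Python 2 the module is called `HTMLParser`). This is not BeautifulSoup, just the simplest parser that needs no install. The class name `TitleParser` and the sample HTML are invented for illustration, and the actual restaurant pages may be structured differently.

```python
from html.parser import HTMLParser  # Python 3 module name


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>
        if self.in_title:
            self.title += data


# Made-up page content, standing in for a downloaded restaurant page
sample_html = ('<html><head><title>Example Grill</title></head>'
               '<body><p>Hi</p></body></html>')
parser = TitleParser()
parser.feed(sample_html)
print(parser.title.strip())
```

The parser tracks tags for you, so the code no longer depends on character offsets or on where line breaks fall in the HTML.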

score 1 · Accepted Answer

The indentation of n = len(list_urls)-1 seems to be too far in; try aligning it with the next line.

Answered 2012-08-28T08:15:18.340
