I can't understand how to deal with more than one URL. This is what I've tried so far, but it's only scraping the last URL from the list:

from twill.commands import *
from bs4 import BeautifulSoup
from urllib import urlopen

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        line = url 

site = urlopen(url)   

soup = BeautifulSoup(site)

for td in soup.find_all('td', {'class': 'subjectCell'}):
    print td.find('a').text

3 Answers


This code should be inside the for loop:

site = urlopen(url)   

soup = BeautifulSoup(site)

for td in soup.find_all('td', {'class': 'subjectCell'}):
    print td.find('a').text

Then it will run once for each url.

answered 2012-11-08T02:20:18.380

If you want to loop over all of the URLs, you have to put the code that processes each URL into the loop. But you haven't done that. All you have is:

for url in urls:
    line = url

This just reassigns the variables url and line over and over, finally leaving them both pointing to the last URL. Then, when you call site = urlopen(url) outside the loop, it works only on that last URL.
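To see the reassignment in isolation (with made-up stand-in URLs, no network access needed):

```python
# Hypothetical stand-in URLs, just to illustrate the reassignment:
urls = ["http://a.example", "http://b.example", "http://c.example"]
for url in urls:
    line = url

# After the loop, both names hold only the last URL:
print(url)   # http://c.example
print(line)  # http://c.example
```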

Try this:

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)   
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text
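As an aside (not part of the original answer), the same td/a extraction can be sketched with only the standard library, in case BeautifulSoup is unavailable; the class name subjectCell and the sample HTML below are just illustrative:

```python
from html.parser import HTMLParser

class SubjectCellParser(HTMLParser):
    """Collect the text of <a> tags inside <td class="subjectCell">."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False   # currently inside a matching <td>?
        self.in_link = False   # currently inside an <a> within that <td>?
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and attrs.get("class") == "subjectCell":
            self.in_cell = True
        elif tag == "a" and self.in_cell:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.subjects.append(data)

# Illustrative markup: only the first cell has the target class.
html = ('<table><tr><td class="subjectCell"><a href="/s/1">Math</a></td>'
        '<td class="other"><a href="/s/2">Skip</a></td></tr></table>')
parser = SubjectCellParser()
parser.feed(html)
print(parser.subjects)  # ['Math']
```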
answered 2012-11-08T02:11:28.363

You need to put everything you want to do with each url into the for loop:

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)   
        soup = BeautifulSoup(site)

        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text
answered 2012-11-08T02:12:09.617