python - Sifting a list returned from a webscrape produced with Beautiful Soup

Question

I am coding in Python. I have been trying to scrape the names, team images, and colleges of NBA draft prospects. However, when I scrape the college names, I get both the link to the college page and the college name. How do I get only the college names? I have tried adding .string and .text to the end of anchor (anchor.string).

import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup

list = []
soup = BeautifulSoup(urllib2.urlopen(
                            'http://www.cbssports.com/nba/draft/mock-draft'
                             ).read()
                     )

rows = soup.findAll("table",
                    attrs = {'class':'data borderTop'})[0].tbody.findAll("tr")[2:]

for row in rows:
  fields = row.findAll("td")
  if len(fields) >= 3:
    anchor = row.findAll("td")[2].findAll("a")[1:]
    if anchor:
      print anchor

score 1 · Accepted Answer


Instead of just:

print anchor

use:

print anchor[0].text
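The reason this works is that anchor is a list of tags (the result of findAll(...)[1:]), so printing it shows the full markup of each tag, and .string or .text has to be applied to an element of the list rather than to the list itself. Here is a minimal sketch of that, assuming BeautifulSoup 4 (bs4) is installed. The HTML snippet, the player and college names, and the URLs are made up for illustration and only imitate the shape of one table row; the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one row of the draft table.
html = """
<table><tr>
  <td>1</td>
  <td>Some Team</td>
  <td><a href="/players/1">Player Name</a>,
      <a href="/colleges/1">Example College</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")
fields = row.find_all("td")

# Same slice as in the question: skip the first link, keep the rest.
anchors = fields[2].find_all("a")[1:]

if anchors:
    # Index into the list first, then ask the tag for its text.
    college = anchors[0].text
    print(college)
```

If a cell can hold more than one link after the slice, [a.text for a in anchors] gives the text of all of them.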

Answered 2012-06-26T14:37:34.547

score -1 · Accepted Answer

The format of an anchor in HTML is <a href='web_address'>Text-that-is-displayed</a>, so unless there is already an HTML parser library for this (I'd bet there is, I just don't know of one), you'll likely need to use regular expressions to parse out the part of the anchor that you want.
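For what it's worth, Python's standard library does include an HTML parser, so the displayed text of an anchor can be pulled out without regular expressions. The following is only a rough sketch using html.parser from Python 3 (in Python 2 the module is named HTMLParser). The class name AnchorTextParser and its attributes are made up for this example, and it does not handle nested anchors.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects an (href, text) pair for every <a> element it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []          # finished (href, text) pairs
        self._href = None        # href of the anchor being read
        self._chunks = []        # text pieces of the anchor being read
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>.
        if self._in_anchor:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            text = "".join(self._chunks).strip()
            self.links.append((self._href, text))
            self._in_anchor = False


parser = AnchorTextParser()
parser.feed("<a href='web_address'>Text-that-is-displayed</a>")
parser.close()
print(parser.links)
```

Each entry in parser.links keeps the address and the displayed text separate, so the second element of the pair is the part the question is after.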
