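I want to scrape a blog. Take www.forbes.com/sites/zillow as an example: I would like to crawl every page of it and produce the following output for each post (in a CSV, if possible):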

link = http://www.forbes.com/sites/zillow/2012/09/14/underwater-and-under-40-a-list-of-the-top-u-s-metros/
title = Underwater and Under 40: A List of the Top U.S. Metros
inlinks = # list of the links in the article
picture = # either the number of pictures or their links
wordcount = # if this is possible
Views = # the page's HTML has a span/div tag containing the number of views
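
To make the fields above concrete, here is a minimal sketch of how they might be collected from one article's HTML using only the Python 3 standard library. The class name `ArticleStats`, the sample markup, and the `views` class on the span are all assumptions for illustration; the real Forbes markup may differ, and the word count is only a rough whitespace split of the visible text.

```python
from html.parser import HTMLParser


class ArticleStats(HTMLParser):
    """Collect link targets, image sources, visible words and a views counter."""

    def __init__(self):
        super().__init__()
        self.inlinks = []      # href of every <a> tag
        self.pictures = []     # src of every <img> tag
        self.words = []        # whitespace-separated words of visible text
        self.views = None      # text of <span class="views"> (assumed markup)
        self._in_views = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.inlinks.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.pictures.append(attrs["src"])
        elif tag == "span" and "views" in (attrs.get("class") or "").split():
            self._in_views = True

    def handle_endtag(self, tag):
        # Assumes the views span has no nested spans.
        if tag == "span":
            self._in_views = False

    def handle_data(self, data):
        if self._in_views:
            if data.strip():
                self.views = data.strip()
        else:
            self.words.extend(data.split())


# Made-up sample standing in for one article's HTML.
sample_html = """
<div class="post">
  <h2>Underwater and Under 40</h2>
  <span class="views">1,234 views</span>
  <p>Some text with <a href="http://example.com/a">a link</a> and
     <a href="http://example.com/b">another</a> one.</p>
  <img src="http://example.com/chart.png" width="600">
</div>
"""

stats = ArticleStats()
stats.feed(sample_html)
print("inlinks:", stats.inlinks)
print("pictures:", len(stats.pictures))
print("wordcount:", len(stats.words))
print("views:", stats.views)
```

The same idea carries over to BeautifulSoup: collect every `a` and `img` tag inside the article container, and split the extracted text on whitespace for the word count.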

Any help would be greatly appreciated.

This is what I have so far (updated):

from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup
import datetime
import re

now = datetime.datetime.now()

# Create CSV
f = open('data.csv', 'w')

# Write the header row; the columns match the row written at the end of the loop.
f.write("date" + "," + "description" + "," + "link" + "," + "title" + "," + "img" + "\n")

# URL
url = 'http://blogs.forbes.com/zillow/'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')


events = soup.find_all('div', attrs={'class': 'post'})
for x in events:

    # Information scrape
    date = x.find('span', attrs={'class': 'views'})
    link = x.find('a')
    headline = x.find('h2')
    description = x.find('div', attrs={'class': 'entry'})
    image = x.find('img')

    # What event is being scraped
    print("Getting data for " + (headline.get_text() if headline else "(no headline)"))

    # Extract that information in strings
    date2 = str(date)
    link2 = str(link)
    headline2 = str(headline)
    image2 = str(image)
    description2 = str(description)

    # Replace commas with dashes so we don't screw up the CSV.
    date3 = date2.replace(",", " -")
    link3 = link2.replace(",", " -")
    headline3 = headline2.replace(",", " -")
    description3 = description2.replace(",", " -")
    image3 = image2.replace(",", " -")

    # Adjust the width of all images to 300 pixels.
    image4 = re.sub(r'width="\d\d\d"', 'width="300"', image3)
    image5 = image4.replace('None', "")

    # Extra formatting needed for dates to get rid of em tags and unnecessary formatting
    date4 = date3.replace('<span>', "")
    date5 = date4.replace('</span>', "")
    date6 = date5.replace('- ', "")
    date7 = date6.replace("at ", "")

    headline4 = headline3.replace('<h2 class', "")
    headline5 = headline4.replace('</h2>', "")
    headline6 = headline5.replace('- ', "")
    headline7 = headline6.replace("at ", "")

    date8 = date7.replace('[<em class="item-updated badge">Updated:', str(now.strftime("%Y-%m-%d %H:%M")))

    # Extra formatting is also needed for the description to get rid of p tags and newlines
    description4 = description3.replace('[<p>', "")
    description5 = description4.replace('</p>]', "")
    description6 = description5.replace('\n', " ")
    description7 = description6.replace('[]', "")

    link4 = link3.replace('<a href', "")
    link5 = link4.replace('</a>', "")
    link6 = link5.replace('h2', " ")
    link7 = link6.replace('=', "")

    # Write the information to the file, one column per header field.
    # The HTML code is based on coding recognized by TimelineSetter.
    f.write(date8 + "," + description7 + "," + link7 + "," + headline7 + "," + image5 + "\n")


f.close()
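
The script above replaces commas with dashes so the CSV does not break. A possible alternative, sketched here under assumptions, is to let Python's standard `csv` module handle the quoting so that commas and quotes inside a value survive untouched. The helper `write_rows`, the `FIELDS` list, and the sample row are made up for illustration (Python 3).

```python
import csv
import io

# Column order for the output file (hypothetical; adjust to the fields you scrape).
FIELDS = ["date", "title", "link", "img", "inlinks"]


def write_rows(rows, out):
    """Write a header plus one CSV line per dict in rows to the file-like out."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample row; note the commas and double quotes inside the values.
rows = [
    {
        "date": "2012-09-14",
        "title": "Underwater, and Under 40",
        "link": "http://example.com/post",
        "img": '<img src="chart.png" width="300">',
        "inlinks": "http://example.com/a, http://example.com/b",
    }
]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())

# Reading the text back returns the original values, commas included.
parsed = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```

To write a real file instead of the buffer, something like `open('data.csv', 'w', newline='', encoding='utf-8')` can be passed as `out`.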