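I'm trying to use Beautiful Soup to scrape an Atom-based RSS feed, but it's proving difficult. Capturing the data works fine until an <item> comes along that breaks the code and crashes the script. Such <item>s always contain escaped characters (Firefox highlights them in orange), such as "&lt;" or "&quot;", while the ones without them work fine. I've tried a lot of things, such as BeautifulStoneSoup, stripping the special characters with a regex, and setting the "xml" parameter, but nothing has worked, and most attempts just produce warnings about being deprecated in BS4.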


This is the page I'm trying to scrape: http://www.thestar.com/feeds.articles.news.gta.rss


import urllib2
from bs4 import BeautifulSoup

news_url = "http://www.thestar.com/feeds.articles.news.gta.rss" # Toronto Star RSS Feed

try:
    news_rss = urllib2.urlopen(news_url)
    news = news_rss.read()
    soup = BeautifulSoup(news)
except Exception:
    return "error"    # this excerpt sits inside a function, hence the return

titles = soup.findAll('title')
links = soup.findAll('link')

for link in links:
    link = link.contents    # I want the url without the <link> tags

news_stuff = []
for i, item in enumerate(titles):
    if item.text == "TORONTO STAR | NEWS | GTA":    # These have <title> tags and I don't want them; just skip 'em.
        continue
    news_stuff.append((item.text, links[i]))    # Here's a news story.  Grab it.

for thing in news_stuff:
    print '<a href="' 
    print thing[1]
    print '"target="_blank">' 
    print thing[0]
    print '</a><br/>'

2 Answers



UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 54: ordinal not in range(128)


for thing in news_stuff:
    print '<a href="' 
    print thing[1]
    print '"target="_blank">' 
    print thing[0].encode("utf-8")
    print '</a><br/>'
    i += 1
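
The traceback and the `.encode("utf-8")` fix above can be reproduced without the network. The following is a minimal sketch, assuming Python 3 (the code in this thread is Python 2, where `print` of a unicode object to an ASCII stream triggers the same implicit encode). The title is one of the feed titles quoted in this thread; `try_ascii` is a hypothetical helper name introduced only for illustration.

```python
# Title copied from the feed output quoted in this thread.
# It contains U+2018 and U+2019 (curly single quotes).
title = "In Texas, Toronto music leaders urge city hall to say \u2018yes\u2019"


def try_ascii(text):
    """Return the UnicodeEncodeError raised by an ASCII encode, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


# Encoding to ASCII fails at the first curly quote.
error = try_ascii(title)
print(error)

# Encoding to UTF-8 succeeds and round-trips without loss.
utf8_bytes = title.encode("utf-8")
print(utf8_bytes.decode("utf-8") == title)
```

In this sketch `error.start` is the index of the first character that ASCII cannot represent, which for this title matches the "position 54" reported in the traceback.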


answered 2013-10-07T18:07:02.937


import urllib2
from bs4 import BeautifulSoup as Soup

news_url = "http://www.thestar.com/feeds.articles.news.gta.rss" # Toronto Star RSS Feed

news_rss = urllib2.urlopen(news_url)
news = news_rss.read()
soup = Soup(news)

titles = soup.findAll('title')
links = soup.findAll('link')

for link in links:
    link = link.contents    # I want the url without the <link> tags

news_stuff = []
for i, item in enumerate(titles):
    if item.text == "TORONTO STAR | NEWS | GTA":    # These have <title> tags and I don't want them; just skip 'em.
        continue
    news_stuff.append((item.text, links[i]))    # Here's a news story.  Grab it.

for thing in news_stuff:
    print '<a href="' 
    print thing[1]
    print '"target="_blank">' 
    print thing[0]
    print '</a><br/>'


<a href="
TTC argues for return of special constables
<a href="
Health information of 18,000 people stolen in Peel Region
<a href="
Fire closes Bathurst St. south of Dupont
<a href="
Empty tanker train cars derail in Brampton
<a href="
Medical illustration studios flourish in Toronto
<a href="
In Texas, Toronto music leaders urge city hall to say ‘yes’
<a href="
Making sense of the Sammy Yatim shooting: Fiorito
<a href="
Toronto’s chief planner, Jennifer Keesmaat, challenges Mirvish/Gehry scheme: Hume
<a href="
Westbound Gardiner lanes reopen after rollover near Spadina
<a href="
Daycare Crisis: Halton health complaints show gaps in unlicensed care
<a href="
Witness describes shooting details as man confronted police near van
<a href="
Muslim AIDS activist honoured for taboo-busting work
<a href="
Death to death with dignity: DiManno
<a href="
Rockers join forces in Line 9 protest
<a href="
Could you eat 10 pizzas in 12 minutes? This guy did
<a href="
Former participants speak up about gay healing program
<a href="
Freed Canadians Tarek Loubani and John Greyson awaiting papers to come home from Egypt
<a href="
Man dies after crash at Finch and Dufferin
<a href="
Nuit Blanche lights up Toronto Saturday night
<a href="
Leafs fans celebrate home opener at Maple Leaf Square
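
The code in this thread collects every <title> and every <link> separately and pairs them by position. A possible alternative is to read both from the same <item>, so feed-level tags never need to be skipped. The sketch below is not Beautiful Soup: it swaps in the standard-library `xml.etree.ElementTree` parser and Python 3 so that it runs without extra packages. `SAMPLE_RSS`, the example.com URLs, `extract_stories`, and `render_link` are all hypothetical names and data made up for illustration; the real feed's layout is assumed to follow plain RSS 2.0.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real feed (assumed RSS 2.0 layout).
# The second item uses the escaped entities mentioned in the question.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>TORONTO STAR | NEWS | GTA</title>
    <link>http://example.com/</link>
    <item>
      <title>Leaders urge city hall to say \u2018yes\u2019</title>
      <link>http://example.com/story-1</link>
    </item>
    <item>
      <title>Tom &amp; Jerry say &quot;hi&quot; &lt;again&gt;</title>
      <link>http://example.com/story-2</link>
    </item>
  </channel>
</rss>"""


def extract_stories(rss_text):
    """Return one (title, link) pair per <item>; channel-level tags are ignored."""
    root = ET.fromstring(rss_text)
    stories = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        stories.append((title, link))
    return stories


def render_link(title, link):
    """Build an anchor tag, re-escaping the text for safe HTML output."""
    return '<a href="{}" target="_blank">{}</a><br/>'.format(
        html.escape(link, quote=True), html.escape(title)
    )


stories = extract_stories(SAMPLE_RSS)
for story_title, story_link in stories:
    print(render_link(story_title, story_link))
```

The XML parser decodes entities such as `&lt;` and `&quot;` while reading, so the titles come back as ordinary text; `html.escape` then re-escapes them when they are written into the generated HTML.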
answered 2013-10-07T19:47:53.477