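I am currently trying to scrape Good.is. The code so far gives me the regular image (when the if statement is changed to True), but I want the higher-resolution picture. I would like to know how to replace part of a string so that I can download the high-resolution image. I want to change the URL http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html to http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html (only the ending differs). My code is: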

import os, urllib, urllib2
from BeautifulSoup import BeautifulSoup
import HTMLParser

parser = HTMLParser.HTMLParser()

# make folder.
folderName = 'Good.is'
if not os.path.exists(folderName):
  os.makedirs(folderName)


list = [] 
# Python ranges start from the first argument and iterate up to one
# less than the second argument, so we need 36 + 1 = 37
for i in range(1, 37):
    list.append("http://www.good.is/infographics/page:" + str(i) + "/sort:recent/range:all")


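# One index per page URL built above (36 URLs, so indices 0 through 35);
# range(0, 37) would raise an IndexError on the last iteration
listIterator1 = range(0, 36)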
counter = 0


for x in listIterator1:


    soup = BeautifulSoup(urllib2.urlopen(list[x]).read())

    body = soup.findAll("ul", attrs = {'id': 'gallery_list_elements'})

    number = len(body[0].findAll("p"))
    listIterator = []
    listIterator[:] = range(0,number)        

    for i in listIterator:
        paragraphs = body[0].findAll("p")
        nextArticle = body[0].findAll("a")[2]
        text = body[0].findAll("p")[i]

        if len(paragraphs) > 0:
            #print image['src']
            counter += 1
            print counter
            print parser.unescape(text.getText())
            print "http://www.good.is" + nextArticle['href']
            originalArticle = "http://www.good.is" + nextArticle['href']
            article = BeautifulSoup(urllib2.urlopen(originalArticle).read())
            title = article.findAll("div", attrs = {'class': 'title_and_image'})
            getTitle = title[0].findAll("h1") 
            article1 = article.findAll("div", attrs = {'class': 'body'})
            articleImage = article1[0].find("p")
            betterImage = articleImage.find("a")
            articleImage1 = articleImage.find("img")
            paragraphsWithinSection = article1[0].findAll("p")
            print betterImage['href']
            if len(paragraphsWithinSection) > 1:
                articleText = article1[0].findAll("p")[1]
            else:
                articleText = article1[0].findAll("p")[0]
            print articleImage1['src']
            print parser.unescape(getTitle)
            if not articleText is None:
                print parser.unescape(articleText.getText())
            print '\n'
            link = articleImage1['src']
            x += 1


            actually_download = False
            if actually_download:
                filename = link.split('/')[-1]
                urllib.urlretrieve(link, filename)

4 Answers

Take a look at str.replace. If that isn't enough to get the job done, you'll need to use regular expressions (the re module, probably re.sub).

>>> str1="http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html"
>>> str1.replace("flash","flat")
'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html'
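
To tie this back to the scraper in the question, here is a minimal sketch of applying the replacement to a list of links. The helper name `flash_to_flat` and the `hrefs` list are made up for illustration; in the question's code the value would presumably come from `betterImage['href']`. It is written so that it also runs on Python 3.

```python
def flash_to_flat(href):
    # Replace the whole file name rather than just "flash", so that a
    # "flash" elsewhere in the URL is less likely to be touched.
    return href.replace("flash.html", "flat.html")


# Sample values: both URLs are the ones quoted in the question.
hrefs = [
    "http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html",
    "http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flat.html",
]

flat_links = [flash_to_flat(h) for h in hrefs]
for link in flat_links:
    print(link)
```

A link that is already in the flat.html form passes through unchanged, so it should be safe to run every scraped link through the helper.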
answered 2012-08-03T16:31:06.430

I think the safest and simplest approach is to use a regular expression:

import re
url = 'http://www.google.com/this/is/sample/url/flash.html'
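# Raw string so the backslash reaches the regex engine unchanged
newUrl = re.sub(r'flash\.html$', 'flat.html', url)

The "$" means the pattern matches only at the end of the string. This solution behaves correctly even in the (admittedly unlikely) case where your URL contains the substring "flash.html" somewhere other than at the end, and it leaves the string unchanged if it does not end with "flash.html" (which I think is the correct behavior).

See: http://docs.python.org/library/re.html#re.sub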
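
The three cases described above can be checked with a short sketch. The helper name `swap_suffix` and the example.com URLs are hypothetical and exist only to exercise the pattern; the first URL is the one from the question.

```python
import re

# Anchored at the end of the string; the dot is escaped so it only
# matches a literal "."
FLASH_SUFFIX = re.compile(r"flash\.html$")


def swap_suffix(url):
    # sub() returns the input unchanged when the pattern does not match
    return FLASH_SUFFIX.sub("flat.html", url)


ends_with = "http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html"
in_middle = "http://example.com/flash.html/archive/flash.html"  # hypothetical
no_match = "http://example.com/index.html"                      # hypothetical

for candidate in (ends_with, in_middle, no_match):
    print(swap_suffix(candidate))
```

Only the trailing "flash.html" is rewritten in the second URL, and the third one comes back as it went in.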

answered 2012-08-03T17:08:02.727

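@mgilson has a good solution, but the problem is that it replaces every occurrence of the string; so if the word "flash" appears elsewhere in the URL (not just in the trailing file name), you will get multiple replacements: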

>>> str = 'hello there hello'
>>> str.replace('hello','world')
'world there world' 

Another solution is to replace the last path segment with flat.html:

>>> url = 'http://www.google.com/this/is/sample/url/flash.html'
>>> url[:url.rfind('/')+1]+'flat.html'
'http://www.google.com/this/is/sample/url/flat.html'
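
The same idea can be wrapped in a small function. This is only a sketch: the name `replace_last_segment` is made up, and it uses `str.rpartition` instead of `rfind` plus slicing, which should be equivalent for URLs that contain at least one "/". The example.com URL is hypothetical.

```python
def replace_last_segment(url, new_name):
    # rpartition splits at the LAST "/" and returns (head, separator, tail)
    head, sep, _old_name = url.rpartition("/")
    return head + sep + new_name


sample = "http://www.google.com/this/is/sample/url/flash.html"
print(replace_last_segment(sample, "flat.html"))

# A "flash" directory earlier in the path is left alone (hypothetical URL)
tricky = "http://example.com/flash/games/flash.html"
print(replace_last_segment(tricky, "flat.html"))
```

Note that this swaps whatever the last segment happens to be, without checking that it was flash.html in the first place.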
answered 2012-08-03T16:45:44.030

With urlparse you can do something along these lines:

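from urlparse import urlsplit, urlunsplit

s = 'http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html'

url = urlsplit(s)
# Split the path into its directory part and the trailing file name
head, tail = url.path.rsplit('/', 1)
# Join explicitly: urljoin(head, 'flat.html') would drop the last directory
# because head has no trailing slash
new_path = head + '/flat.html'
print urlunsplit(url._replace(path=new_path))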
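
One reason to go through a URL parser is that the query string is kept separate from the path. Below is a sketch of that, written to import on both Python 2 and Python 3. The helper name `swap_filename` and the `?ref=home/page` query are hypothetical, and using `posixpath` for the path handling is a choice made for this example.

```python
import posixpath

try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit      # Python 2


def swap_filename(url, new_name):
    parts = urlsplit(url)
    # Work on the path component only, so a "/" inside the query string
    # cannot be mistaken for a path separator.
    directory = posixpath.dirname(parts.path)
    new_path = posixpath.join(directory, new_name)
    return urlunsplit(parts._replace(path=new_path))


base = "http://awesome.good.is/transparency/web/1207/invasion-of-the-drones/flash.html"
print(swap_filename(base, "flat.html"))

# Hypothetical query string containing a slash
print(swap_filename(base + "?ref=home/page", "flat.html"))
```

A plain `rfind('/')` on the whole string would cut inside the query in the second case, whereas here the query is reattached untouched.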
answered 2012-08-03T17:07:22.613