
This is my first Python project, so it is very basic and rudimentary. I often have to clean viruses off friends' computers, and the free programs I use are updated frequently. Instead of manually downloading each program, I am trying to create a simple way to automate the process. Since I am also trying to learn Python, I figured it would be a good opportunity to practice.

The problem:

I have to find the .exe files from a handful of links. I can find the correct URLs, but I get an error when I try to download them.

Is there a way to add all the links to a list, and then create a function that walks the list and runs on each URL? I've googled quite a bit, but I can't seem to make it work. Maybe I'm not thinking in the right direction?

import urllib, urllib2, re, os
from BeautifulSoup import BeautifulSoup

# Website List
sas = 'http://cdn.superantispyware.com/SUPERAntiSpyware.exe'
tds = 'http://support.kaspersky.com/downloads/utils/tdsskiller.exe'
mbam = 'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1'
tr = 'http://www.simplysup.com/tremover/download.html'
urllist = [sas, tds, mbam, tr]
urllist2 = []

# Find exe files to download

match = re.compile('\.exe')
data = urllib2.urlopen(urllist)
page = BeautifulSoup(data)

# Check links
#def findexe():
for link in page.findAll('a'):
    try:
        href = link['href']
        if re.search(match, href):
            urllist2.append(href)

    except KeyError:
        pass

os.chdir(r"C:\_VirusFixes")
urllib.urlretrieve(urllist2, os.path.basename(urllist2))

As you can see, I've commented out the function because I can't get it to work.

Should I give up on the list and just download each file individually? I was trying to be efficient.

Any advice, or a pointer in the right direction, would be greatly appreciated.


3 Answers


In addition to mikez302's answer, here is a more readable way to write your code:

import os
import re
import urllib
import urllib2

from BeautifulSoup import BeautifulSoup

websites = [
    'http://cdn.superantispyware.com/SUPERAntiSpyware.exe',
    'http://support.kaspersky.com/downloads/utils/tdsskiller.exe',
    'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1',
    'http://www.simplysup.com/tremover/download.html',
]

download_links = []

for url in websites:
    connection = urllib2.urlopen(url)
    soup = BeautifulSoup(connection)
    connection.close()

    for link in soup.findAll('a', href=re.compile(r'\.exe$')):
        download_links.append(link['href'])

for url in download_links:
    urllib.urlretrieve(url, os.path.join(r'C:\_VirusFixes', os.path.basename(url)))
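The link-filtering step in the loop above can also be pulled out into a small standalone helper, which is easy to test without touching the network. A minimal sketch (the function name exe_links is my own, not from the code above):

```python
import posixpath

def exe_links(hrefs):
    # Keep only the links whose file name ends in .exe (case-insensitive).
    # posixpath.basename works on URL paths regardless of the local OS.
    return [h for h in hrefs
            if posixpath.basename(h).lower().endswith('.exe')]

print(exe_links([
    'http://cdn.superantispyware.com/SUPERAntiSpyware.exe',
    'http://www.simplysup.com/tremover/download.html',
]))
# prints ['http://cdn.superantispyware.com/SUPERAntiSpyware.exe']
```

Separating the filter from the download loop also makes it easier to print and sanity-check the URL list before fetching anything.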
Answered 2012-11-15T00:57:28.800

urllib2.urlopen is a function for fetching a single URL. If you want to fetch several, you should loop over the list. You should do something like this:

for url in urllist:
    data = urllib2.urlopen(url)
    page = BeautifulSoup(data)

    # Check links
    for link in page.findAll('a'):
        try:
            href = link['href']
            if re.search(match, href):
                urllist2.append(href)

        except KeyError:
            pass

os.chdir(r"C:\_VirusFixes")
for download in urllist2:
    urllib.urlretrieve(download, os.path.basename(download))
Answered 2012-11-15T00:52:12.450

The code above didn't work for me; in my case it was because the pages assemble their links with scripts instead of including them in the HTML. When I ran into that problem I used the following code, which is just a scraper:

import os
import re
import urllib
import urllib2

from bs4 import BeautifulSoup

url = ''

connection = urllib2.urlopen(url)
soup = BeautifulSoup(connection) #Everything the same up to here 
regex = r'(.+?)\.zip'     # Here we insert the pattern we are looking for (\. keeps the dot literal)
pattern = re.compile(regex)
link = re.findall(pattern, str(soup)) # This finds all the .zip (.exe) in the text
# findall usually comes back with a lot of undesirable text around each match;
# luckily the file name is almost always separated by a space from the rest of
# the text, which is why we split and keep the last piece
link = [i.split(' ')[-1] for i in link]

os.chdir(r"F:\Documents")
# This is the filepath where I want to save everything I download

for i in link:
    # The text we found doesn't include the .zip (or .exe in your case),
    # so we re-append it, both to the link we fetch and to the local filename.
    urllib.urlretrieve(i + ".zip", filename=os.path.basename(i) + ".zip")

This isn't as efficient as the code in the previous answers, but it will work for almost any site.
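The split-and-reattach trick described above can be sketched as a pure string-handling function (the name zip_links and the sample HTML are my own, for illustration):

```python
import re

def zip_links(html_text):
    # Find every ".zip" target in the raw HTML. The group captures the text
    # before ".zip"; we keep only its last space-separated token, then
    # reattach the ".zip" suffix that findall stripped off.
    matches = re.findall(r'([^"\s]+?)\.zip', html_text)
    return [m.split(' ')[-1] + '.zip' for m in matches]

print(zip_links('<a href="http://example.com/files/tool.zip">tool</a>'))
# prints ['http://example.com/files/tool.zip']
```

Because the regex works on the raw page text rather than parsed anchor tags, it picks up links even when they only appear inside inline scripts.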

Answered 2013-08-24T00:36:22.387