python - Downloading files with a complex name structure over HTTP

Question

When I try to download a file with this code:

    import urllib
    urllib.urlretrieve("http://e4ftl01.cr.usgs.gov/MOLT/MOD11A1.005/2012.07.11/MOD11A1.A2012193.h22v10.005.2012196013617.hdf","1.hdf")

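the file downloads correctly.

However, my goal is to build a function that downloads files based on inputs that form part of the file name.

The web page lists many files. Some parts of each file name are always the same (e.g. "/MOLT/MOD11A1.005/"), so they are not a problem. Other parts change from file to file according to well-defined rules (e.g. "h22v10"); I have handled these with %s (e.g. h%sv%s), so they are not a problem either. The problem is that some parts of the name change without any rule (e.g. "2012196013617"). These parts are irrelevant and I want to ignore them. So I want to download the file whose name contains the first two kinds of parts (the constant ones and the rule-based ones) plus anything else.

I thought I could use a wildcard for the "anything else" parts, so I tried this: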

  import urllib

  def download(url,date,h,v):
      urllib.urlretrieve("%s/MOLT/MOD11A1.005/%s/MOD11A1.*.h%sv%s.005.*.hdf" %
        (url, date, h, v), "2.hdf")

  download("http://e4ftl01.cr.usgs.gov", "2012.07.11", "22", "10")

This does not download the requested file. Instead it produces an error file containing:

 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
 <html>
   <head>
     <title>404 Not Found</title>
   </head>
   <body>
     <h1>Not Found</h1>
     <p>The requested URL /MOLT/MOD11A1.005/2012.07.11/MOD11A1\*\h22v10.005\*\.hdf was not found on this server.</p>
   </body>
 </html>

It seems that wildcards do not work with HTTP. Do you know how to solve this?

score 2 · Accepted Answer

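The problem is that some parts of the name change without any rule (e.g. "2012196013617"). These parts of the name are irrelevant and I want to ignore them.

That is not possible. HTTP URLs do not support wildcards. You must supply the URL of an existing resource.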
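Since the server only answers requests for exact URLs, any wildcard matching has to happen on the client, against a list of file names you already have. Here is a minimal sketch in Python 3 using the standard `fnmatch` module. It assumes the names have been obtained some other way; the helper `match_names` and the second name in the list are made up for illustration.

```python
from fnmatch import fnmatchcase


def match_names(names, pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# The first name comes from the question; the second is hypothetical.
names = [
    "MOD11A1.A2012193.h22v10.005.2012196013617.hdf",
    "MOD11A1.A2012193.h21v09.005.2012196013000.hdf",
]

# Same template as in the question, filled in with h=22, v=10.
pattern = "MOD11A1.*.h%sv%s.005.*.hdf" % ("22", "10")
matches = match_names(names, pattern)
print(matches)
```

Once a name matches, it can be appended to the directory URL to form an exact URL that the server can actually serve.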

score 0 · Accepted Answer

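Here is one solution. It assumes that PartialName is a string containing the first part of the file name (as much of it as is known and constant), that URLtoSearch is the URL of the page where the file can be found (also a string), and that FileExtension is a string of the form ".ext", ".mp3", ".zip", etc.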

import urllib2

def findURLFile(PartialName, URLtoSearch, FileExtension):
    sourceURL = urllib2.urlopen(URLtoSearch)
    readURL = sourceURL.read()
    #release the connection once the page has been read
    sourceURL.close()

    #find the first instance of PartialName and get the index
    #of its first character in the page (an integer, -1 if absent)
    fileIndexStart = readURL.find(PartialName)
    if fileIndexStart == -1:
        return None

    #find the first instance of the file extension after the first
    #instance of PartialName (-1 if absent)
    extensionIndex = readURL.find(FileExtension, fileIndexStart)
    if extensionIndex == -1:
        return None

    #get the filename, including the extension whatever its length
    fileName = readURL[fileIndexStart:extensionIndex + len(FileExtension)]

    #output the URL to download the file from
    #(URLtoSearch is expected to end with "/")
    downloadURL = URLtoSearch + fileName
    return downloadURL

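I am fairly new to writing Python, and this could probably benefit from some exception handling and perhaps a while loop. It does what I need, but I may improve the code and make it more elegant.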
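The same idea can be sketched in Python 3 with an HTML parser instead of raw string searching, so that every link on the page is examined and near-misses such as a ".hdf.xml" companion file are rejected. This is only a sketch under assumptions: the server must publish an HTML index page for the directory, and `LinkCollector`, `find_file_urls` and `fetch_listing` are names invented here. It uses `urllib.request` rather than the Python 2 `urllib2` module.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_file_urls(listing_html, base_url, pattern):
    """Return absolute URLs of links whose file name matches the pattern."""
    parser = LinkCollector()
    parser.feed(listing_html)
    return [
        urljoin(base_url, link)
        for link in parser.links
        if fnmatchcase(link.rsplit("/", 1)[-1], pattern)
    ]


def fetch_listing(url):
    """Download the index page (assumes the server exposes one as HTML)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline demonstration with a hand-written listing; against a real server
# you would use: listing = fetch_listing(base_url)
base_url = "http://e4ftl01.cr.usgs.gov/MOLT/MOD11A1.005/2012.07.11/"
listing = (
    '<a href="MOD11A1.A2012193.h22v10.005.2012196013617.hdf">data</a>'
    '<a href="MOD11A1.A2012193.h22v10.005.2012196013617.hdf.xml">meta</a>'
)
urls = find_file_urls(listing, base_url, "MOD11A1.*.h22v10.005.*.hdf")
print(urls)
```

Each URL returned this way is an exact address, so it can be passed directly to a download call such as `urllib.request.urlretrieve`.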
