python - Recursively search for files on a website using directory browsing

Question

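Is there a way to determine whether a file or directory exists on a web server via HTTP directory browsing? I have a site with many files and directories. I want to traverse the directories and find a given file that could be located anywhere in the subdirectories. On a file system we would normally use os.path.isfile("file_name"), but that does not work for directory browsing over HTTP. How can we do this?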

score 3 · Accepted Answer

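Doing this on the web is not as simple as on a file system. For one thing, the folder listing differs depending on which web server produces it, so you have to know the format of the listing. For example, one pattern I've noticed on most Linux/Apache servers is that folder links end with a slash "/" and file links don't. The parent folder link starts with a slash, while subfolder links don't... and so on.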
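The pattern described above can be sketched as a small helper that labels a single link from a listing page. The name `classify_href` is hypothetical, and skipping query-only links (such as Apache's column-sort links), `../` links, and absolute URLs is an added assumption rather than part of the original answer.

```python
from urllib.parse import urlsplit

def classify_href(href):
    """Return 'skip', 'dir', or 'file' for one link in an Apache-style listing."""
    if not href:
        return 'skip'  # anchor without a target
    parts = urlsplit(href)
    # Absolute URLs, query-only sort links like '?C=N;O=D', and fragments
    # are not entries of the current folder (assumption, see lead-in).
    if parts.scheme or parts.netloc or parts.query or parts.fragment or not parts.path:
        return 'skip'
    if href.startswith('/') or href.startswith('../'):
        return 'skip'  # most likely the parent folder
    if href.endswith('/'):
        return 'dir'   # subfolder: ends with a slash
    return 'file'      # everything else is treated as a file

for link in ['?C=N;O=D', '/parent/', 'sub/', 'xyz.txt']:
    print(link, '->', classify_href(link))
```

Other web servers may format their listings differently, so the rules would need adjusting per server.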

The code below is only an example (it does work) that should get you started in the right direction. To run it, you must install BeautifulSoup.

import urllib.request
from bs4 import BeautifulSoup

def RecurseLinks(base):
    # Download and parse the directory listing page
    with urllib.request.urlopen(base) as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    for anchor in soup.find_all('a'):
        href = anchor.get('href')
        if not href or href.startswith('?'):
            continue  # no target, or an Apache column-sort link such as "?C=N;O=D"
        if href.startswith('/'):
            print('skip, most likely the parent folder -> ' + href)
        elif href.endswith('/'):
            print('crawl -> [' + base + href + ']')
            RecurseLinks(base + href)  # make recursive call w/ the new base folder
        else:
            print('some file, check if xyz.txt -> ' + href)  # save it to a list or return

# call the initial root web folder
RecurseLinks('http://somesite-xyx.com.com/directory-browsing/')
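To actually answer "does this file exist somewhere below this folder", the same crawl can collect matches instead of printing them. The following is a minimal sketch under a few assumptions: it uses the standard library's `html.parser` in place of BeautifulSoup so that it is self-contained, the names `find_file`, `fetch_html`, and `_AnchorCollector` are hypothetical, and the listing is assumed to follow the Apache-style pattern described in the answer.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, unquote

class _AnchorCollector(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)

def fetch_html(url):
    # Default fetcher: download the listing page as text.
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')

def find_file(base, target, fetch=fetch_html, seen=None):
    """Return the URLs under `base` whose file name equals `target`."""
    seen = set() if seen is None else seen
    if base in seen:  # guard against revisiting the same folder
        return []
    seen.add(base)
    collector = _AnchorCollector()
    collector.feed(fetch(base))
    matches = []
    for href in collector.hrefs:
        # Skip parent folder, sort links, fragments, and external links.
        if href.startswith(('/', '?', '#', '../')) or '://' in href:
            continue
        url = urljoin(base, href)
        if href.endswith('/'):
            matches.extend(find_file(url, target, fetch, seen))  # descend
        elif unquote(href) == target:
            matches.append(url)
    return matches
```

Calling `find_file('http://somesite-xyx.com.com/directory-browsing/', 'xyz.txt')` would return a list of matching URLs, which is empty when the file is not found. The `fetch` parameter exists only so the crawl can be tried against canned pages without a network connection.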
