
I am trying to write a function like the following:

def get_urls(*urls, restrictions=None):
    # Here there should be some code that iterates through the urls
    # and creates a dictionary where the keys are the respective urls
    # and their values are lists of the possible extensions. The
    # function should return that dictionary.

First, an explanation. Say I have a site, www.example.com, and it has only the following pages: www.example.com/faq, www.example.com/history, and www.example.com/page/2. This is how it would be used:

In[1]: site = 'http://example.com'
In[2]: get_urls(site)
Out[2]: {'http://example.com':['/faq','/history','/page/2']}

I have spent hours researching this, and so far it seems to be impossible! So am I missing some module that can do this? Is there something that exists, just not in Python? If so, what language is it in?

Now you may be wondering why there is a restrictions=None parameter. Here is why:

I want to be able to add restrictions on which URLs are acceptable. For example, restrictions='first' could make it only handle paths with a single '/'. Here is an example:

In[3]: get_urls(site,restrictions='first')
Out[3]: {'http://example.com':['/faq','/history']}

I don't need to keep explaining the idea behind restrictions, but you can see why it's necessary! Some sites, especially social networks, have crazy picture add-ons, and it is important to filter those out while keeping the original page that contains all the photos.

So yes, I have absolutely no code, but that is because I have no idea how to do this! I think I have made clear what I need to be able to do, so: is this possible? If so, how? If not, why not?

Edit:

So, after some answers and comments, here is some more information. I want to take a URL, not necessarily a domain, and return a dictionary with the original URL as the key and a list of all of that URL's extensions as the value. Here is an example using my earlier 'example.com':

In[4]: site = 'http://example.com/page'
In[5]: get_urls(site)
Out[5]: {'http://example.com/page':['/2']}

The crawling and Beautiful Soup examples are great, but if there are URLs that are not directly linked from any page, I can't find them that way. Yes, that usually isn't a problem, but I would like to be able to!


2 Answers


I'm interpreting your question as "Given a URL, find the set of URLs that exist 'below' that URL." If that's not correct, please update your question; it's not very clear.

It is not possible to discover the entire set of valid paths on a domain; your only option would be to literally iterate over every valid character, e.g. /, /a, /b, /c, ..., /aa, ..., and visit each of these URLs to determine whether the server returns a 200 or not. I hope it's obvious this is simply not feasible.

It is possible (though there are caveats, and the website owner may not like it / block you) to crawl a domain by visiting a predefined set of pages, scraping all the links out of the page, following those links in turn, and repeating. This is essentially what Google does. This will give you a set of "discoverable" paths on a domain, which will be more or less complete depending on how long you crawl for, and how vigorously you look for URLs in their pages. While more feasible, this will still be very slow, and will not give you "all" URLs.
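For example, a rough sketch of that crawl-and-follow idea might look like this (using the third-party requests and beautifulsoup4 packages, which you would have to install; this is an illustration of the approach, not a polished crawler):

# Sketch only: follow same-host links starting from one page, collecting
# the paths of every page reached that way.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=100):
    host = urlparse(start_url).netloc
    to_visit = [start_url]
    seen = set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that time out or error
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).netloc == host:
                to_visit.append(link)
    return {urlparse(u).path for u in seen}

Note that this only ever finds pages that are linked from something it has already visited, which is exactly the limitation described above.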

What problem exactly are you trying to solve? Crawling whole websites is likely not the right way to go about it, perhaps if you explain a little more your ultimate goal, we can help identify a better course of action than what you're currently imagining.


The underlying issue is that there isn't necessarily any clear meaning of an "extension" to a URL. If I run a website (whether my site lives at http://example.com, http://subdomain.example.com, or http://example.com/page/ doesn't matter) I can trivially configure my server to respond successfully to any request you throw at it. It could be as simple as saying "every request to http://example.com/page/.* returns Hello World." and all of a sudden I have an infinite number of valid pages. Web servers and URLs are similar to, but fundamentally not the same as, hard drives and files. Unlike a hard drive, which holds a finite number of files, a website can say "yes, that path exists!" to as many requests as it likes. This makes getting "all possible" URLs impossible.
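To make that concrete, here is a minimal sketch of such a server (using Flask purely as an example; any web framework can do this):

# Sketch only: every path under /page/ returns a 200, so this site has
# an unbounded number of "valid" URLs.
from flask import Flask

app = Flask(__name__)

@app.route("/page/<path:anything>")
def page(anything):
    # /page/foo, /page/foo/bar/baz, ... all succeed.
    return "Hello World"

if __name__ == "__main__":
    app.run()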

Beyond that, webservers often don't want you to be able to find all valid pages - perhaps they're only accessible if you're logged in, or at certain times of day, or to requests coming from China - there's no requirement that a URL always exist, or that the webserver tell you it exists. I could very easily put my infinite-URL behavior below http://example.com/secret/path/no/one/knows/about/.* and you'd never know it existed unless I told you about it (or you manually crawled all possible URLs...).

So, long story short: no, it is not possible to get all URLs, or even a restricted subset of them, because there could theoretically be an infinite number of them, and you have no way of knowing whether that is the case.


"if I can add restrictions, that will make it easier!"

I understand why you think this, but unfortunately it is not actually true. Think about URLs like regular expressions. How many strings match the regular expression .*? An infinite number, right? How about /path/.*? Fewer? Or /path/that/is/long/and/explicit/.*? Counterintuitive though it may seem, there are actually no fewer URLs that match the last case than the first.
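If that is hard to believe, note that for any of these patterns you can construct matching strings of whatever length you like, so each pattern matches infinitely many distinct strings:

import re

patterns = [r".*", r"/path/.*", r"/path/that/is/long/and/explicit/.*"]
for pattern in patterns:
    prefix = pattern[:-2]  # everything before the trailing ".*"
    for n in (1, 10, 1000):
        # Arbitrarily long suffixes still match, for every pattern.
        assert re.fullmatch(pattern, prefix + "x" * n)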

Now that said, my answer up to this point has been about the general case, since that's how you posed the question. If you clearly define and restrict the search space, or loosen the requirements of the question, you can get an answer. Suppose you instead said "Is it possible to get all URLs that are listed on this page and match my filter?" then the answer is yes, absolutely. And in some cases (such as Apache's Directory Listing behavior) this will coincidentally be the same as the answer to your original question. However there is no way to guarantee this is actually true - I could perfectly easily have a directory listing with secret, unlisted URLs that still match your pattern, and you wouldn't find them.
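For instance, a rough sketch of that narrower version could look like the following (again using requests and beautifulsoup4, reusing your get_urls name for a single URL, and interpreting restrictions='first' as "paths with only a single '/'", since you don't define it precisely):

# Sketch only: collect the same-host links listed on one page and
# filter them; it cannot find anything that isn't linked on the page.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def get_urls(url, restrictions=None):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    host = urlparse(url).netloc
    paths = []
    for anchor in soup.find_all("a", href=True):
        link = urljoin(url, anchor["href"])
        parsed = urlparse(link)
        if parsed.netloc != host:
            continue  # ignore links pointing off-site
        if restrictions == "first" and parsed.path.count("/") > 1:
            continue  # keep only single-segment paths like /faq
        paths.append(parsed.path)
    return {url: sorted(set(paths))}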

Answered on 2013-05-29T04:18:00.850

There is a good answer to this question. Essentially, you are asking why you need a crawler rather than a list of all the directories. Wikipedia explains that "the basic premise is that some websites have large numbers of dynamic pages that are only available through the use of forms and user entries."

Answered on 2013-05-29T04:13:32.933