python - How to handle URLs with Python

Question

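I have the following code (doop.py), which strips all the "junk" HTML markup from an .html file and outputs only the "human-readable" text. For example, it will take a file containing the following: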

<html>
<body>

<a href="http://www.w3schools.com">
This is a link</a>

</body>
</html>

and give:

$ ./doop.py
File name: htmlexample.html

This is a link

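Next I need to add a feature: if any of the html arguments denotes a URL (web address), the program should read the content of that web page instead of a disk file. (For present purposes, it is enough for doop.py to recognize an argument beginning with http://, in any mixture of letter case, as a URL.)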

I don't know where to start. I'm sure it involves telling Python to open a URL, but how do I do that?

Thanks,

A

score 2 · Accepted Answer

In addition to urllib2, which others have already mentioned, you can take a look at Kenneth Reitz's Requests module. Its syntax is much nicer than urllib2's.

import requests
r = requests.get('https://api.github.com', auth=('user', 'pass'))
r.text

score 1 · Accepted Answer

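As with most things Pythonic: there is a library for it.

The library you need here is urllib2.

It lets you open a URL much as you would open a file, and read from it like a file.

The code you need looks like this: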

import urllib2

urlString = "http://www.my.url"
try:
    f = urllib2.urlopen(urlString)  #open url
    pageString = f.read()           #read content
    f.close()                       #close url
    readableText = getReadableText(pageString)
    #continue using the pageString as you wish
except IOError:  # urllib2.URLError is a subclass of IOError
    print("Bad URL")
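
For the requirement in the question (treat an argument that starts with http://, in any mix of letter case, as a URL and anything else as a disk file), a minimal sketch might look like the following. It assumes Python 3, where urllib2's urlopen lives in urllib.request. The names is_url and read_source are made up for illustration, and decoding the page as UTF-8 is an assumption, not something the question specifies.

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def is_url(argument):
    """Return True if the argument starts with http:// in any letter case."""
    return argument.lower().startswith("http://")


def read_source(argument):
    """Return the text of a web page or of a disk file (hypothetical helper)."""
    if is_url(argument):
        # Fetch the page; the response is closed when the block exits.
        with urlopen(argument) as response:
            # Assumption: the page is UTF-8; undecodable bytes are replaced.
            return response.read().decode("utf-8", errors="replace")
    # Anything else is treated as a path to a file on disk.
    with open(argument, encoding="utf-8") as html_file:
        return html_file.read()
```

The string returned by read_source could then be handed to whatever routine doop.py already uses to strip the markup, such as getReadableText above.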

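Update: (I don't have a Python interpreter at hand, so I can't test whether this code works, but it should!) Opening the URL is the easy part, but first you need to extract the URLs from your HTML file. This is done with regular expressions (regexes), and unsurprisingly Python has a library for that (re). I suggest you read up on both regular expressions and the re module, but basically they are patterns that you can match text against.

So what you need to do is write a regex that matches URLs:

(http|ftp|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?

If you don't want to reach ftp resources by URL, remove "ftp|" from the start of the pattern. Now you can scan your input file for all character sequences that match this pattern: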

import re

# Read the input file's contents (htmlexample.html is the file from the question)
with open("htmlexample.html") as input_file:
    input_file_str = input_file.read()
# Compile the pattern matcher
pattern = re.compile(r"(http|ftp|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?")
# finditer yields match objects; findall would return tuples of the groups instead
for match in pattern.finditer(input_file_str):
    urlString = match.group(0)  # the full string that matched the pattern
    # use the code above to load the URL using the matched string!

That should do it.

score 0 · Accepted Answer

You can use a third-party library such as BeautifulSoup, or the standard HTML parser. Here is a previous Stack Overflow question: html parser python

Other links

http://unethicalblogger.com/2008/05/03/parsing-html-with-python.html

Standard library

http://docs.python.org/library/htmlparser.html

Performance comparison

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

While parsing, you will need to look for http.
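
As a rough illustration of the standard-library route, the sketch below collects the href values of <a> tags and the visible text from the sample file in the question. It assumes Python 3, where the parser lives in html.parser (the module was called HTMLParser in Python 2). The class name LinkAndTextParser is made up for this example, and the sketch does not try to skip the contents of script or style elements.

```python
from html.parser import HTMLParser  # the HTMLParser module in Python 2


class LinkAndTextParser(HTMLParser):
    """Collect href values of <a> tags and the visible text (illustrative)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep only non-blank runs of text between tags.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


sample = """<html>
<body>

<a href="http://www.w3schools.com">
This is a link</a>

</body>
</html>"""

parser = LinkAndTextParser()
parser.feed(sample)
parser.close()
print(parser.links)
print(parser.text_parts)
```

Each collected link could then be tested for an http:// prefix and fetched, without needing a hand-written URL regex.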

score 0 · Accepted Answer

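Rather than writing your own HTML parser/scraper, I would personally recommend Beautiful Soup. You can use it to load your HTML, get the elements you want from it, and find all the links, then use urllib to fetch the new links so that you can parse and process them further.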
