python - Verify that the URLs in a file exist

Question

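So I have some code that searches a mailbox for specific URLs. When it finishes, it creates a file called links.txt.

I want to run a script against that file to get output for all of the current URLs in that list. The script I have only lets me check one URL at a time.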

import urllib2

for url in ["www.google.com"]:

    try:
        connection = urllib2.urlopen(url)
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError, e:
        print e.getcode()

score 4 · Accepted Answer

Use requests:

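import requests

filename = "links.txt"  # the file produced by the mailbox search

good_links = []
with open(filename) as f:
    for link in f:  # iterate over the open file, one URL per line
        try:
            r = requests.get(link.strip())
        except Exception:
            continue  # skip URLs that could not be fetched at all
        good_links.append(r.url)  # resolves redirects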

You might also consider extracting the call to requests.get into a helper function:

def make_request(method, url, **kwargs):
    # Try up to 10 times and return the first response that comes back.
    for i in range(10):
        try:
            r = requests.request(method, url, **kwargs)
            return r
        except requests.ConnectionError as e:
            print(e)
        except requests.HTTPError as e:
            print(e)
        except requests.RequestException as e:
            print(e)
    raise Exception("requests did not succeed")
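
One caveat about the good_links loop: requests.get only raises when the request itself fails (for example a DNS error or a refused connection). A 404 or 500 response comes back normally and would be counted as good. Below is a minimal sketch that also rejects error statuses, assuming that is the behaviour you want. The names collect_good_links and fetch are hypothetical; fetch is expected to return a requests-style response that has raise_for_status() and url.

```python
def collect_good_links(lines, fetch):
    """Return the final URLs of links that answered with a non-error status.

    `fetch` is any callable that takes a URL and returns a requests-style
    response, e.g. lambda url: requests.get(url, timeout=10).
    """
    good_links = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        try:
            response = fetch(url)
            response.raise_for_status()  # raises for 4xx and 5xx responses
        except Exception:
            continue  # unreachable, malformed, or error status
        good_links.append(response.url)  # URL after redirects
    return good_links
```

With requests installed, this could be called on the open file as collect_good_links(f, lambda url: requests.get(url, timeout=10)). Passing the fetch function in keeps the loop itself free of network code.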

score 1 · Accepted Answer

Given that you are already iterating over a list of URLs, making this change is trivial:

import urllib2

for url in open("urllist.txt"):   # change 1

    try:
        connection = urllib2.urlopen(url.rstrip())   # change 2
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError, e:
        print e.getcode()

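Iterating over a file returns the lines of the file (including the line ending). We call rstrip() on the URL to strip the line ending.

There are other improvements you could make. For example, some people would suggest using with to make sure the file gets closed. That is good practice, but probably not necessary in this script.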
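
For reference, here is a minimal sketch of the with variant. It is written for Python 3, where urllib2 was split into urllib.request and urllib.error, rather than the Python 2 used above. The function name check_urls, the opener parameter, and the choice to record None for URLs that got no HTTP response are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def check_urls(path, opener=urllib.request.urlopen):
    """Return a list of (url, status) pairs, one per non-blank line in path.

    status is the HTTP status code, or None if no HTTP response was received.
    `opener` is a hypothetical hook so the function can be exercised offline.
    """
    results = []
    with open(path) as f:  # the file is closed when this block exits
        for line in f:
            url = line.strip()
            if not url:
                continue
            try:
                with opener(url) as connection:
                    status = connection.getcode()
            except urllib.error.HTTPError as e:
                status = e.code  # the server answered with an error status
            except (urllib.error.URLError, ValueError):
                status = None  # unreachable host or malformed URL
            results.append((url, status))
    return results
```

Calling check_urls("urllist.txt") and printing each pair gives one line of output per URL. Note that a URL without a scheme, such as www.google.com, is reported as None here because urlopen rejects it with a ValueError.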
