python - How to delete a line from a file after using it

Question

I am trying to create a script that makes requests to random URLs from a txt file, for example:

import urllib2

with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        if r.code in (200, 401):
            print '[{}]: '.format(url), "Up!"

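But when a URL returns 404 Not Found, I want the line containing that URL to be removed from the file. Each line holds one unique URL, so the goal is basically to delete every URL (and its line) that returns 404 Not Found. How can I do this?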

score 2 · Accepted Answer

You can simply collect all the URLs that work and then write them back to the file:

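import urllib2

good_urls = []
with open('urls.txt') as urls:
    for line in urls:
        url = line.strip()  # drop the trailing newline before requesting
        if not url:
            continue  # skip blank lines
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            # HTTPError (a URLError subclass) has .code; a plain URLError does not
            r = e
        if getattr(r, 'code', None) in (200, 401):
            print '[{}]: '.format(url), "Up!"
            good_urls.append(url)
# Overwrite the file with only the URLs that responded
with open('urls.txt', 'w') as urls:
    urls.write("".join(u + "\n" for u in good_urls))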
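For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same idea might look roughly like the sketch below. The names status_of and prune_urls are hypothetical, not part of the answer above. The status lookup is passed in as a parameter so the filtering can be tried without network access. Unlike the answer's code, this sketch drops only the URLs that return 404, which is the goal stated in the question.

```python
import urllib.error
import urllib.request


def status_of(url):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as e:
        # 4xx/5xx responses raise HTTPError, which carries the status code
        return e.code
    except (urllib.error.URLError, ValueError, OSError):
        # DNS failure, refused connection, malformed URL, and so on
        return None


def prune_urls(path, get_status=status_of, dead_codes=(404,)):
    """Rewrite path, dropping every URL whose status is in dead_codes."""
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    kept = [u for u in urls if get_status(u) not in dead_codes]
    with open(path, "w") as f:
        f.writelines(u + "\n" for u in kept)
    return kept
```

Calling prune_urls('urls.txt') would check each URL over the network. Passing a different get_status (for example a dictionary's get method) makes it possible to try the rewrite logic offline.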

score 1 · Accepted Answer

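The simplest approach is to read all the lines, loop over them and try to open each URL, and then, once you are done, rewrite the file if any URL failed.

The way to rewrite the file is to write a new file and, once the new file has been written successfully and closed, use os.rename() to give it the old file's name, replacing the old file. This is the safe approach: you never overwrite the good file until you know the new one was written correctly.

I think the easiest way to do this is to build a list in which you collect the good URLs, and keep a count of the failed ones. If the count is not zero, the text file needs to be rewritten. Alternatively, you can collect the bad URLs in a second list, which is what I did in this example code. (I have not tested this code, but I think it should work.)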

import os
import urllib2

input_file = "urls.txt"
debug = True

good_urls = []
bad_urls = []

bad, good = range(2)

def track(url, good_flag, code):
    if good_flag == good:
        good_str = "good"
    elif good_flag == bad:
        good_str = "bad"
    else:
        good_str = "ERROR! (" + repr(good_flag) + ")"
    if debug:
        print("DEBUG: %s: '%s' code %s" % (good_str, url, repr(code)))
    if good_flag == good:
        good_urls.append(url)
    else:
        bad_urls.append(url)

with open(input_file) as f:
    for line in f:
        url = line.strip()
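        try:
            r = urllib2.urlopen(url)
            code = r.code
        except urllib2.HTTPError as e:
            # urlopen raises HTTPError for 4xx/5xx responses, including 404
            code = e.code
        except urllib2.URLError as e:
            # no HTTP response at all (DNS failure, refused connection, ...)
            track(url, bad, "exception!")
            continue
        if code in (200, 401):
            print '[{0}]: '.format(url), "Up!"
        if code == 404:
            # URL is bad if it is missing (code 404)
            track(url, bad, code)
        else:
            # any code other than 404, assume URL is good
            track(url, good, code)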

# if any URLs were bad, rewrite the input file to remove them.
if bad_urls:
    # simple way to get a filename for temp file: append ".tmp" to filename
    temp_file = input_file + ".tmp"
    with open(temp_file, "w") as f:
        for url in good_urls:
            f.write(url + '\n')
    # if we reach this point, temp file is good.  Remove old input file
    os.remove(input_file)  # only needed for Windows
    os.rename(temp_file, input_file)  # replace original input file with temp file

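Edit: In the comments, @abarnert suggested that using os.rename() on Windows may be problematic (at least I think that is what he/she meant). If os.rename() does not work, you should be able to use shutil.move() instead.

Edit: Rewrote the code to handle errors.

Edit: Rewrote the code to print verbose messages while tracking URLs. This should help with debugging. Also, I actually tested this version, and it worked for me.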
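Regarding the os.rename() concern on Windows: on Python 3.3 and later, os.replace() overwrites an existing destination on both Windows and POSIX, so the separate os.remove() step is not needed there. The sketch below is one way the temp-file rewrite could look on Python 3. The name rewrite_lines is hypothetical, and the temp file is created in the same directory as the target so the final rename stays on one filesystem.

```python
import os
import tempfile


def rewrite_lines(path, lines):
    """Replace the contents of path with lines, going through a temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            for line in lines:
                f.write(line + "\n")
        # The temp file is complete and closed; swap it in over the original.
        os.replace(temp_path, path)
    except BaseException:
        # Leave the original file untouched and remove the partial temp file.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

With the lists from the code above, the call would be rewrite_lines(input_file, good_urls). Note that tempfile.mkstemp() creates the file readable and writable only by its owner, so the rewritten file may end up with different permissions than the original.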
