python - Downloading data from a .txt file containing URLs with Python, again

Question

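I'm currently trying to extract the raw data from a .txt file of 10 URLs and put the raw data for each line (URL) into its own .txt file. Then I want to repeat the process with Python for the processed data (the raw data from the same original .txt file with the HTML stripped).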

# Python 3: subprocess.getoutput replaces the Python 2-only commands module
import subprocess

# RAW DATA
# The ./rawData and ./processedData directories must already exist.
url_file = open('uri.txt', 'r')
counter_1 = 0

for line in url_file:
    counter_1 += 1
    if counter_1 < 11:
        filename = str(counter_1)
        print(line)
        command = 'curl ' + '"' + line.rstrip('\n') + '"' + ' > ./rawData/' + filename
        # Run the command inside the loop so every URL is fetched, not just the last one
        output_1 = subprocess.getoutput(command)
url_file.close()

# PROCESSED DATA
counter_2 = 0
url_file = open('uri.txt', 'r')
for line in url_file:
    counter_2 += 1
    if counter_2 < 11:
        filename = str(counter_2) + '-processed'
        command = 'lynx -dump -force_html ' + '"' + line.rstrip('\n') + '"' + ' > ./processedData/' + filename
        print(command)
        output_2 = subprocess.getoutput(command)
url_file.close()

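I'm trying to do all of this with a single script. Can anyone help me fix my code so that I can run it? It should run through the code exactly once for each line in the .txt file. For example, each URL line in my .txt file should produce one raw file and one processed .txt file.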

score 0 · Accepted Answer

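Break your code up into functions. At the moment the code is hard to read and debug. Create one function called get_raw() and another called get_processed(). Then your main loop can be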

for line in file:
    get_raw(line)
    get_processed(line)

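or something like that. Also, you should avoid "magic numbers" such as counter<11. Why 11? Is it the number of lines in the file? If so, you can get that with len(), for example by calling it on the list of lines read from the file.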
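
As a rough illustration of that structure, here is a minimal Python 3 sketch. It assumes curl and lynx are installed and on the PATH, and it reuses the uri.txt, rawData and processedData names from the question. The helper names read_urls and download_all, the out_path argument, and the run parameter (there only so the command runner can be swapped out when testing) are hypothetical additions, not part of the original code.

```python
import subprocess
from pathlib import Path


def read_urls(path):
    """Return the non-empty lines of the URL file, stripped of whitespace."""
    with open(path) as url_file:
        return [line.strip() for line in url_file if line.strip()]


def get_raw(url, out_path, run=subprocess.run):
    """Write the raw response for url to out_path using curl."""
    with open(out_path, "wb") as out:
        # -sS: quiet but still report errors, -L: follow redirects
        return run(["curl", "-sS", "-L", url], stdout=out)


def get_processed(url, out_path, run=subprocess.run):
    """Write lynx's plain-text rendering of url to out_path."""
    with open(out_path, "wb") as out:
        return run(["lynx", "-dump", "-force_html", url], stdout=out)


def download_all(url_file="uri.txt", raw_dir="rawData",
                 processed_dir="processedData", run=subprocess.run):
    """Create one raw and one processed file per URL; return the URL count."""
    urls = read_urls(url_file)
    Path(raw_dir).mkdir(exist_ok=True)
    Path(processed_dir).mkdir(exist_ok=True)
    # len(urls) takes the place of the hard-coded 11
    print("Fetching %d URLs" % len(urls))
    for number, url in enumerate(urls, start=1):
        get_raw(url, Path(raw_dir) / str(number), run=run)
        get_processed(url, Path(processed_dir) / (str(number) + "-processed"), run=run)
    return len(urls)
```

Calling download_all() from the directory that contains uri.txt should then leave files 1, 2, ... in rawData and 1-processed, 2-processed, ... in processedData. Passing each command as a list of arguments instead of one shell string also avoids quoting problems with URLs that contain characters such as & or ?.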
