I have a log file which contains:

http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
http://www.downloadray.com/windows/Photos_and_Images/Graphic_Capture/
http://www.downloadray.com/windows/Photos_and_Images/Digital_Photo_Tools/

And I have this code:

from bs4 import BeautifulSoup
import urllib
import urlparse

f = open("downloadray2.txt")
g = open("downloadray3.txt", "w")

for line in f.readlines():
    i = 1
    while 1:
        url = line+"?page=%d" % i
        pageHtml = urllib.urlopen(url)
        soup = BeautifulSoup(pageHtml)

        has_more = 1
        for a in soup.select("div.n_head2 a[href]"):
            try:
                print (a["href"])
                g.write(a["href"]+"\n")
            except:
                print "no link"
        if has_more:
            i += 1
        else:
            break

This code doesn't raise any errors, but it doesn't work either. I've tried modifying it but couldn't fix it. However, when I try this code, it works fine:

from bs4 import BeautifulSoup
import urllib
import urlparse

g = open("downloadray3.txt", "w")

url = "http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/"
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)

i = 1
while 1:
    url1 = url+"?page=%d" % i
    pageHtml = urllib.urlopen(url1)
    soup = BeautifulSoup(pageHtml)

    has_more = 2

    for a in soup.select("div.n_head2 a[href]"):
        try:
            print (a["href"])
            g.write(a["href"]+"\n")
        except:
            print "no link"
    if has_more:
        i += 1
    else:
        break

So how can I make it read the links from the log text file? Going through the links one by one by hand is tedious.

1 Answer

Did you strip the newline from the end of each line?

for line in f.readlines():
    line = line.strip()

`readlines()` produces a list of the lines read from the file, including the trailing newline character `\n`.
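
To see the problem for yourself, here is a minimal sketch (Python 2, like the rest of this post, and assuming the same downloadray2.txt from the question) that uses repr() to make the invisible trailing newline visible:

f = open("downloadray2.txt")
for line in f.readlines():
    print repr(line)          # shows the trailing '\n', e.g. '...Image_Convertors/\n'
    print repr(line.strip())  # strip() removes it
f.close()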

As proof, here is what printing the `url` variable shows (right after the line `url = line+"?page=%d" % i`):

With your original code:

http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=1
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=2
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=3

With my suggested fix:

http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=1
http://www.downloadray.com/TIFF-to-JPG_download/
http://www.downloadray.com/Moo0-Image-Thumbnailer_download/
http://www.downloadray.com/Moo0-Image-Sizer_download/
http://www.downloadray.com/Advanced-Image-Viewer-and-Converter_download/
http://www.downloadray.com/GandMIC_download/
http://www.downloadray.com/SendTo-Convert_download/
http://www.downloadray.com/PNG-To-JPG-Converter-Software_download/
http://www.downloadray.com/Graphics-Converter-Pro_download/
http://www.downloadray.com/PICtoC_download/
http://www.downloadray.com/Free-Images-Converter_download/
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=2
http://www.downloadray.com/VarieDrop_download/
http://www.downloadray.com/Tinuous_download/
http://www.downloadray.com/Acme-CAD-Converter_download/
http://www.downloadray.com/AAOImageConverterandFTP_download/
http://www.downloadray.com/ImageCool-Converter_download/
http://www.downloadray.com/GeoJpeg_download/
http://www.downloadray.com/Android-Resizer-Tool_download/
http://www.downloadray.com/Scarab-Darkroom_download/
http://www.downloadray.com/Jpeg-Resizer_download/
http://www.downloadray.com/TIFF2PDF_download/
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=3
http://www.downloadray.com/JGraphite_download/
http://www.downloadray.com/Easy-PNG-to-Icon-Converter_download/
http://www.downloadray.com/JBatch-It!_download/
http://www.downloadray.com/Batch-It!-Pro_download/
http://www.downloadray.com/Batch-It!-Ultra_download/
http://www.downloadray.com/Image-to-Ico-Converter_download/
http://www.downloadray.com/PSD-To-PNG-Converter-Software_download/
http://www.downloadray.com/VectorNow_download/
http://www.downloadray.com/KeitiklImages_download/
http://www.downloadray.com/STOIK-Smart-Resizer_download/

Update

That said, this code still won't run as expected, because the `while` loop never terminates: the `has_more` variable never changes.

When the list returned by `soup.select(...)` is empty, you know there are no more links. You can check whether it is empty using `len(...)`. So that part could look like this:
    list_of_links = soup.select("div.n_head2 a[href]")
    if len(list_of_links)==0:
        break
    else:
        for a in list_of_links:
            print (a["href"])
            g.write(a["href"]+"\n")
        i += 1

Apparently, if you query past the maximum page number, the site still serves the latest available page. So if the highest available page is 82 and you request page 83, it will give you page 82 again. To detect this, you can save the previous page's list of URLs and compare it with the current list.

Here is the full code (tested):

from bs4 import BeautifulSoup
import urllib
import urlparse

f = open("downloadray2.txt")
g = open("downloadray3.txt", "w")

for line in f.readlines():
    line = line.strip()
    i = 1
    prev_urls = []
    while 1:
        url = line+"?page=%d" % i
        print 'Examining %s' % url
        pageHtml = urllib.urlopen(url)
        soup = BeautifulSoup(pageHtml)

        list_of_urls = soup.select("div.n_head2 a[href]")
        if set(prev_urls)==set(list_of_urls):
            break
        else:
            for a in soup.select("div.n_head2 a[href]"):
                print (a["href"])
                g.write(a["href"]+"\n")
            i += 1
            prev_urls = list_of_urls
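
As a side note, all of the code above is Python 2 (print statements, urllib.urlopen, the urlparse module). If you ever move to Python 3, a rough equivalent sketch follows; the module layout is the main change (urllib.urlopen becomes urllib.request.urlopen, and bs4 wants an explicit parser). This version also compares the extracted href strings rather than the tag objects, which sidesteps any question of how tags compare:

from bs4 import BeautifulSoup
import urllib.request

with open("downloadray2.txt") as f, open("downloadray3.txt", "w") as g:
    for line in f:
        line = line.strip()
        i = 1
        prev_urls = []
        while True:
            url = line + "?page=%d" % i
            print('Examining %s' % url)
            page_html = urllib.request.urlopen(url)
            soup = BeautifulSoup(page_html, "html.parser")

            # Compare the extracted hrefs; the site serves the last page again
            # for out-of-range page numbers, so stop when nothing changes.
            list_of_urls = [a["href"] for a in soup.select("div.n_head2 a[href]")]
            if list_of_urls == prev_urls:
                break
            for href in list_of_urls:
                print(href)
                g.write(href + "\n")
            i += 1
            prev_urls = list_of_urls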