python - Why does this loop return everything twice?

Question

I have the following code:

import re
from bs4 import BeautifulSoup

f = open('AIDNIndustrySearchAll.txt', 'r')
g = open('AIDNurl.txt', 'w')
t = f.read()
soup = BeautifulSoup(t)

list = []
counter = 0

for link in soup.find_all("a"):
    a = link.get('href')
    if re.search("V", a) != None:
        list.append(a)
        counter = counter + 1

new_list = ['http://www.aidn.org.au/{0}'.format(i) for i in list]
output = "\n".join(i for i in new_list)

g.write(output)

print output
print counter

f.close()
g.close()

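It basically goes through a saved HTML page and pulls out the links I'm interested in. I'm new to Python, so I'm sure the code is awful, but it (almost) works ;)

The current problem is that it returns two copies of each link instead of one. I'm sure it has something to do with how the loop is set up, but I'm a bit stuck.

I'd welcome any help with this problem (I can provide more detail if needed, such as the HTML and more information about the links I'm looking for), as well as any general code improvements, so I can learn as much as possible.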

score 2 · Accepted Answer

Since you also asked for code improvements, I'll post my suggestions as an answer. Feel free to use them!

from bs4 import BeautifulSoup

f = open('AIDNIndustrySearchAll.txt', 'r')
t = f.read()
f.close()

soup = BeautifulSoup(t)
results = []   ## 'list' is a built-in type and shouldn't be used as variable name

for link in soup.find_all('a'):
    a = link.get('href')
    if a and 'V' in a:   ## keep hrefs containing 'V'; skip anchors without an href
        results.append(a)

formatted_results = ['http://www.aidn.org.au/{0}'.format(i) for i in results]
output = "\n".join(formatted_results)

g = open('AIDNurl.txt', 'w')
g.write(output)
g.close()

print output
print len(results)

This still doesn't solve your original problem; see my comments and the others' on the question.

score 2 · Accepted Answer

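As others have pointed out in the comments, your loop looks fine, so the duplication is most likely in the HTML itself. If you can share a link to the HTML file, perhaps we can help further.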
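One way to check whether the duplication really is in the HTML is to count how often each href occurs before any filtering. The following is only a minimal sketch: the helper name `find_duplicates` and the sample hrefs are made up for illustration, and in the script from the question the list would instead be built from `link.get('href')` for each link returned by `soup.find_all('a')`.

```python
from collections import Counter


def find_duplicates(hrefs):
    """Return {href: count} for every href that occurs more than once."""
    # Skip None / empty values, which appear when an <a> tag has no href.
    counts = Counter(h for h in hrefs if h)
    return {h: n for h, n in counts.items() if n > 1}


# Hypothetical sample data standing in for the hrefs pulled from the saved page.
sample_hrefs = [
    "View.aspx?id=1",
    "View.aspx?id=1",
    "View.aspx?id=2",
    None,
    "about.html",
]

duplicates = find_duplicates(sample_hrefs)
print(duplicates)
```

If this reports every interesting link with a count of 2, the page itself contains each link twice (for example once around an image and once around the text), and the loop is not at fault.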

As for general code improvements, I would probably do something like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('AIDNIndustrySearchAll.txt', 'r'))

# create a generator that returns actual href entries
links = (x.get('href') for x in soup.find_all('a'))

# filter the links to only those that contain "V" and store it as a
# set to remove duplicates (skipping anchors that have no href)
selected = set(a for a in links if a and "V" in a)

# build output string using selected links
output = "\n".join('http://www.aidn.org.au/{0}'.format(a) for a in selected)

# write the string to file
with open('AIDNurl.txt', 'w') as f:
  f.write(output)

print output
print len(selected)  # print number of selected links
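One caveat with the set-based version: a set does not keep the links in the order they appear on the page. If the order matters, a small helper can drop repeats while keeping the first occurrence of each link. This is a sketch under that assumption; `unique_in_order` and the sample hrefs are illustrative names, not part of the original code.

```python
def unique_in_order(items):
    """Drop repeated items while keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


# Hypothetical hrefs; in the script above they would come from the `links` generator.
hrefs = ["View.aspx?id=2", "View.aspx?id=1", "View.aspx?id=2"]
ordered = unique_in_order(hrefs)
print(ordered)
```

The result could then be used in place of `selected` when building the output string.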

score 0 · Accepted Answer

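find_all returns a list of all matching elements. If you only want the first one, you can do: for link in soup.find_all("a")[:1]:. It's not clear why the list contains duplicate links. You can use print statements to get a better understanding of the code: print the list, the list length, and so on. You can also use pdb.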
