python - Python: Print data from a specific href (with an ID tag)

Question

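I'm new to Python and am trying to build my first web scraper. I want to go to a page, open a bunch of subpages, find a specific link (with an ID) on each subpage, and then print the link data. Right now I get the error 'list indices must be integers, not str', which means I'm doing something wrong in (at least) the last line of code.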

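What I'm really unsure about is what I need to do to get and parse the href data from the specific link, because I think the rest is working (loading the subpages). The scraper is (supposed) to grab the URLs of all the Danish municipalities and print them out, so the first line printed should be: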

http://www.albertslund.dk (followed by 97 more)

Anyway, here is the code. I hope someone can tell me what I'm doing wrong. Thanks a bunch in advance.

from BeautifulSoup import BeautifulSoup
from mechanize import Browser

f = open("kommuneadresser.txt", "w")
br = Browser()
url = "https://bdkv2.borger.dk/foa/Sider/default.aspx?fk=22&foaid=11541520"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
link = soup.findAll('a')
kommunelink = link[21:116]

#we create a loop - for every single kommunelink in the list, 
#'something' will happen
for kommune in kommunelink:
    #the link-part is saved as a string
    kommuneurl = kommune['href']
    #we construct a new url from two strings
    fuldurl = "https://bdkv2.borger.dk/" + kommuneurl
    #we open the page and save it in a variable
    kommuneside = br.open(fuldurl)
    #we read the page
    html2 = kommuneside.read()
    #we handle the page in beautifulsoup
    soup2 = BeautifulSoup(html2)
    #we find a specific link on the page
    hjemmesidelink = soup2.findAll('a', attras={'ID':"uscAncHomesite"})
    print hjemmesidelink['href']

score 1 · Accepted Answer


Have you tried this?

for link in soup.find_all('a'):
    print(link.get('href'))

Answered 2012-07-30T14:22:51.470

score 1 · Accepted Answer

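First of all, BeautifulSoup.findAll() returns a list, and a list can only be indexed with an integer, not with a string such as 'href'. That is where the error comes from.

Also, you probably want to run the last findAll on soup2. I'm not sure which item in hjemmesidelink you need, so try this for the last 5 lines of your code: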

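#we handle the page in beautifulsoup
soup2 = BeautifulSoup(html2)
#we find a specific link on the page
#(the keyword argument is attrs, and the attribute name is lowercase 'id')
hjemmesidelink = soup2.findAll('a', attrs={'id':"uscAncHomesite"})
#findAll returns a list, so this prints every match
print hjemmesidelink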

You would print the first item this way:

print hjemmesidelink[0]
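
To go from the matched tag to the href itself, something like the following might work. This is only a sketch under assumptions: it uses the newer bs4 package on Python 3 rather than the BeautifulSoup 3 / Python 2 setup in the question, and the HTML string is a made-up stand-in for one subpage, so the real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one municipality subpage; the real markup may differ.
html2 = """
<html><body>
  <a href="/foa/Sider/default.aspx">Back</a>
  <a id="uscAncHomesite" href="http://www.albertslund.dk">Homepage</a>
</body></html>
"""

soup2 = BeautifulSoup(html2, "html.parser")

# find_all() returns a list-like ResultSet, so pick an item by position
# before asking for an attribute by name.
matches = soup2.find_all("a", attrs={"id": "uscAncHomesite"})
first_href = matches[0]["href"] if matches else None

# find() returns the first matching tag (or None if nothing matches),
# so the attribute can be read from it directly.
tag = soup2.find("a", id="uscAncHomesite")
href = tag["href"] if tag is not None else None

print(first_href)
print(href)
```

Since an id is normally unique on a page, find() is probably the more natural choice here. The None checks are there so that a subpage without the link does not stop the whole loop.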
