python - Python Web Scraping - Navigating to Next_Page links and getting data

Question

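I am using Python and Beautiful Soup to get the URLs of the available software from the Civic Commons - Social Media link. I want the links to all of the social media software (spread across 20 pages). I am able to get the URLs of the software listed on the first page.

Below is the Python code I wrote to get these values.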

from bs4 import BeautifulSoup
import re
import urllib2

base_url = "http://civiccommons.org"
url = "http://civiccommons.org/software-functions/social-media"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

list_of_links = [] 
for link_tag in soup.findAll('a', href=re.compile('^/apps/.*')):
   string_temp_link = base_url+link_tag.get('href')
   list_of_links.append(string_temp_link)

list_of_links = list(set(list_of_links))  

for link_item in list_of_links:
   print link_item

print ("\n")

#Newly added code to get all Next Page links from a url    
next_page_links = [] 
for link_tag in soup.findAll('a', href=re.compile('^/.*page=')):
   string_temp_link = base_url+link_tag.get('href')
   next_page_links.append(string_temp_link)
for next_page in next_page_links:
   print next_page

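I use the /apps/ regular expression to get the list of software.

But I would like to know whether there is a better way to crawl the next pages. I can use the regular expression "*page=" to match the next-page links, but that gives a list with duplicate pages.

How can I do this in a better way?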

score 2 · Accepted Answer

Looking at the page, there are 5 pages, and the last one is "...?page=4", so we know there is the first page, followed by page=1 through page=4...

<li class="pager-last last">
<a href="/software-licenses/gpl?page=4" title="Go to last page">last »</a>
</li>

So you can retrieve it by class (or title) and then parse the href ...

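from urlparse import urlparse, parse_qs

# `url` here is the href of the "last" pager link, e.g. "/software-licenses/gpl?page=4"
last_page = int(parse_qs(urlparse(url).query)['page'][0])
for pageno in xrange(1, last_page + 1):
    pass # do something useful here like building a url string with pageno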
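
The loop above leaves the URL building open. One way to fill it in, as a minimal sketch: it is written for Python 3 (where urlparse lives in urllib.parse, unlike the Python 2 code above), the helper name build_page_urls is made up for illustration, and it assumes that `page` is the only query parameter and that the first page is the listing URL with no query string. The href is the one from the pager markup quoted above; fetching it with Beautiful Soup is left out here.

```python
from urllib.parse import urljoin, urlparse, parse_qs


def build_page_urls(base_url, last_href):
    """Return the URL of every listing page, given the href of the "last" pager link.

    Hypothetical helper: assumes `page` is the only query parameter.
    """
    # Resolve the relative href against the site root.
    last_url = urljoin(base_url, last_href)
    # Pull the highest page index out of the query string, e.g. "page=4" -> 4.
    last_page = int(parse_qs(urlparse(last_url).query)['page'][0])
    # The first page is the bare listing URL with no query string.
    first_url = last_url.split('?', 1)[0]
    return [first_url] + ['%s?page=%d' % (first_url, n)
                          for n in range(1, last_page + 1)]


# href taken from the pager markup quoted in the answer
urls = build_page_urls("http://civiccommons.org", "/software-licenses/gpl?page=4")
for u in urls:
    print(u)
```

Each URL in the list can then be fetched and scanned for /apps/ links exactly as in the question's first loop. Because every page is visited once by number, the duplicates produced by matching every "page=" link do not arise.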
