I was working on building a parser for a site structured much like Google (i.e. a series of consecutive result pages, each containing a table of content of interest).
The combination of the Selenium package (for page-element-based site navigation) and BeautifulSoup (for html parsing) turned out to be the weapon of choice for harvesting the written content. You may find them useful too, although I have no idea what kind of defenses Google has in place to deter scraping.
A possible implementation for Mozilla Firefox using selenium, beautifulsoup and geckodriver:
from bs4 import BeautifulSoup, SoupStrainer
from os.path import isfile
from time import sleep
import codecs
from selenium import webdriver


def first_page(link):
    """Takes a link, and scrapes the desired tags from the html code."""
    driver = webdriver.Firefox(executable_path='C://example/geckodriver.exe')  # Specify the appropriate driver for your browser here
    counter = 1
    driver.get(link)
    html = driver.page_source
    filter_html_table(html)
    counter += 1
    return driver, counter


def nth_page(driver, counter, max_iter):
    """Takes a driver instance, a counter to keep track of iterations, and max_iter
    for the maximum number of iterations. Looks for a page element matching the
    current iteration (how you need to program this depends on the html structure
    of the page you want to scrape), navigates there, and calls scrape_page to scrape."""
    while counter <= max_iter:
        pageLink = driver.find_element_by_link_text(str(counter))  # For other strategies to retrieve elements from a page, see the selenium documentation
        pageLink.click()
        scrape_page(driver)
        counter += 1
    else:
        print("Done scraping")
    return


def scrape_page(driver):
    """Takes a driver instance, extracts html from the current page, and calls
    the function that extracts the desired tags from the full page's html."""
    html = driver.page_source  # Get html from page
    filter_html_table(html)  # Call function to extract desired html tags
    return


def filter_html_table(html):
    """Takes a full page of html, filters the desired tags using beautifulsoup,
    and calls the function that writes them to file."""
    only_td_tags = SoupStrainer("td")  # Specify which tags to keep
    filtered = BeautifulSoup(html, "lxml", parse_only=only_td_tags).prettify()  # Specify how to represent content
    write_to_file(filtered)  # Function call to store extracted tags in a local file.
    return


def write_to_file(output):
    """Takes the scraped tags, opens a new file if the file does not exist,
    or appends to an existing file, and writes the extracted tags to it."""
    fpath = "<path to your output file>"
    if isfile(fpath):
        f = codecs.open(fpath, 'a')  # using 'codecs' to avoid problems with utf-8 characters in ASCII format.
        f.write(output)
        f.close()
    else:
        f = codecs.open(fpath, 'w')  # using 'codecs' to avoid problems with utf-8 characters in ASCII format.
        f.write(output)
        f.close()
    return
After that, it is simply a matter of calling:
link = "<link to site to scrape>"
driver, n_iter = first_page(link)
nth_page(driver, n_iter, 1000)  # the 1000 lets us scrape up to 1000 of the result pages
Note that this script assumes that the result pages you are trying to scrape are numbered sequentially, and that those numbers can be retrieved from the scraped page's html using 'find_element_by_link_text'. For other strategies to retrieve elements from a page, see the selenium documentation.
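As a side note, newer Selenium releases (4 and later) have removed the find_element_by_* helpers in favor of find_element with a By locator, so that lookup would need a small adjustment there. A minimal sketch, assuming the result pages still expose the page number as link text (the CSS selector below is a made-up example):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("<link to site to scrape>")

# Selenium 4+ equivalent of driver.find_element_by_link_text(str(counter))
page_link = driver.find_element(By.LINK_TEXT, "2")
page_link.click()

# Other locator strategies, depending on the page's html structure:
# next_button = driver.find_element(By.CSS_SELECTOR, "a.next")  # hypothetical selector
# cells = driver.find_elements(By.XPATH, "//table//td")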
Also note that you will need to download the packages this depends on, as well as the driver that selenium requires to communicate with your browser (here geckodriver): download geckodriver, place it in a folder, and then point to the executable in 'executable_path'.
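For reference, recent Selenium 4 releases have dropped the executable_path keyword on webdriver.Firefox and pass the driver location through a Service object instead; a minimal sketch, with a hypothetical path:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Hypothetical location; point this at wherever you unpacked geckodriver.
service = Service(executable_path="C:/example/geckodriver.exe")
driver = webdriver.Firefox(service=service)
driver.get("<link to site to scrape>")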
If you do end up using these packages, it can help to spread out your server requests using the time package (native to Python) to avoid exceeding the maximum number of requests allowed by the server you are scraping. I did not end up needing it for my own project, but see the second answer to the original question for an example implementation of the time module, used in its fourth code block.
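A minimal sketch of how such throttling could be folded into the paging loop above, reusing scrape_page from the script and assuming an arbitrary 2-second delay:

from time import sleep

def nth_page_throttled(driver, counter, max_iter, delay=2):
    """Same loop as nth_page, but pauses between requests to stay under rate limits."""
    while counter <= max_iter:
        pageLink = driver.find_element_by_link_text(str(counter))
        pageLink.click()
        scrape_page(driver)  # defined in the script above
        sleep(delay)  # wait 'delay' seconds before requesting the next result page
        counter += 1
    print("Done scraping")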
Yeeeeaaaahhh... if someone with higher rep could edit this and add links to the beautifulsoup, selenium and time documentation, that would be great, thaaaanks.