Goal: search Google with a query string and scrape each result's URL, title, and the short description shown alongside them.

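I have the code below. At the moment it only returns the first 10 results, which is Google's default limit per page. I'm not sure how to handle pagination while scraping. There is also a discrepancy between what the actual results page shows and what gets printed. Finally, I'm not sure what the best way to parse the span elements is.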


<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>


from BeautifulSoup import BeautifulSoup
import urllib, urllib2

def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)

    linkdictionary = {}

    for li in soup.findAll('li', attrs={'class':'g'}):
        sLink = li.find('a')
        print sLink['href']
        sSpan = li.find('span', attrs={'class':'st'})
        print sSpan

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('beautifulsoup')


<span class="st"><em>Beautiful Soup</em>: a library designed for screen-scraping HTML and XML.<br /></span>
<span class="st"><span class="f">Feb 16, 2012 &ndash; </span>HTML/XML parser for quick-turnaround applications like screen-scraping.<br /></span>
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
<span class="st"><em>BeautifulSoup</em> is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2. <em>BeautifulSoup</em> uses a different parsing <b>...</b><br /></span>
<span class="st">The discussion group is at: http://groups.google.com/group/<em>beautifulsoup</em> &middot; Home page <b>...</b> <em>Beautiful Soup</em> 4.0 series is  the current focus of development <b>...</b><br /></span>
<span class="st"><em>Beautiful Soup BEAUTIFUL Soup</em>, so rich and green, Waiting in a hot tureen! Who for such dainties would not stoop? Soup of the evening, <em>beautiful Soup</em>!<br /></span>
<span class="st"><span class="f">Jul 6, 2009 &ndash; </span>taken from the motion picture &quot;Alice in wonderland&quot; (1999) http://www.imdb.com/<wbr>title/tt0164993/<br /></wbr></span>
<span class="st">A witty and substantive research effort on the history of soup and food in all cultures, with over 400 pages of recipes, quotations, stories, traditions, literary <b>...</b><br /></span>
<span class="st">To connect with The <em>Beautiful Soup</em> Theater Collective, sign up for Facebook <b>...</b> We&#39;re thrilled to announce the cast of <em>Beautiful Soup&#39;s</em> upcoming production of <b>...</b><br /></span>
<span class="st"><span class="f">Mar 15, 2009 &ndash; </span>Recently my life has been a hype; partly due to my upcoming Python addiction. There&#39;s simply no way around it; so I should better confess it in <b>...</b><br /></span>

The Google search results page has the following structure:

<li class="g">
<div class="vsc" sig="bl_" bved="0CAkQkQo" pved="0CAgQkgowBQ">
<h3 class="r">
<div class="vspib" aria-label="Result details" role="button" tabindex="0">
<div class="s">
<div class="f kv">
<div id="poS5" class="esc slp" style="display:none">
<div class="f slp">3 answers&nbsp;-&nbsp;Jan 16, 2009</div>
<span class="st">
I read this without finding the solution:
The "normal" way is to: Go to the
<em>Beautiful Soup</em>
web site,
Brian beat me too it, but since I already have
<h3 id="tbpr_6" class="tbpr" style="display:none">



3 Answers



>>> sSpan
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
>>> [em.replaceWithChildren() for em in sSpan.findAll('em')]
>>> sSpan
<span class="st">The Beautiful Soup Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
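
In Beautiful Soup 4 the same operation is spelled `unwrap()`. The following is only a minimal sketch, assuming the `bs4` package is installed and using a shortened, made-up version of the snippet above as input:

```python
from bs4 import BeautifulSoup

# Shortened sample input (an assumption for illustration, not live Google output).
html_doc = ('<span class="st">The <em>Beautiful Soup</em> Theater Collective '
            'was founded in the summer of 2010 <b>...</b><br /></span>')

soup = BeautifulSoup(html_doc, "html.parser")
span = soup.find("span", class_="st")

# unwrap() replaces each <em> tag with its own children, keeping the text.
for em in span.find_all("em"):
    em.unwrap()

print(span)
```

The `<b>` and `<br />` tags are left in place here; they can be unwrapped or removed in the same way if only plain text is wanted.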
Answered 2012-07-17T05:14:03.180

I built a simple regular expression that matches HTML tags, then called replace on the cleaned string to remove the dots.

import re

p = re.compile(r'<.*?>')
print p.sub('',str(sSpan)).replace('.','')

<span class="st">The <em>Beautiful Soup</em> is a collection of all the pretty places you would rather be. All posts are credited via a click through link. For further inspiration of pretty things, <b>...</b><br /></span>

The Beautiful Soup is a collection of all the pretty places you would rather be All posts are credited via a click through link For further inspiration of pretty things, 
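
Note that `replace('.','')` also removes the periods that end sentences, and HTML entities such as `&#39;` are left undecoded. A variation on the same idea is sketched below in Python 3; the helper name `strip_tags` and the sample string are assumptions made for illustration, and unlike the code above it keeps the dots:

```python
import html
import re

# Matches anything that looks like an HTML tag.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(fragment):
    """Remove tags, decode HTML entities, and collapse runs of whitespace."""
    text = TAG_RE.sub("", fragment)
    text = html.unescape(text)
    return " ".join(text.split())

# Shortened sample input (an assumption for illustration).
snippet = ('<span class="st">To connect with The <em>Beautiful Soup</em> Theater '
           'Collective, sign up for Facebook <b>...</b> We&#39;re thrilled<br /></span>')

print(strip_tags(snippet))
```

A regular expression is good enough for short, regular snippets like these, but it is not a general HTML parser.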
Answered 2012-07-17T17:59:57.740

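To get the text out of the span tag, you can use the .text attribute or the get_text() method that beautifulsoup provides. bs4 does all the heavy lifting, so you don't need to worry about how to get rid of the <em> tags.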
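
As a minimal illustration of `get_text()` on one of the snippets from the question (a sketch, assuming `bs4` is installed; the built-in `html.parser` is used here so that no extra parser is needed):

```python
from bs4 import BeautifulSoup

html_doc = ('<span class="st"><em>Beautiful Soup</em>: a library designed for '
            'screen-scraping HTML and XML.<br /></span>')

span = BeautifulSoup(html_doc, "html.parser").find("span", class_="st")

# get_text() concatenates every text node inside the tag and drops the markup.
text = span.get_text()
print(text)
```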

Code and full example (Google will not show more than ~400 results):

from bs4 import BeautifulSoup
import requests, lxml, urllib.parse

def print_extracted_data_from_url(url):
    # Send a browser-like User-Agent so Google serves the regular HTML page.
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text

    soup = BeautifulSoup(response, 'lxml')

    print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
    print(f'Current URL: {url}')

    # The class names below reflect Google's markup when this answer was
    # written and may have changed since.
    for container in soup.findAll('div', class_='tF2Cxc'):
        head_text = container.find('h3', class_='LC20lb DKV0Md').text
        head_sum = container.find('div', class_='IsZvec').text
        head_link = container.a['href']
        print(head_text)
        print(head_sum)
        print(head_link)
        print()

    # The "Next" link; select_one returns None on the last page.
    return soup.select_one('a#pnnext')

def scrape():
    next_page_node = print_extracted_data_from_url(
        'https://www.google.com/search?hl=en-US&q=coca cola')

    # Keep following the "Next" link until there is none.
    while next_page_node is not None:
        next_page_url = urllib.parse.urljoin('https://www.google.com',
                                             next_page_node['href'])

        next_page_node = print_extracted_data_from_url(next_page_url)

scrape()



Results via beautifulsoup

Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=coca cola

The Coca-Cola Company: Refresh the World. Make a Difference
We are here to refresh the world and make a difference. Learn more about the Coca-Cola Company, our brands, and how we strive to do business the right way. Contact Us · Careers · Coca-Cola · Coca-Cola System

2021 The Coca-Cola Company, all rights reserved. COCA-COLA®, "TASTE THE FEELING", and the Contour Bottle are trademarks of The Coca-Cola Company.

Together Tastes Better | Coca-Cola®
Coca-Cola is pairing up with celebrity chefs, talented athletes and more surprise guests all summer long to bring you and your loved ones together over the love ...
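
The code above paginates by following the "Next" link. Another option, suggested by the `start` parameter in the question's URL, is to build each page's URL directly from an offset. This is only a sketch: the helper name `build_search_urls` and the page size of 10 are assumptions, and whether Google honors these parameters for a given client is not guaranteed.

```python
from urllib.parse import urlencode

def build_search_urls(query, pages, per_page=10):
    """Return one search URL per page, stepping the start offset by per_page."""
    urls = []
    for page in range(pages):
        params = {"q": query, "hl": "en", "start": page * per_page}
        # urlencode escapes the query string (spaces become '+').
        urls.append("https://www.google.com/search?" + urlencode(params))
    return urls

for url in build_search_urls("coca cola", pages=3):
    print(url)
```

Each URL could then be passed to a function like print_extracted_data_from_url, stopping once a page returns no result containers.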

Alternatively, you can use the Google Search Engine Results API from SerpApi to do this. It is a paid API with a free plan. Check out the Playground to test it.


import os
from serpapi import GoogleSearch

def scrape():
  params = {
    "engine": "google",
    "q": "coca cola",
    "api_key": os.getenv("API_KEY"),
  }

  search = GoogleSearch(params)
  results = search.get_dict()

  print(f"Current page: {results['serpapi_pagination']['current']}")

  for result in results["organic_results"]:
      print(f"Title: {result['title']}\nLink: {result['link']}\n")

  while 'next' in results['serpapi_pagination']:
      search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
      results = search.get_dict()

      print(f"Current page: {results['serpapi_pagination']['current']}")

      for result in results["organic_results"]:
          print(f"Title: {result['title']}\nLink: {result['link']}\n")

scrape()


Results from SerpApi

Current page: 1
Title: The Coca-Cola Company: Refresh the World. Make a Difference
Link: https://www.coca-colacompany.com/home

Title: Coca-Cola
Link: https://www.coca-cola.com/

Title: Together Tastes Better | Coca-Cola®
Link: https://us.coca-cola.com/

Title: Coca-Cola - Wikipedia
Link: https://en.wikipedia.org/wiki/Coca-Cola

Title: Coca-Cola - Home | Facebook
Link: https://www.facebook.com/Coca-Cola/

Title: The Coca-Cola Company | LinkedIn
Link: https://www.linkedin.com/company/the-coca-cola-company

Title: Coca-Cola UNITED: Home
Link: https://cocacolaunited.com/

Title: World of Coca-Cola: Atlanta Museum & Tourist Attraction
Link: https://www.worldofcoca-cola.com/

Current page: 2
Title: Coca-Cola (@CocaCola) | Twitter
Link: https://twitter.com/cocacola?lang=en

Disclaimer: I work for SerpApi.

Answered 2021-04-13T08:52:09.430