3

I am using OpenURI and Nokogiri to scrape a Google search page:

 require 'open-uri'
 require 'nokogiri'
 url = 'http://www.google.co.uk/search?&q=toys&start=0&num=&complete=0'
 # URI.open is required on Ruby 3.0+; older versions also accepted a bare open(url)
 doc = Nokogiri::HTML(URI.open(url))
 mas = doc.css('li.g')[7]
 mas.at_css('.mas-row')

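Of these results, I am only interested in one:

"Amazon.co.uk: Toys - Harry Potter: Toys & Games"

I want to get the data from its "div class mas-row".

I can't find it. I looked through the doc variable, but it isn't there. I then searched for the text inside that div: I found part of the text in the first div, but not in the next one.

Can anyone help me with this?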


2 Answers

3

The div with class mas-row is not included in the HTML. It is rendered by JavaScript.

Use a library that can handle JavaScript, such as Selenium.

answered 2013-10-02T16:58:28.810
0

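First, it is not rendered by JavaScript. Second, the request may return nothing, because Google blocks requests that do not carry a browser-like user-agent (you can look up what your user-agent is). Third, if you only want to retrieve one (the first) result, you can use Nokogiri's at_css/at_xpath shortcuts for css/xpath, for example:

doc.at_css(".yuRUbf a h3/text()")  #=> Harry Potter: Toys & Games - Amazon.co.uk ...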

Code:

require 'nokogiri'
require 'httparty'

headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  q: "Amazon.co.uk: Toys - Harry Potter: Toys & Games",
  hl: "en"
}

response = HTTParty.get('https://www.google.com/search',
                        query: params,
                        headers: headers)
doc = Nokogiri::HTML(response.body)

# extract all organic results
puts doc.css(".yuRUbf a h3/text()"),
     doc.css(".yuRUbf a/@href")

---
=begin
harry potter: Toys Store - Amazon.co.uk
harry potter toys - Amazon.com
harry potter: Toys & Games - Amazon.com
harry potter toys: Toys & Games - Amazon.com
Toys & Games - Amazon.com
Harry Potter: Toys & Games - Amazon.com
1-48 of 405 results for "harry potter lego" - Amazon
harry potter lego sets - Amazon.com
https://www.amazon.co.uk/Toys-Games-Harry-Potter/s?rh=n%3A468292%2Cp_89%3AHarry+Potter
https://www.amazon.co.uk/harry-potter-toys/s?k=harry+potter+toys
https://www.amazon.co.uk/harry-potter-Toys-Store/s?k=harry+potter&rh=n%3A468292
https://www.amazon.com/harry-potter-toys/s?k=harry+potter+toys
https://www.amazon.com/harry-potter-Toys-Games/s?k=harry+potter&rh=n%3A165793011
https://www.amazon.com/harry-potter-toys-Games/s?k=harry+potter+toys&rh=n%3A165793011
https://www.amazon.com/toys/b?ie=UTF8&node=165793011
https://www.amazon.com/Toys-Games-Harry-Potter/s?rh=n%3A165793011%2Cp_lbr_characters_browse-bin%3AHarry+Potter
https://www.amazon.com/harry-potter-lego/s?k=harry+potter+lego
https://www.amazon.com/harry-potter-lego-sets/s?k=harry+potter+lego+sets
=end
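To check the user-agent point without contacting Google, a minimal sketch using only Ruby's standard library (Net::HTTP instead of the HTTParty gem used above) can build the request and inspect it before anything is sent. The method name build_search_request and the shortened user-agent string are made up for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a GET request for a search,
# with a caller-supplied User-Agent header.
def build_search_request(query, user_agent)
  uri = URI('https://www.google.com/search')
  uri.query = URI.encode_www_form(q: query, hl: 'en')

  request = Net::HTTP::Get.new(uri)
  request['User-Agent'] = user_agent # replaces Net::HTTP's default "Ruby" value
  request
end

request = build_search_request('toys', 'Mozilla/5.0 (example browser-like string)')
puts request.path
puts request['User-Agent']

# Sending it would look roughly like this (left commented out on purpose):
# response = Net::HTTP.start(request.uri.hostname, request.uri.port, use_ssl: true) do |http|
#   http.request(request)
# end
```

Inspecting the request first makes it easy to confirm that the header being sent is the browser-like one and not the library default.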

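Alternatively, you can use the Google Organic Results API from SerpApi to do this. It is a paid API with a free plan. One of the main differences is that you only need to iterate over structured JSON.

Code to integrate: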

require 'google_search_results' 

params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "Amazon.co.uk: Toys - Harry Potter: Toys & Games",
  hl: "en"
}

search = GoogleSearch.new(params)
hash_results = search.get_hash

# [0] first element from organic results
puts hash_results[:organic_results][0][:title], 
     hash_results[:organic_results][0][:link]

#=> Harry Potter: Toys & Games - Amazon.co.uk
#=> https://www.amazon.co.uk/Toys-Games-Harry-Potter/s?rh=n%3A468292%2Cp_89%3AHarry+Potter
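What iterating over structured JSON means can be sketched offline with Ruby's standard json library. The payload below is made up and only borrows the organic_results, title and link keys used above; it is not an actual API response, and the method name first_organic_result is hypothetical.

```ruby
require 'json'

# Made-up payload shaped like the keys used above; not a real API response.
SAMPLE_JSON = <<~JSON
  {
    "organic_results": [
      { "title": "Example result A", "link": "https://example.com/a" },
      { "title": "Example result B", "link": "https://example.com/b" }
    ]
  }
JSON

# Returns [title, link] for the first organic result, or nil when there is none.
def first_organic_result(results)
  first = results.dig(:organic_results, 0)
  return nil if first.nil?

  [first[:title], first[:link]]
end

# symbolize_names gives symbol keys, matching the hash access style used above.
results = JSON.parse(SAMPLE_JSON, symbolize_names: true)

title, link = first_organic_result(results)
puts title, link

# Iterating over every result is a plain each over an array of hashes.
results.fetch(:organic_results, []).each do |result|
  puts "#{result[:title]} -> #{result[:link]}"
end
```

Using dig and fetch with a default keeps the traversal from raising when the organic_results key is missing or empty.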

Disclaimer: I work for SerpApi.

answered 2021-08-11T10:18:27.153