r - Scraping a dynamic e-commerce page with infinite scroll

Question

I am using rvest in R to do some scraping. I know some HTML and CSS.

I want to get the price of every product at this URI:

http://www.linio.com.co/tecnologia/celulares-telefonia-gps/

New items load as you move down the page (that is, as you scroll).

What I have done so far:

library(rvest)
Linio_Celulares <- read_html("http://www.linio.com.co/celulares-telefonia-gps/")

Linio_Celulares %>%
  html_nodes(".product-itm-price-new") %>%
  html_text()

This gives me what I need, but only for the first 25 elements (the ones loaded by default).

 [1] "$ 1.999.900" "$ 1.999.900" "$ 1.999.900" "$ 2.299.900" "$ 2.279.900"
 [6] "$ 2.279.900" "$ 1.159.900" "$ 1.749.900" "$ 1.879.900" "$ 189.900"  
[11] "$ 2.299.900" "$ 2.499.900" "$ 2.499.900" "$ 2.799.000" "$ 529.900"  
[16] "$ 2.699.900" "$ 2.149.900" "$ 189.900"   "$ 2.549.900" "$ 1.395.900"
[21] "$ 249.900"   "$ 41.900"    "$ 319.900"   "$ 149.900"

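Question: how can I get all the elements of this dynamic section?

I suppose I could scroll the page by hand until every element has loaded and then use read_html(URL). But that seems like a lot of work (I plan to do this for several sections). There should be a programmatic way to do it.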

score 25 · Accepted Answer

As @nrussell suggested, you can use RSelenium to scroll down the page programmatically before getting the source code.

For example, you could do:

library(RSelenium)
library(rvest)

# start a Selenium server and a browser session
# (rsDriver() takes the place of checkForServer()/startServer(), which are defunct in current RSelenium)
rD <- rsDriver(browser = "firefox")
remDr <- rD$client

# navigate to your page
remDr$navigate("http://www.linio.com.co/tecnologia/celulares-telefonia-gps/")

# scroll down 5 times, waiting 3 seconds for the page to load each time
for (i in 1:5) {
  remDr$executeScript(paste0("window.scrollTo(0, ", i * 10000, ");"))
  Sys.sleep(3)
}

# get the page HTML (returned as a one-element list)
page_source <- remDr$getPageSource()

# parse it and extract the price text
read_html(page_source[[1]]) %>%
  html_nodes(".product-itm-price-new") %>%
  html_text()
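
The fixed count of 5 scrolls above is a guess about how much content the page holds. A minimal sketch of an alternative, assuming the page grows `document.body.scrollHeight` whenever new items load: keep scrolling until the height stops changing. `scroll_to_bottom`, `pause` and `max_scrolls` are hypothetical names introduced here for illustration; only `executeScript` comes from RSelenium.

```r
# Hypothetical helper (not part of RSelenium): scroll until the document
# height stops growing, or until max_scrolls is reached.
scroll_to_bottom <- function(remDr, pause = 3, max_scrolls = 50) {
  # executeScript() returns a list; the height is its first element
  get_height <- function() {
    remDr$executeScript("return document.body.scrollHeight;")[[1]]
  }

  last_height <- get_height()
  n_scrolls <- 0

  while (n_scrolls < max_scrolls) {
    # jump to the current bottom of the page, then wait for new items
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
    Sys.sleep(pause)
    n_scrolls <- n_scrolls + 1

    new_height <- get_height()
    if (new_height <= last_height) break  # nothing new was loaded
    last_height <- new_height
  }

  invisible(n_scrolls)  # number of scrolls actually performed
}
```

With an open session this would be called as `scroll_to_bottom(remDr)` in place of the `for` loop, followed by the same `getPageSource()` step. The `max_scrolls` cap is there so a page that never stops loading cannot hang the script.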

score -1 · Answer

library(rvest)

# first page of the paginated listing
url <- "https://www.linio.com.co/c/celulares-y-tablets?page=1"
page <- read_html(url)

# extract the price text of every product on this page
html_nodes(page, css = ".price-secondary") %>% html_text()

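Loop over the site's pages, https://www.linio.com.co/c/celulares-y-tablets?page=2, then page=3 and so on, and you can scrape the data easily.

Edited on July 5, 2019

The site's elements have changed, so here is the new code: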

library(rvest)

# first page of the paginated listing
url <- "https://www.linio.com.co/c/celulares-y-tablets?page=1"
page <- read_html(url)

# extract the price text of every product on this page
html_nodes(page, css = ".price-main") %>% html_text()
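
The page-by-page loop this answer describes could be sketched roughly as follows. `build_page_urls` and `scrape_prices` are hypothetical helper names, the range `1:3` is an arbitrary assumption about how many pages to fetch, and the `.price-main` selector is the one from the 2019 edit, so it may no longer match the live site.

```r
# Hypothetical helper: build the URL of each results page
build_page_urls <- function(base_url, pages) {
  paste0(base_url, "?page=", pages)
}

# Hypothetical helper: read one page and return its price strings
scrape_prices <- function(url, css = ".price-main") {
  page <- rvest::read_html(url)
  rvest::html_text(rvest::html_nodes(page, css))
}

# Assumption: only the first three pages are wanted
urls <- build_page_urls("https://www.linio.com.co/c/celulares-y-tablets", 1:3)

# Live requests, left commented out; the one-second pause keeps the
# request rate low
# prices <- unlist(lapply(urls, function(u) {
#   Sys.sleep(1)
#   scrape_prices(u)
# }))
```

Keeping URL construction separate from fetching makes it easy to check the list of pages before sending any requests.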
