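Using the XPath given below, I get the date content I want from various pages. But one page in particular, http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss, gives the expected result in the Chrome inspector, while the same XPath returns nothing in R.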


When I use the following XPath in Chrome:

xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])'  

I get the expected match.

When I use the same XPath in R with the xml2 package, I get an empty node set:

library(dplyr)

library(xml2)

html_page<-read_html("http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")

html_page%>%
  xml_find_all(xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])')
#> {xml_nodeset (0)}

What am I missing?

1 Answer

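Extrapolating from the above: the page appears to build its content with JavaScript, so read_html() only sees the unrendered source, which would explain why the XPath matches in Chrome but not in R. Here are three ways around that.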
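A small, self-contained sketch of why this can happen. The HTML below is a made-up miniature, not the real page, and the `publishedAt` key is hypothetical; it only mimics a page whose date lives in a script payload until JavaScript renders it.

```r
library(xml2)

# Hypothetical miniature of a JavaScript-rendered page: the date exists only
# inside a <script> payload, not as an element in the served markup.
raw_html <- '<html><body>
<h1>Agenda</h1>
<div id="app"></div>
<script>window.$S = {"publishedAt": "August 4, 2018"};</script>
</body></html>'

doc <- read_html(raw_html)

# No element carries the date, so an element-based XPath finds nothing ...
date_nodes <- xml_find_all(doc, '//span[@class="s-blog-date"]')
length(date_nodes)

# ... although the text is present in the script source.
script_src <- xml_text(xml_find_first(doc, "//script"))
in_script <- grepl("August 4, 2018", script_src, fixed = TRUE)
in_script
```

Chrome's inspector shows the document after scripts have run, whereas xml2 parses only what the server sent.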

Using decapitated

library(rvest)
library(decapitated)
library(tidyverse)

doc <- decapitated::chrome_read_html("http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")

html_nodes(doc, xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])')
## {xml_nodeset (1)}
## [1] <span class="s-blog-date">August 4, 2018</span>

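Please read the README and the package docs, since you need Chrome (preferably a separate Chromium binary, as noted in the package) and some environment variables set up. You will have to debug any setup issues yourself.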
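Once rendered HTML is available, a narrower XPath may be enough for this particular page, since the output above shows the date sitting in a span with class s-blog-date. A sketch on a hard-coded fragment standing in for the rendered document (whether other pages use the same class is an assumption to verify):

```r
library(xml2)

# Stand-in for the rendered document returned by a headless browser.
rendered <- read_html(
  '<div><h1>Agenda</h1><span class="s-blog-date">August 4, 2018</span></div>'
)

# Select the date element directly by its class attribute.
date_text <- xml_text(xml_find_all(rendered, '//span[@class="s-blog-date"]'))
date_text
```

The long month-matching XPath remains the better fit if the same expression has to work across sites with different markup.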

Using splashr

The splashr package requires the reticulate package, Docker, and the Python docker module. Expect more self-debugging if you run into problems with those:

library(rvest)
library(splashr)
library(tidyverse)

sp <- splashr::start_splash()

doc <- render_html(splash_local, "http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")

html_nodes(doc, xpath = '((//h1/parent::*/following::*|//h1/ancestor::*[position()<3]/descendant-or-self::*)[position()<150 and (string-length(text())<150 and (contains(text(), "Jan") or contains(text(), "Feb") or contains(text(), "Mar") or contains(text(), "Apr") or contains(text(), "May") or contains(text(), "Jun") or contains(text(), "Jul") or contains(text(), "Aug") or contains(text(), "Sep") or contains(text(), "Oct") or contains(text(), "Nov") or contains(text(), "Dec")))])')
## {xml_nodeset (1)}
## [1] <span class="s-blog-date">August 4, 2018</span>

killall_splash()

Using V8

To avoid external programs, you can use V8 to evaluate the page's JavaScript variable and get at the content:

library(rvest)
library(V8)
library(tidyverse)

ctx <- v8()

doc <- read_html("http://eventsgeneva.strikingly.com//blog/agenda-geneve-something-you-should-never-miss")

html_nodes(doc, xpath=".//script")[[1]] %>% # get 1st <script>
  html_text() %>% # get contents of it
  str_replace(regex("^.*window\\.", multiline=TRUE), "var $S = {};\n") %>% # make the variable usable in V8
  ctx$eval() # evaluate the javascript
## [1] "[object Object]"

pg <- ctx$get("$S") # marshall it to R
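The rewrite-then-evaluate step can be tried in isolation. This is a minimal sketch on a made-up one-line script, assuming the page's first script has the shape `window.$S = {...};`; the payload here is invented and far smaller than the real one.

```r
library(V8)

ctx <- v8()

# Invented stand-in for the page's first <script> body.
js <- 'window.$S = {"blogPostData": {"content": {"sections": [{"type": "Blog.Section"}]}}};'

# There is no `window` object in a bare V8 context, so turn the assignment
# into a plain variable declaration: "var $S = {...};"
js <- sub("window.", "var ", js, fixed = TRUE)

ctx$eval(js)            # run the JavaScript
demo <- ctx$get("$S")   # convert the JS object into an R list

demo$blogPostData$content$sections$type
```

Arrays of objects come back as data frames, which is why sections shows up as a data.frame in the structure dump.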

It is a big structure, so examine it methodically:

str(pg, 1)
## List of 6
##  $ globalConf        :List of 26
##  $ conf              :List of 12
##  $ miniProgramAppType: NULL
##  $ blogPostData      :List of 5
##  $ siteData          :List of 5
##  $ stores            :List of 3

str(pg$blogPostData, 1)
## List of 5
##  $ blogPostMeta:List of 25
##  $ pageMeta    :List of 33
##  $ content     :List of 8
##  $ settings    :List of 2
##  $ pageMode    : NULL

str(pg$blogPostData$content, 1)
## List of 8
##  $ type            : chr "Blog.BlogData"
##  $ id              : chr "f_cc4ace2d-21ed-4b94-83a0-e83497e5afc4"
##  $ defaultValue    : NULL
##  $ showComments    : logi TRUE
##  $ showShareButtons: NULL
##  $ header          :List of 6
##  $ footer          :List of 5
##  $ sections        :'data.frame':    9 obs. of  4 variables:

The content appears to be here:

str(pg$blogPostData$content$sections)
## 'data.frame':    9 obs. of  4 variables:
##  $ type        : chr  "Blog.Section" "Blog.Section" "Blog.Section" "Blog.Section" ...
##  $ id          : chr  "f_9ca5a1d7-ccb8-4315-9883-bcd43d271b9c" "f_4b7b30f1-387c-4cbe-aaed-ddaedea92cc1" "f_252813ac-b6cb-484b-81f5-64d7f0745c8e" "f_bd7412a4-b94b-4c5a-8cdd-a48931639dce" ...
##  $ defaultValue: logi  NA NA NA NA NA NA ...
##  $ component   :'data.frame':    9 obs. of  6 variables:
##   ..$ type        : chr  "RichText" "RichText" "RichText" "RichText" ...
##   ..$ id          : chr  "f_4e41d6f3-8449-4f66-b701-28d1bcfb08c9" "f_c27703de-8679-4916-9697-220cb8c7a74d" "f_c3c20474-99fc-434a-aff1-102d2a342450" "f_7b3e5247-39ef-42c7-b95c-f0be0b6e9728" ...
##   ..$ defaultValue: logi  FALSE NA NA NA NA NA ...
##   ..$ value       : chr  "<p style=\"text-align: justify;\">We all make our plans beforehand in order to avoid any unnecessary issues. So"| __truncated__ "<p style=\"text-align: justify;\">Take a glance at the below-listed events and plan accordingly -</p>" "<p style=\"text-align: justify;\"><u>Siestes dominicales</u> – Here you are invited to groove on the grass and "| __truncated__ "<p style=\"text-align: justify;\"><u>Sonoboat ACT</u> – Neptune is one the most popular and historic sailing bo"| __truncated__ ...
##   ..$ backupValue : logi  NA NA NA NA NA NA ...
##   ..$ version     : int  1 NA NA NA NA NA NA 1 1

Either parse each value individually, or paste0() them together into a single HTML block and parse that.
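A minimal sketch of the second option. The `values` vector is a shortened, hand-written stand-in for pg$blogPostData$content$sections$component$value, and the wrapping div is an illustrative choice rather than anything the page requires.

```r
library(xml2)

# Shortened stand-in for pg$blogPostData$content$sections$component$value
values <- c(
  "<p style=\"text-align: justify;\">Take a glance at the below-listed events</p>",
  "<p style=\"text-align: justify;\"><u>Siestes dominicales</u> on the grass</p>"
)

# Join the fragments into one block and parse it as HTML.
block <- read_html(paste0("<div>", paste0(values, collapse = "\n"), "</div>"))

# Pull the text of each paragraph; any XPath can be run against `block`.
paras <- xml_text(xml_find_all(block, ".//p"))
paras
```

With the real data, the month-matching XPath from the question could be run against `block` in the same way.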

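As an aside, Strikingly has one of the daftest, lowest-integrity, least-secure publishing solutions I have seen in a while. I know you are only scraping it, but I would advise anyone considering using them not to.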

answered 2018-08-31T12:15:08.920