r - Extracting attributes from similar nodes in OpenIR

Question

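The goal is to extract the "href" of each paper title from a search results page of an IR and collect them in a data frame. The results page is not well structured: the paper title, issue information, authors, and download button all sit in the same field, separated only by "span" (between "title", "issue", and "authors") and "sup" (within "authors").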

results<-"http://ir.las.ac.cn/handle/12502/8473/browse?type=dateissued"
library(rvest)
resultsource <- read_html(results)
itemLine <- html_node(resultsource, xpath ='//tr[@class="itemLine"]')
# gather labels and values of item metadata in miscTable2
titleLine <- html_nodes(itemLine, xpath ='//span/a[@href][@target]')
titlehref <- xml_attrs(titleLine, "href")
resultstxt <- html_text(titleLine, trim = TRUE)

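The program above runs without errors, but "titleLine" contains a lot of redundant nodes, and "titlehref" holds only a single match, "class" = "itemLine", with no URL at all. My questions are: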

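How can the href of the paper titles be located precisely? I use a second level of "html_nodes" to hold all target hrefs. However, the "href"s under the "sup" tags are still in "titleLine", and so is "target". Can we use the "target" attribute to locate the correct "href" without having it show up in "titleLine"?
How do we locate attributes with complex "values"? In the program above I only use "href". I previously tried using "xpath style", but it did not help. I would like to use a namespace to identify the papers' URLs, but as far as I can tell ns can only be extracted from "xmlns" attributes and cannot be assigned manually (as in titlehref <- xml_attrs(titleLine, "href", ns="http://ir.las.ac.cn/handle")).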

How can I adapt to the structure of this IR to get the correct result? Thank you very much.
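
A likely reason for the redundant nodes can be sketched on a made-up fragment. The markup, the link targets, and the names `first_row`, `leaky`, and `scoped` are illustrative assumptions, not the real page. An XPath that begins with `//` is evaluated from the document root even when it is applied to an already selected node, whereas `.//` stays inside that node. The sketch also uses `html_attr()`, which returns a single attribute as a character vector.

```r
library(rvest)

# Made-up fragment that only imitates the layout described in the question.
html <- '
<table>
  <tr class="itemLine"><td>1</td><td>
    <span><a href="/handle/1" target="_blank">Paper one</a></span>
    <span>Author A<sup><a href="#aff1" target="_self">1</a></sup></span>
  </td></tr>
  <tr class="itemLine"><td>2</td><td>
    <span><a href="/handle/2" target="_blank">Paper two</a></span>
    <span>Author B<sup><a href="#aff2" target="_self">1</a></sup></span>
  </td></tr>
</table>
<div><span><a href="/about" target="_blank">About</a></span></div>'
doc <- read_html(html)

# html_node() (singular) keeps only the first matching row.
first_row <- html_node(doc, xpath = '//tr[@class="itemLine"]')

# "//" restarts the search at the document root, so links from other rows
# and from outside the table come back as well.
leaky <- html_nodes(first_row, xpath = '//span/a[@href][@target]')

# ".//" searches only below the selected row.
scoped <- html_nodes(first_row, xpath = './/span/a[@href][@target]')

leaky_href  <- html_attr(leaky, "href")
scoped_href <- html_attr(scoped, "href")
```

Using `html_nodes()` instead of `html_node()` for the rows, together with a relative path, would keep every row while still excluding links that sit outside the table.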

score 0 · Accepted Answer

You can index the desired <span> target as well as the <td>:

library(rvest)

pg <- read_html("http://ir.las.ac.cn/handle/12502/8473/browse?type=dateissued")

html_nodes(pg, xpath=".//tr[@class='itemLine']/td[2]/span[1]/a") %>% 
  html_text()
##  [1] "Data-driven Discovery: A New Era of Exploiting the Literature and Data"                                                                                       
##  [2] "Contents Index to Volume 1"                                                                                                                                   
##  [3] "Topic Detection Based on Weak Tie Analysis: A Case Study of LIS Research"                                                                                     
##  [4] "Open Peer Review in Scientific Publishing: A Web Mining Study of <i>PeerJ</i> Authors and Reviewers"                                                          
##  [5] "Mapping Diversity of Publication Patterns in the Social Sciences and Humanities: An Approach Making Use of Fuzzy Cluster Analysis"                            
##  [6] "Under-reporting of Adverse Events in the Biomedical Literature"                                                                                               
##  [7] "Predictive Characteristics of Co-authorship Networks: Comparing the Unweighted, Weighted, and Bipartite Cases"                                                
##  [8] "International Conference on Scientometrics & Informetrics October16-20, 2017, Wuhan · China"                                                                  
##  [9] "Identification and Analysis of Multi-tasking Product Information Search Sessions with Query Logs"                                                             
## [10] "The 1<sup>st</sup> International Conference on Datadriven Knowledge Discovery: When Data Science Meets Information Science. June 19-22, 2016, Beijing · China"
## [11] "The Power-weakness Ratios (PWR) as a Journal Indicator: Testing the “Tournaments” Metaphor in Citation Impact Studies"                                        
## [12] "Document Type Profiles in <i>Nature, Science</i>, and <i>PNAS</i>: Journal and Country Level"                                                                 
## [13] "Can Automatic Classification Help to Increase Accuracy in Data Collection?"                                                                                   
## [14] "Knowledge Representation in Patient Safety Reporting: An Ontological Approach"                                                                                
## [15] "Information Science Roles in the Emerging Field of Data Science"                                                                                              
## [16] "Data Science Altmetrics"                                                                                                                                      
## [17] "Comparative Study of Trace Metrics between Bibliometrics and Patentometrics"                                                                                  
## [18] "Identifying Scientific Project-generated Data Citation from Full-text Articles: An Investigation of TCGA Data Citation"                                       
## [19] "Mining Related Articles for Automatic Journal Cataloging"                                                                                                     
## [20] "Critical Factors for Personal Cloud Storage Adoption in ChinaCritical Factors for Personal Cloud Storage Adoption in China"

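The HTML tags in the output above (the <i> and <sup> tags visible in some titles) are errors in the page itself (they also show up in the rendered browser view). I think someone went too far with XSS prevention.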
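
The same positional XPath can also return the links rather than the text. What follows is a minimal sketch on an invented fragment: the markup, the `/handle/...` URLs, and the names `title_links`, `papers`, and `by_value` are assumptions for illustration, since the real page is not reproduced here. It pairs `html_text()` with `html_attr()` in a data frame, and shows an XPath `starts-with()` test as one possible way to select links by their attribute value without any namespace.

```r
library(rvest)

# Invented fragment that only imitates the row layout described in the question.
html <- '
<table>
  <tr class="itemLine">
    <td>1</td>
    <td>
      <span><a href="/handle/12502/1" target="_blank">Paper one</a></span>
      <span><i>Issue 1</i></span>
      <span>Author A<sup><a href="#aff1" target="_blank">1</a></sup></span>
    </td>
  </tr>
  <tr class="itemLine">
    <td>2</td>
    <td>
      <span><a href="/handle/12502/2" target="_blank">Paper two</a></span>
      <span><i>Issue 2</i></span>
      <span>Author B<sup><a href="#aff2" target="_blank">1</a></sup></span>
    </td>
  </tr>
</table>'
pg <- read_html(html)

# First <span> of the second <td> in every row: the title link only.
title_links <- html_nodes(pg, xpath = ".//tr[@class='itemLine']/td[2]/span[1]/a")

papers <- data.frame(
  title = html_text(title_links, trim = TRUE),
  href  = html_attr(title_links, "href"),
  stringsAsFactors = FALSE
)

# Alternative: select by the attribute value instead of by position.
by_value <- html_nodes(
  pg,
  xpath = ".//tr[@class='itemLine']//a[starts-with(@href, '/handle/')]"
)
by_value_href <- html_attr(by_value, "href")
```

If the extracted links turn out to be relative, `xml2::url_absolute()` can resolve them against the page URL.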

score 0 · Accepted Answer

Try this.

library(rvest)
url<-"http://ir.las.ac.cn/handle/12502/8473/browse?type=dateissued"
page<-read_html(url)  # html_session() is deprecated in rvest >= 1.0; a parsed document is enough here

# DATA EXTRACTION
title<-html_nodes(page,css="strong") %>% html_text()
title<-title[5:length(title)]
download_link<-html_nodes(page, css= "span:nth-child(7) a+ a") %>% html_attr("href")
issue_information<-html_nodes(page, css= "i") %>% html_text()
authors<-html_nodes(page,css=".itemLine span:nth-child(5)") %>% html_text()

# CONVERT TO DATA FRAME
k<-data.frame(title,download_link,issue_information,authors)

Run the code on each page to get the complete data frame.

To locate the different elements, I used the SelectorGadget Chrome add-in and then used the resulting selectors in the code.
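
Running the extraction on every page and stacking the results could look roughly like the sketch below. The URLs of the further result pages are not given here, so two invented HTML strings stand in for downloaded pages; `scrape_page` and `pages` are hypothetical names, and the `strong` and `i` selectors simply mirror the ones used above. Extracting per row with `html_node()` returns one result per row (NA where an element is missing), which keeps the columns aligned.

```r
library(rvest)

# Hypothetical helper: turns one parsed results page into a data frame.
scrape_page <- function(doc) {
  rows <- html_nodes(doc, css = "tr.itemLine")
  data.frame(
    # html_node() yields exactly one result per row, NA if nothing matches.
    title             = html_text(html_node(rows, css = "strong"), trim = TRUE),
    issue_information = html_text(html_node(rows, css = "i"), trim = TRUE),
    stringsAsFactors  = FALSE
  )
}

# Invented stand-ins for two result pages; real code would read each page URL.
pages <- c(
  '<table>
     <tr class="itemLine"><td><strong>Paper one</strong> <i>Issue 1</i></td></tr>
     <tr class="itemLine"><td><strong>Paper two</strong> <i>Issue 2</i></td></tr>
   </table>',
  '<table>
     <tr class="itemLine"><td><strong>Paper three</strong></td></tr>
   </table>'
)

# Scrape each page, then bind the per-page data frames into one.
all_pages <- do.call(rbind, lapply(pages, function(p) scrape_page(read_html(p))))
```

Extracting each column separately over the whole page, as in the code above, relies on every row containing every element; the per-row variant trades a little verbosity for alignment when some rows are incomplete.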
