xml - Retrieving data sequentially from web pages in R

Question

I ran an advanced search on a website and got a list of results. For each result, I want to extract two fields: "Referencia:" and "CIF".

#This is the url with the results of the search
url <- paste0("http://www.boe.es/buscar/boe.php?campo%5B1%5D=DOC&dato%5B1%5D=edicto+auto+declaracion+concurso+CIF",
  "&campo%5B6%5D=FPU&dato%5B6%5D%5B0%5D=25%2F04%2F2013&dato%5B6%5D%5B1%5D=30%2F04%2F2013",
  "&sort_field%5B0%5D=fpu&sort_order%5B0%5D=desc&sort_field%5B1%5D=ref&sort_order%5B1%5D=asc&accion=Buscar")

#This is the url of one of the results.
example <- "http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895"

The CIF field usually has the format X00000000 or X-00000000, where X=c("A","B") and 0=0:9. In the example, the Referencia field is BOE-B-2013-15895 and the CIF is B-32210196.

Can you help me do this from R?

score 1 · Accepted Answer

1) Getting the Referencia is a piece of cake

substrRight <- function(x, n){
  sapply(x, function(xx)
  substr(xx, (nchar(xx)-n+1), nchar(xx)))
}

library(XML)
u<-"http://www.boe.es/buscar/boe.php?campo%5B1%5D=DOC&dato%5B1%5D=edicto+auto+declaracion+concurso+CIF%20&campo%5B6%5D=FPU&dato%5B6%5D%5B0%5D=25%2F04%2F2013&dato%5B6%5D%5B1%5D=30%2F04%2F2013%20&sort_field%5B0%5D=fpu&sort_order%5B0%5D=desc&sort_field%5B1%5D=ref&sort_order%5B1%5D=asc&accion=Buscar" #link
doc1<-htmlParse(u) #get html
kbbRoot <- xmlRoot(doc1) #parse it into xml
els<-getNodeSet(kbbRoot,"//*[contains(concat( ' ', @class, ' ' ), concat( ' ', 'resultado-busqueda-link-defecto', ' ' ))]") #get all links by xpath
links<-sapply(els, function(el) xmlGetAttr(el, "href")) #get inner (start with .../)
links<-sapply(links, function(x)  substr(x,start=3,stop=nchar(x))) #delete ../  
links<-sapply(links, function(x)  paste("http://www.boe.es", x,sep=""))#generate correct link
Referencia<-sapply(links, function(x) substrRight(x,16)) # get referencia from links
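
As an alternative to counting 16 characters from the right, the reference can be read from the `id=` query parameter, so the result does not depend on the reference length. This is only a minimal sketch: it assumes every link has the form shown in the question (`.../doc.php?id=BOE-B-2013-15895`), and `get_referencia` is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: return the value of the id= parameter of each URL,
# or NA when a URL has no id= parameter.
# Assumes links look like ".../doc.php?id=BOE-B-2013-15895".
get_referencia <- function(urls) {
  pattern <- "^.*[?&]id=([^&#]+).*$"
  ifelse(grepl(pattern, urls), sub(pattern, "\\1", urls), NA_character_)
}

example <- "http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895"
referencia <- get_referencia(example)
print(referencia)
```

Because the helper is vectorised, it could be applied to the whole `links` vector in one call, for example `Referencia <- get_referencia(links)`.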

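2) The CIF is much more complicated. You have to use a regular expression. Unfortunately, I am not strong in that area, so ask others on the forum: "Which regular expression should be used to get the CIF value from a string?"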

CIFRA<-function (u){
  doc1<-htmlParse(u)#get html
  kbbRoot <- xmlRoot(doc1)# parse it
  els<-getNodeSet(kbbRoot,"//*[contains(concat('', @class,''), concat('', 'parrafo', '' ))]")#select text
  l<-sapply(els, xmlValue) #analyse each sentences
  x<-regexpr(pattern="[A-Z][0-9]+",text=l)#Try to find CIF by using RegEXP
  #regexp return position in string
  ind<-which.max(x) #'number of row with CIF'
  st<- x[ind]-3 #start position
  en<-st+attr(x, "match.length")[ind]-1 #finish
  res<-substring(l[ind],st,en) #select text between start and finish
  res #return the extracted text explicitly
}

CIF<-sapply(links, function(x) CIFRA(x))
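
One possible answer to the regular-expression question, offered as a sketch rather than a tested solution for every page: if the CIF follows the format given in the question (one capital letter, an optional hyphen, then eight digits), a pattern such as `[A-Z]-?[0-9]{8}` can be extracted with `regexpr` and `regmatches`. The name `extract_cif` is hypothetical, and the sample sentences are made up for illustration; real pages may lay the CIF out differently.

```r
# Hypothetical helper: return the first CIF-like token in each string, or NA.
# Assumes the format from the question: letter, optional hyphen, eight digits.
extract_cif <- function(text) {
  pattern <- "\\b[A-Z]-?[0-9]{8}\\b"
  hits <- regexpr(pattern, text, perl = TRUE)   # -1 where nothing matches
  out <- rep(NA_character_, length(text))
  out[hits != -1] <- regmatches(text, hits)     # matched substrings only
  out
}

# Made-up sample text, not taken from the site.
sample_text <- c("Deudor: EMPRESA S.L., con CIF B-32210196.",
                 "Texto sin identificador.")
print(extract_cif(sample_text))
```

Inside a function like `CIFRA`, the same helper could be applied to the vector of paragraph texts, keeping the first non-NA value.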

score 1 · Accepted Answer

To get the content, take a look at the httr package. You can use something like

content(GET(url))
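
A slightly fuller sketch of the same idea, assuming the `httr` and `XML` packages are installed: fetch the page with `GET`, stop early on an HTTP error, and pass the text to `htmlParse`. The name `fetch_doc` is hypothetical, and UTF-8 is an assumed encoding that may need adjusting for the actual site.

```r
# Hypothetical helper: download a page with httr and parse it with XML.
# Assumes both packages are installed; UTF-8 is an assumed encoding.
fetch_doc <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)              # stop on 4xx/5xx responses
  html <- httr::content(resp, as = "text", encoding = "UTF-8")
  XML::htmlParse(html, asText = TRUE)      # parsed document for getNodeSet()
}

# Usage (needs network access):
# doc <- fetch_doc("http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895")
```

The returned document can then be queried with `getNodeSet` in the same way as in the other answer.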

answered 2013-04-25T16:51:01.340
