I am writing an update to my rNOMADS package to include all the models on the NOMADS web site. To do this, I must search the HTML directory tree for each model. I do not know beforehand how deep this tree is or how many branches it contains. Therefore I am writing a simple web crawler to recursively search the top page for links, follow each link, and return the URLs of pages that have no further links. Such a page is the download page for model data. Here is an example of a URL that must be searched:

http://nomads.ncep.noaa.gov/cgi-bin/filter_cmcens.pl

I want to get the addresses of all web pages below this one. I have attempted the following code:
library(XML)

url <- "http://nomads.ncep.noaa.gov/cgi-bin/filter_cmcens.pl"

WebCrawler <- function(url) {
    doc <- htmlParse(url)
    links <- xpathSApply(doc, "//a/@href")
    free(doc)
    if (is.null(links)) { # If there are no links, this is the page we want, return it!
        return(url)
    } else {
        for (link in links) { # Call recursively on each link found
            print(link)
            return(WebCrawler(link))
        }
    }
}
However, I have not figured out a good way to return a list of all the "dead end" pages. As written, this code returns only a single model page rather than the whole list, because the return() inside the for loop exits after the first link. I could declare a global variable and have the URLs saved to it, but I am wondering if there is a better way to go about this. How should I go about constructing this function so that it gives me a list of every single page?
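For reference, here is roughly what the global-variable approach I mention above might look like. This is only a minimal sketch: the names Dead.Ends and WebCrawlerGlobal are placeholders I made up, and it does not resolve relative links against the base URL.

library(XML)

Dead.Ends <- character(0)                # accumulator living in the global environment

WebCrawlerGlobal <- function(url) {
    doc <- htmlParse(url)
    links <- xpathSApply(doc, "//a/@href")
    free(doc)
    if (length(links) == 0) {            # no links found: record this page as a dead end
        Dead.Ends <<- c(Dead.Ends, url)  # <<- appends to the global accumulator
    } else {
        for (link in links) {            # otherwise recurse into each link found
            WebCrawlerGlobal(link)       # NOTE: relative links would need resolving first
        }
    }
}

Calling WebCrawlerGlobal("http://nomads.ncep.noaa.gov/cgi-bin/filter_cmcens.pl") would then leave the dead-end URLs in Dead.Ends, but it relies on modifying global state, which is exactly what I would prefer to avoid.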