I am writing an update to my rNOMADS package to include all the models on the NOMADS web site. To do this, I must search the html directory tree for each model. I do not know how deep this tree is, or how many branches it contains, beforehand. Therefore I am writing a simple web crawler to recursively search the top page for links, follow each link, and return the URLs of pages that have no more links. Such a page is the download page for model data. Here is an example of a URL that must be searched.

I want to get the addresses of all web pages below this one. I have attempted this code:

library(XML)
url <- "http://nomads.ncep.noaa.gov/cgi-bin/filter_cmcens.pl"

WebCrawler <- function(url) {
    doc <- htmlParse(url)
    links <- xpathSApply(doc, "//a/@href")
    free(doc)
    if(is.null(links)) { #If there are no links, this is the page we want, return it!
        return(url)
    } else {
       for(link in links) { #Call recursively on each link found
           print(link)
           return(WebCrawler(link))
        }
    }
}

However, I have not figured out a good way to return a list of all the "dead end" pages. Instead, this code returns only one model page, not the whole list of them. I could declare a global variable and save the URLs to it, but I am wondering if there is a better way to go about this. How should I construct this function so that it gives me a list of every single page?

1 Answer

Your error is in the recursion:

## THIS IS INCORRECT
for(link in links) { #Call recursively on each link found
           print(link)
           return(WebCrawler(link))   # <~~~ Specifically this line
        }

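There is no recursive property here: the function returns on the first link, so you are just digging deep along a single branch.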

      *
    /   \
    \
     \
      \ 
       \
        * 
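The underlying behaviour is easy to reproduce in isolation. A minimal sketch (the function name `first_only` is made up for illustration): a `return()` inside a `for` loop exits the whole function on the first iteration, so the remaining elements are never visited.

```r
# Hypothetical demo function: return() inside the loop ends the function
# immediately, so only the first element is ever processed.
first_only <- function(xs) {
  for (x in xs) {
    return(x)
  }
}

result <- first_only(c("a", "b", "c"))
print(result)
```

The same thing happens in `WebCrawler`: only the first entry of `links` is ever followed at each level.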

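You do not want to return the value of WebCrawler(link). Instead, you want to capture that value for each link, then return the collection of values.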

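ret <- vector("list", length = length(links))
for (i in seq_along(links)) { # Call recursively on each link found
    print(links[[i]])
    # Index by position: ret[[link]] would append named entries after
    # the pre-allocated NULL slots instead of filling them.
    ret[[i]] <- WebCrawler(links[[i]])   # <~~~ Specifically this line
}
return(ret) # or  return(unlist(ret))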
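To see the capture-and-return pattern run end to end without touching the network, here is a minimal sketch. The `site` table, the page names in it, and the `get_links` helper are all hypothetical stand-ins for what `htmlParse` and `xpathSApply` would produce; they are not real NOMADS pages.

```r
# Hypothetical in-memory "site": each page name maps to the links found on it.
# This stands in for htmlParse() + xpathSApply() so the sketch runs offline.
site <- list(
  top     = c("model_a", "model_b"),
  model_a = c("a_run1", "a_run2"),
  model_b = c("b_run1"),
  a_run1  = character(0),
  a_run2  = character(0),
  b_run1  = character(0)
)

# Assumed helper: return the links on a page (empty vector for a dead end).
get_links <- function(page) site[[page]]

crawl <- function(page) {
  links <- get_links(page)
  if (length(links) == 0) {
    return(page)  # dead end: this is a page we want
  }
  # Capture one result per link instead of returning inside the loop.
  ret <- vector("list", length(links))
  for (i in seq_along(links)) {
    ret[[i]] <- crawl(links[[i]])
  }
  ret
}

nested <- crawl("top")
str(nested)
```

Testing for "no links" with `length(links) == 0` rather than `is.null(links)` is a deliberate choice in this sketch, since it covers both `NULL` and an empty vector or list. A crawler for real pages would likely also need to resolve relative links and remember visited URLs so that it does not loop.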

Update:

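It may be worth considering what final output you expect. As written, you will get a deeply nested list. If you only want the terminal nodes, you can use unlist(.. recursive=TRUE, use.names=FALSE) on the result, or you can even unlist as you go along, but that may slow you down. It is probably worth benchmarking to be sure.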
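The two options can be compared side by side. This is only a sketch on a made-up link table (`site`, `get_links`, and the page names are assumptions for illustration); it shows that both approaches give the same terminal nodes, and says nothing about which is faster on a real crawl.

```r
# Hypothetical link table standing in for real pages (not NOMADS data).
site <- list(
  top     = c("model_a", "model_b"),
  model_a = c("a_run1", "a_run2"),
  model_b = "b_run1"
)

# Assumed helper: pages missing from the table have no links.
get_links <- function(page) {
  links <- site[[page]]
  if (is.null(links)) character(0) else links
}

# Option 1: build the nested list, then flatten once at the end.
crawl_nested <- function(page) {
  links <- get_links(page)
  if (length(links) == 0) return(page)
  lapply(links, crawl_nested)
}
leaves_end <- unlist(crawl_nested("top"), recursive = TRUE, use.names = FALSE)

# Option 2: flatten at every level, so each call returns a character vector.
crawl_flat <- function(page) {
  links <- get_links(page)
  if (length(links) == 0) return(page)
  unlist(lapply(links, crawl_flat), use.names = FALSE)
}
leaves_flat <- crawl_flat("top")

print(identical(leaves_end, leaves_flat))
```

To benchmark the two on real data, wrapping each call in `system.time()` would be a simple starting point.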

answered 2013-10-15T14:50:11.907