I am writing an update to my rNOMADS package to include all the models on the NOMADS web site. To do this, I must search the HTML directory tree for each model. I do not know beforehand how deep this tree is or how many branches it contains. Therefore I am writing a simple web crawler to recursively search the top page for links, follow each link, and return the URLs of pages that have no further links. Such a page is the download page for model data. Here is an example of a URL that must be searched:

http://nomads.ncep.noaa.gov/cgi-bin/filter_cmcens.pl

I want to get the addresses of all web pages below this one. I have attempted the following code:
library(XML)

url <- "http://nomads.ncep.noaa.gov/cgi-bin/filter_cmcens.pl"

WebCrawler <- function(url) {
    doc <- htmlParse(url)
    links <- xpathSApply(doc, "//a/@href")
    free(doc)
    if (is.null(links)) { # If there are no links, this is the page we want, return it!
        return(url)
    } else {
        for (link in links) { # Call recursively on each link found
            print(link)
            return(WebCrawler(link))
        }
    }
}
However, I have not figured out a good way to return a list of all the "dead end" pages. As written, this code returns only a single model page rather than the whole list, because the return() inside the for loop exits after the first link. I could declare a global variable and have the URLs saved to it, but I am wondering if there is a better way to go about this. How should I go about constructing this function so that it gives me a list of every single page?
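For reference, here is roughly what the global-variable approach I mention above might look like. This is only a minimal sketch: the names Dead.Ends and WebCrawlerGlobal are placeholders I made up, and it does not resolve relative links against the base URL.

library(XML)

Dead.Ends <- character(0)                # accumulator living in the global environment

WebCrawlerGlobal <- function(url) {
    doc <- htmlParse(url)
    links <- xpathSApply(doc, "//a/@href")
    free(doc)
    if (length(links) == 0) {            # no links found: record this page as a dead end
        Dead.Ends <<- c(Dead.Ends, url)  # <<- appends to the global accumulator
    } else {
        for (link in links) {            # otherwise recurse into each link found
            WebCrawlerGlobal(link)       # NOTE: relative links would need resolving first
        }
    }
}

Calling WebCrawlerGlobal("http://nomads.ncep.noaa.gov/cgi-bin/filter_cmcens.pl") would then leave the dead-end URLs in Dead.Ends, but it relies on modifying global state, which is exactly what I would prefer to avoid.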