
I created a looping function that uses the search API to extract tweets at a set interval (say, every 5 minutes). The function does what it is supposed to do: connect to Twitter, extract tweets containing a specific keyword, and save them to a CSV file. But occasionally (2-3 times a day) the loop stops with one of the following two errors:

  • Error in htmlTreeParse(URL, useInternal = TRUE): error in creating parser for http://search.twitter.com/search.atom?q= 6.95322e-310tst&rpp=100&page=10

  • Error in UseMethod("xmlNamespaceDefinitions"): no applicable method for "xmlNamespaceDefinitions" applied to an object of class "NULL"

I hope you can help me deal with these errors by answering a few questions:

  • What causes these errors to occur?
  • How can I adjust my code to avoid these errors?
  • How can I "force" the loop to keep running when an error is encountered (for example, by using the try function)?

My function (based on several scripts found online) is as follows:

    library(XML)   # htmlTreeParse, getNodeSet, xpathApply

    twitter.search <- "Keyword"

    QUERY <- URLencode(twitter.search)

    # Time between loop iterations (in seconds) and number of iterations
    d_time <- 300
    number_of_times <- 3000

    for (i in 1:number_of_times) {

        tweets      <- NULL
        tweet.count <- 0
        page        <- 1
        read.more   <- TRUE

        while (read.more) {
            # construct Twitter search URL
            URL <- paste('http://search.twitter.com/search.atom?q=', QUERY,
                         '&rpp=100&page=', page, sep = '')
            # fetch remote URL and parse; note the empty error handler,
            # which silently swallows fetch/parse failures
            XML <- htmlTreeParse(URL, useInternal = TRUE, error = function(...){})

            # extract list of "entry" nodes
            entry <- getNodeSet(XML, "//entry")

            read.more <- (length(entry) > 0)
            if (read.more) {
                # use j for the inner index so the outer loop's i is not shadowed
                for (j in seq_along(entry)) {
                    subdoc <- xmlDoc(entry[[j]])   # put entry in a separate object to manipulate

                    published <- unlist(xpathApply(subdoc, "//published", xmlValue))
                    published <- gsub("Z", " ", gsub("T", " ", published))

                    # convert from GMT to local (Amsterdam) time
                    time.gmt   <- as.POSIXct(published, "GMT")
                    local.time <- format(time.gmt, tz = "Europe/Amsterdam")

                    title  <- unlist(xpathApply(subdoc, "//title", xmlValue))
                    author <- unlist(xpathApply(subdoc, "//author/name", xmlValue))

                    tweet <- paste(local.time, " @", author, ":  ", title, sep = "")

                    entry.frame <- data.frame(tweet, author, local.time,
                                              stringsAsFactors = FALSE)
                    tweet.count <- tweet.count + 1
                    rownames(entry.frame) <- tweet.count
                    tweets <- rbind(tweets, entry.frame)
                }
                page      <- page + 1
                read.more <- (page <= 15)   # there seems to be a 15-page limit
            }
        }

        # names(tweets)   # no effect inside a script loop; only useful interactively

        # top 15 tweeters
        # sort(table(tweets$author), decreasing = TRUE)[1:15]

        write.table(tweets,
                    file = paste("Twitts - ",
                                 format(Sys.time(), "%a %b %d %H_%M_%S %Y"),
                                 ".csv"),
                    sep = ";")

        Sys.sleep(d_time)

    } # end for

2 Answers


Here is how I solved a similar problem with the Twitter API by using try.

I was asking the Twitter API for the number of followers of each person in a long list of Twitter users. When a user's account is protected, I get an error, and before I put in the try function, the loop would break at that point. Using try allows the loop to keep working by skipping ahead to the next person on the list.

Here is the setup:

# load library
library(twitteR)
#
# Search Twitter for your term
s <- searchTwitter('#rstats', n=1500) 
# convert search results to a data frame
df <- do.call("rbind", lapply(s, as.data.frame)) 
# extract the usernames
users <- unique(df$screenName)
users <- sapply(users, as.character)
# make a data frame for the loop to work with 
users.df <- data.frame(users = users, 
                       followers = "", stringsAsFactors = FALSE)

And here is the loop in which try handles the errors, while users.df$followers is filled with the follower counts obtained from the Twitter API:

for (i in 1:nrow(users.df)) {
    # try() returns an object of class "try-error" instead of stopping
    # the loop when a user's account is protected or another error occurs
    result <- try(getUser(users.df$users[i])$followersCount, silent = TRUE)
    # tell the loop to skip to the next user on any error
    if (inherits(result, "try-error")) next
    # store the follower count for this user (reusing the value from try
    # avoids a second, unprotected API call)
    users.df$followers[i] <- result
    # pause for 60 s between iterations to avoid exceeding
    # the Twitter API request limit
    print('Sleeping for 60 seconds...')
    Sys.sleep(60)
}
#
# Now inspect users.df to see the follower data
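The same idea carries over to the loop in the question. Below is a minimal sketch only, assuming the question's URL construction and loop variables; it uses tryCatch instead of the empty error handler, so that a failed fetch is detected explicitly rather than silently producing a NULL document:

    # minimal sketch: make fetch/parse failures explicit instead of silent
    XML <- tryCatch(htmlTreeParse(URL, useInternal = TRUE),
                    error = function(e) NULL)
    if (is.null(XML)) {
        # the fetch or parse failed; without this check, getNodeSet(NULL, ...)
        # is what raises the "xmlNamespaceDefinitions ... NULL" error
        read.more <- FALSE
    } else {
        entry <- getNodeSet(XML, "//entry")
        read.more <- (length(entry) > 0)
        # ... rest of the loop body as in the question ...
    }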
Answered 2012-05-31T17:48:25.480

My guess is that your problem corresponds to Twitter (or your network connection) being down or slow or the like, so that you are getting a bad result back. Have you tried setting

options(error = recover)

Then, the next time you get an error, you will be dropped into a nice browser environment in which you can inspect the state of your code at the moment the error occurred.
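As a minimal sketch of how that fits around the loop in the question:

    # drop into an interactive frame browser whenever an uncaught error occurs
    options(error = recover)

    # ... run the search loop; at the next failure you can pick a frame and
    # inspect variables such as URL and XML at the moment of the error ...

    # restore the default error behaviour once you are done debugging
    options(error = NULL)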

Answered 2012-05-31T13:49:49.427