xml - 如何优化我的 R 代码以用于网页抓取德甲足球数据

Question

我在编写简短的 R 代码方面有一些经验，但实际上我并不擅长。我想知道您是否可以帮助我优化我的 R 代码以从德国杂志 Kicker 查询足球比分和机会。

您可以从此链接获取所有结果和分数（2012/13 赛季）：

http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/2012-13/-1/0/spieltag.html

此外，我需要每场比赛的机会。您只需点击链接“分析”，您就会在页面右下角的“Spielinfo”部分找到机会。

但是，在我的代码中，我首先必须这样做。

http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/2012-13/1/1441337/spielinfo_borussia-dortmund-17_werder-bremen-4.html

之后是“分析”部分（通过用 spielanalysis 替换 spielinfo）

我的代码有效，但似乎存在瓶颈，我还没有找到任何提高性能的解决方案。也许你可以帮助我更好地理解我的脚本。

我写了一个这样的函数：

library(XML)
library(stringr)
library(plyr)


kicker.daten <- function(saison)
{
  # define the url ('base url')
  url <- paste("http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/", saison, "/-1/0/spieltag.html", sep = "")

  # save source code for parsing
  doc <- htmlParse(url, isURL = TRUE, encoding = 'UTF-8')

  # Query document to extract the teams (home, away)
  # btw, really difficult to find the path
  spiele <- xpathApply(doc, "//div[@id='ctl00_PlaceHolderContent_begegnungenCtrl']//table[@class='tStat']//div/a[@class='link']", xmlValue, "class")

  # convert to vector, don't know why but it works
  spiele <- unlist(spiele)

  # separating home and away teams
  heim <- character()
  gast <- character()

  for (i in 1 : (length(spiele)/2))
  {
    heim[i] <- spiele[2*i-1]
    gast[i] <- spiele[2*i]
  }

  # query document to extract results
  ergebnis <- xpathApply(doc, "//div[@id='ctl00_PlaceHolderContent_begegnungenCtrl']//table[@class='tStat']//td[@class='alignright']", xmlValue, "class")

  # again convert to vector
  ergebnis <- unlist(ergebnis)

  # text processing to get the results
  ergebnis <- substring(ergebnis, 1, str_locate(ergebnis, '\\s')[,1]-1)

  # extracting scores for home and away team
  th <- as.numeric(substring(ergebnis, 1, str_locate(ergebnis, ':')[,1]-1))
  tg <- as.numeric(substring(ergebnis, str_locate(ergebnis, ':')[,1]+1), str_locate(ergebnis, '-:-')[,1]+1)

  # extract the links from 'base url' to get all the pages / url that contain the chances
  # unfortunately, I first had to go via 'Spielinfo'
  links <- xpathApply(doc, "//td[@class='aligncenter']/a", xmlGetAttr, "href")

  # convert to vector
  # substitute spielinfo with spielanalyse to extraxt the chances
  links <- unlist(links)
  links <- sub('spielinfo', 'spielanalyse', links)

  # define urls
  sub_url <- character()
  for(i in 1:length(links))
  {
    sub_url[i] <- paste("http://www.kicker.de", links[i], sep = "")
  }

  # save source code in sub-documents
  sub_doc <- llply(sub_url, htmlParse, .progress = "text", .inform = T)

  # query every subdocument
  # I guess the bottleneck is here!!!
  chancen <- list()
  for (i in 1:length(sub_doc)) 
  {
    chancen[[i]] <- xpathSApply(sub_doc[[i]], "//div[@id='ctl00_PlaceHolderHalf_ctl03_chancen']//div[@class='wert']", xmlValue, "class")
  }

  # convert to vector
  chancen <- unlist(chancen)

  # separate chances for home and away team
  ch <- as.numeric(substring(chancen, 1, str_locate(chancen, ':')[,1]-1))
  cg <- as.numeric(substring(chancen, str_locate(chancen, ':')[,1]+1), str_locate(chancen, ':')[,1]+1)

  # first dataframe to exclude missing values
  df <- na.omit(data.frame(cbind(heim, gast, ergebnis, th, tg), stringsAsFactors = F))

  # season column
  saison <- rep(saison, 306)

  # daY of match
  spieltag <- as.numeric(sort(rep(1:34, times = 9)))

  # final dataframe
  result <- data.frame(cbind(saison = saison, spieltag = spieltag, heim = df$heim, gast = df$gast, ergebnis = df$ergebnis, chancen = chancen, th = as.integer(df$th), tg = as.integer(df$tg), ch = as.integer(ch)), cg = as.integer(cg), stringsAsFactors = F)
  return(result)
}

# All results for the season 2012/13

kicker.daten("2012-13")

如果我的评论有误导性，请告诉我。老实说，我真的不知道代码到底是做什么的。

谢谢，

拉米

xml - 如何优化我的 R 代码以用于网页抓取德甲足球数据

0 回答 0

Related

Reference