I have tried all sorts of things, but my basic problem is this:

library(XML)

url <- "http://www.ref.org.uk/fuel/tablebysp.php?valdate=2015-03-08"
data <- readHTMLTable(url, header = TRUE, as.data.frame = TRUE, which = 2)
typeof(data)

My data looks fine, but I cannot coerce it to a data frame. I don't know what is stopping me.

1 Answer

As pointed out in the comments on the post, your code actually works fine. You can get strings instead of factors with:

url<- "http://www.ref.org.uk/fuel/tablebysp.php?valdate=2015-03-08"
data <- readHTMLTable(url, header=TRUE, as.data.frame=TRUE, which=2, 
                      stringsAsFactors=FALSE)
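
One possible source of the confusion (an assumption, since the question does not show what `typeof(data)` printed) is that `typeof()` reports the underlying storage type, which is `"list"` for every data frame. `class()` or `is.data.frame()` are the checks that answer "is this a data frame?". A minimal sketch, using a small made-up data frame `df` in place of the scraped table:

```r
# Hypothetical stand-in for the scraped table (not the real data).
df <- data.frame(SP = 1:2, Gas = c("3,467", "3,522"), stringsAsFactors = FALSE)

typeof(df)         # "list": the storage type of every data frame
class(df)          # "data.frame"
is.data.frame(df)  # TRUE
str(df)            # compact per-column summary of types and values
```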

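Here is how to do the same thing with the rvest package, which really shines, especially when a page has multiple tables or oddly nested ones. There is also a bit of dplyr.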

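In this case there is more than one table on the page, and the second one is the one you want. Thankfully, it is well formatted. The code below extracts all the tables from the page (using a CSS selector) and then uses the handy magrittr function extract2 to avoid odd/ugly [[]] usage.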

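The pipe idiom (which started with magrittr and is now used in much of the hadleyverse) "pushes" or "flows" data from left to right, rather than "popping" it out of nested parenthesized calls.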

library(rvest)
library(magrittr)
library(dplyr)

# read_html() replaces html(), which older rvest releases used here
pg <- read_html("http://www.ref.org.uk/fuel/tablebysp.php?valdate=2015-03-08")
dat <- pg %>% html_nodes("table") %>% extract2(2) %>% html_table(header=TRUE)
glimpse(dat)

## Observations: 48
## Variables:
## $ SD         (chr) "2015-03-08", "2015-03-08", "2015-03-08", "2015-03-08", "2015-03-08", "2015-03-08", "...
## $ SP         (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24...
## $ Gas        (chr) "3,467", "3,522", "3,594", "3,529", "2,811", "2,538", "2,520", "2,489", "2,498", "2,5...
## $ Coal       (chr) "8,261", "8,062", "7,876", "7,437", "6,751", "6,799", "6,621", "6,428", "6,586", "6,2...
## $ Nuclear    (chr) "7,495", "7,553", "7,641", "7,676", "7,674", "7,672", "7,676", "7,677", "7,672", "7,6...
## $ Hydro      (int) 737, 729, 666, 651, 646, 647, 645, 648, 658, 729, 734, 736, 740, 738, 740, 741, 751, ...
## $ Net Pumped (chr) "-438", "-84", "-504", "-860", "-1,092", "-1,118", "-1,396", "-1,700", "-1,606", "-1,...
## $ Wind       (chr) "4,675", "4,795", "4,623", "4,572", "4,647", "4,570", "4,377", "4,445", "4,602", "4,5...
## $ OCGT       (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Oil        (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Biomass    (chr) "1,078", "1,079", "1,081", "1,048", "1,005", "1,022", "1,086", "1,035", "1,072", "1,0...
## $ French Int (chr) "480", "480", "680", "678", "1,614", "1,626", "1,532", "1,536", "772", "772", "504", ...
## $ Dutch Int  (chr) "860", "852", "874", "838", "850", "866", "848", "830", "830", "866", "862", "862", "...
## $ NI Int     (int) 22, -72, 2, -16, -30, -50, 4, 4, -122, -138, -108, -114, -2, 16, 24, 24, 28, -30, -42...
## $ Eire Int   (int) 170, 190, 142, 142, 142, 114, 114, 114, 114, 112, 88, 50, 16, 16, 16, 42, 18, -72, -1...
## $ Net Supply (chr) "26,807", "27,106", "26,675", "25,695", "25,018", "24,686", "24,027", "23,506", "23,0...

If you don't like pipes, or don't normally use them, you can write the same thing as nested calls:

html_table(extract2(html_nodes(pg, "table"), 2), header=TRUE)
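
To see that the piped and nested forms are interchangeable without hitting the network, here is a small sketch that runs both against an inline HTML string. The two-table page is made up for illustration; only the function calls mirror the ones used above.

```r
library(rvest)
library(magrittr)

# Hypothetical two-table page, inlined so no network access is needed.
page <- read_html("<html><body>
  <table><tr><th>x</th></tr><tr><td>skip</td></tr></table>
  <table>
    <tr><th>SP</th><th>Gas</th></tr>
    <tr><td>1</td><td>3,467</td></tr>
    <tr><td>2</td><td>3,522</td></tr>
  </table>
</body></html>")

# Piped form: each step reads left to right.
piped <- page %>% html_nodes("table") %>% extract2(2) %>% html_table(header = TRUE)

# Nested form: the same calls, read from the inside out.
nested <- html_table(extract2(html_nodes(page, "table"), 2), header = TRUE)

# extract2(x, 2) is simply x[[2]].
bracket <- html_table(html_nodes(page, "table")[[2]], header = TRUE)

identical(piped, nested)   # TRUE
identical(piped, bracket)  # TRUE
```

All three produce the same table, so the choice between them is purely a matter of readability.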

You can then do some basic cleanup on the columns to get useful numeric/date values:

dat %>% 
  mutate(SD=as.Date(SD),
         Gas=as.numeric(gsub(",", "", Gas)),
         Coal=as.numeric(gsub(",", "", Coal)),
         Nuclear=as.numeric(gsub(",", "", Nuclear)),
         `Net Pumped`=as.numeric(gsub(",", "", `Net Pumped`)),
         `Wind`=as.numeric(gsub(",", "", `Wind`)),
         Biomass=as.numeric(gsub(",", "", Biomass)),
         `French Int`=as.numeric(gsub(",", "", `French Int`)),
         `Dutch Int`=as.numeric(gsub(",", "", `Dutch Int`)),
         `Net Supply`=as.numeric(gsub(",", "", `Net Supply`))) -> dat

glimpse(dat)

## Observations: 48
## Variables:
## $ SD         (date) 2015-03-08, 2015-03-08, 2015-03-08, 2015-03-08, 2015-03-08, 2015-03-08, 2015-03-08, ...
## $ SP         (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24...
## $ Gas        (dbl) 3467, 3522, 3594, 3529, 2811, 2538, 2520, 2489, 2498, 2543, 2531, 2522, 2627, 2729, 2...
## $ Coal       (dbl) 8261, 8062, 7876, 7437, 6751, 6799, 6621, 6428, 6586, 6229, 6194, 6299, 6455, 6639, 6...
## $ Nuclear    (dbl) 7495, 7553, 7641, 7676, 7674, 7672, 7676, 7677, 7672, 7670, 7673, 7677, 7677, 7681, 7...
## $ Hydro      (int) 737, 729, 666, 651, 646, 647, 645, 648, 658, 729, 734, 736, 740, 738, 740, 741, 751, ...
## $ Net Pumped (dbl) -438, -84, -504, -860, -1092, -1118, -1396, -1700, -1606, -1632, -1344, -1052, -1342,...
## $ Wind       (dbl) 4675, 4795, 4623, 4572, 4647, 4570, 4377, 4445, 4602, 4570, 4529, 4512, 4312, 3976, 3...
## $ OCGT       (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Oil        (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Biomass    (dbl) 1078, 1079, 1081, 1048, 1005, 1022, 1086, 1035, 1072, 1086, 1084, 1085, 1086, 1082, 1...
## $ French Int (dbl) 480, 480, 680, 678, 1614, 1626, 1532, 1536, 772, 772, 504, 502, 1598, 1602, 1878, 188...
## $ Dutch Int  (dbl) 860, 852, 874, 838, 850, 866, 848, 830, 830, 866, 862, 862, 884, 846, 942, 914, 1032,...
## $ NI Int     (int) 22, -72, 2, -16, -30, -50, 4, 4, -122, -138, -108, -114, -2, 16, 24, 24, 28, -30, -42...
## $ Eire Int   (int) 170, 190, 142, 142, 142, 114, 114, 114, 114, 112, 88, 50, 16, 16, 16, 42, 18, -72, -1...
## $ Net Supply (dbl) 26807, 27106, 26675, 25695, 25018, 24686, 24027, 23506, 23076, 22807, 22747, 23079, 2...
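
The repeated `as.numeric(gsub(...))` calls can also be factored into a small helper and applied to a vector of column names. The sketch below uses base R only and a made-up `toy` data frame; `parse_num` and `num_cols` are hypothetical names introduced for illustration, not part of any package.

```r
# Hypothetical helper: strip thousands separators, then convert to numeric.
parse_num <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Toy stand-in for the scraped table (made-up rows, same column style).
toy <- data.frame(SP = 1:2,
                  Gas = c("3,467", "3,522"),
                  `Net Pumped` = c("-438", "-1,092"),
                  check.names = FALSE,
                  stringsAsFactors = FALSE)

# List the columns to clean explicitly, so that other character
# columns (such as a date column) are left alone.
num_cols <- c("Gas", "Net Pumped")
toy[num_cols] <- lapply(toy[num_cols], parse_num)

str(toy)
```

Naming the columns explicitly is a deliberate choice here: converting every character column would also hit the date column `SD` in the real table and turn it into `NA`.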
Answered 2015-03-12T19:22:50.547