I am extracting information from a table on a website. The table comes out like this (see below).

1. Saturday
2. 4:00 PM
3. 5:30 PM
4. Sunday
5. 8:30 AM
6. 10:00 AM

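What I really need is for it to come out like this (see below). I don't think the html_table() function can convert it, but I'm hoping someone knows how to reshape it in R.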

1. Saturday    4:00 PM
2. Saturday    5:30 PM
3. Sunday      8:30 AM
4. Sunday      10:00 AM

Here is the code I am using:

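library(rvest)   # read_html(), html_table(), and the %>% pipe
library(tidyr)   # unnest()
library(tibble)  # tibble()

urls <- 'https://www.life.church/edmond/'

# Read the tables on a page into a data frame, keeping the URL as a column
times <- function(x){ 
  try( x %>%
         read_html()%>%
         html_table(header = FALSE)%>%
         data.frame(x))

}


#Apply function to the urls
m <- lapply(urls, times)

#Convert to a dataframe 
data <-data.frame(unnest(tibble(m), cols = m))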
1 Answer

Here is what I would do:

library(dplyr)
library(xml2)
library(rvest)
library(tidyr)
library(purrr)

times <- function(x){ 
  try(
    x %>%
      read_html() %>%
      html_table(header = FALSE) %>% 
      flatten() %>% 
      as_tibble()
  )
}

urls <- c('https://www.life.church/edmond/', 'https://www.life.church/fortworth/')

lapply(urls, times) %>% 
  set_names(urls) %>% 
  bind_rows(.id = "URL") %>% 
  separate(X1, into = c("Time", "Day"), sep = "(?=^\\D)") %>% 
  fill(Day) %>% 
  filter(Time != "") %>% 
  select(URL, Day, Time)
# A tibble: 16 x 3
   URL                                Day       Time    
   <chr>                              <chr>     <chr>   
 1 https://www.life.church/edmond/    Saturday  4:00 PM 
 2 https://www.life.church/edmond/    Saturday  5:30 PM 
 3 https://www.life.church/edmond/    Sunday    8:30 AM 
 4 https://www.life.church/edmond/    Sunday    10:00 AM
 5 https://www.life.church/edmond/    Sunday    11:30 AM
 6 https://www.life.church/edmond/    Sunday    1:00 PM 
 7 https://www.life.church/edmond/    Sunday    4:00 PM 
 8 https://www.life.church/edmond/    Sunday    5:30 PM 
 9 https://www.life.church/edmond/    Wednesday 7:00 PM 
10 https://www.life.church/fortworth/ Saturday  4:00 PM 
11 https://www.life.church/fortworth/ Saturday  5:30 PM 
12 https://www.life.church/fortworth/ Sunday    8:30 AM 
13 https://www.life.church/fortworth/ Sunday    10:00 AM
14 https://www.life.church/fortworth/ Sunday    11:30 AM
15 https://www.life.church/fortworth/ Sunday    1:00 PM 
16 https://www.life.church/fortworth/ Wednesday 7:00 PM

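separate() uses a lookahead regular expression, (?=^\\D), that matches the empty position at the start of any entry beginning with a non-digit. Day names are therefore split into an empty Time and a filled Day, while entries that start with a digit (the times) do not match and stay in Time with Day set to NA. fill(Day) then carries each day name down to the times below it, and filter(Time != "") drops the rows that held only a day name.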
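The same reshaping can be tried offline on the six values from the question, without scraping anything. This is only a sketch of an alternative, not the code above: it flags time rows with grepl() instead of using separate() with a lookahead, and the names raw, is_time, and schedule are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the scraped column X1 (values copied from the question)
raw <- tibble(X1 = c("Saturday", "4:00 PM", "5:30 PM",
                     "Sunday", "8:30 AM", "10:00 AM"))

schedule <- raw %>%
  mutate(
    is_time = grepl("^[0-9]", X1),                 # TRUE for rows that start with a digit
    Day     = if_else(is_time, NA_character_, X1), # day names only; NA on time rows
    Time    = if_else(is_time, X1, NA_character_)  # times only; NA on day rows
  ) %>%
  fill(Day) %>%        # carry each day name down to the times below it
  filter(is_time) %>%  # drop the rows that held only a day name
  select(Day, Time)

print(schedule)
```

Flagging the rows explicitly avoids the "missing pieces" warning that separate() emits for the time rows, at the cost of one extra helper column.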

Answered 2020-01-15T16:53:37.623