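If you look at just one element at a time, I think as.data.frame does a reasonably good job. I'll demonstrate using the abbreviated data (which I edited into your question); the first element looks like this: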
raw_jobs_sublist <- lapply(raw_jobs_list, function(x) c(x[c("id","score")], list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))])))
as.data.frame(raw_jobs_sublist[[1]])
# id score fields.date.changed fields.date.created fields.date.closing fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3 fields.title
# 1 3594134 1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149 Mali -1.25 17.35 149 Mali mli REGIONAL MANAGER West Africa
Shown a different way (just for variety), it is:
str(as.data.frame(raw_jobs_sublist[[1]]))
# 'data.frame': 1 obs. of 13 variables:
# $ id : chr "3594134"
# $ score : int 1
# $ fields.date.changed : chr "2020-04-18T00:35:00+00:00"
# $ fields.date.created : chr "2020-04-07T11:15:37+00:00"
# $ fields.date.closing : chr "2020-04-17T00:00:00+00:00"
# $ fields.country.href : chr "https://api.reliefweb.int/v1/countries/149"
# $ fields.country.name : chr "Mali"
# $ fields.country.location.lon: num -1.25
# $ fields.country.location.lat: num 17.4
# $ fields.country.id : int 149
# $ fields.country.shortname : chr "Mali"
# $ fields.country.iso3 : chr "mli"
# $ fields.title : chr "REGIONAL MANAGER West Africa"
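To do this for all elements, we need to consider a few things:
- not every element has every field, so whatever method we use needs to "fill in" the blanks;
- we don't want to build the result iteratively; let's combine everything in one step.
Here's a first stab: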
dplyr::bind_rows(lapply(raw_jobs_sublist, as.data.frame))
# id score fields.date.changed fields.date.created fields.date.closing fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3 fields.title
# 1 3594134 1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149 Mali -1.25 17.35 149 Mali mli REGIONAL MANAGER West Africa
# 2 3594129 1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00 <NA> <NA> NA NA NA <NA> <NA> Support Relief Group Public Health Advisor (Multiple Positions)
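This also works with data.table::rbindlist (with fill = TRUE). It does not work with do.call(rbind.data.frame, ...), because that is less tolerant of missing names. (That can be worked around without much difficulty, but the other two options occasionally bring other benefits as well.)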
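For completeness, here is a minimal base-R sketch of how the do.call(rbind, ...) route could be made to work by padding each data frame with the columns it lacks. The toy data and the helper name fill_missing are made up for illustration and are not part of the original data or of any package.

```r
# Two toy one-row data frames with different columns (made-up data)
dfs <- list(
  data.frame(id = "a", score = 1L, fields.title = "first",
             stringsAsFactors = FALSE),
  data.frame(id = "b", score = 2L, fields.country.name = "Mali",
             stringsAsFactors = FALSE)
)

# Union of all column names, in order of first appearance
all_names <- unique(unlist(lapply(dfs, names)))

# fill_missing() is a hypothetical helper: add each absent column as NA,
# then put the columns in a common order so rbind() can line them up
fill_missing <- function(d, nms) {
  for (nm in setdiff(nms, names(d))) d[[nm]] <- NA
  d[nms]
}

combined <- do.call(rbind, lapply(dfs, fill_missing, nms = all_names))
combined
```

This gives the same shape of result as bind_rows on this toy input, though it is likely slower on long lists and does less type checking.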
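Note: if you do this on your original data, R's default mechanism for displaying a data.frame will print all of the text to your console, which can be annoying. If you're already using dplyr or data.table in any of your other work, both provide a format that truncates long strings, so the output is more tolerable on the console. For example, showing the whole thing: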
tibble::tibble(dplyr::bind_rows(lapply(raw_jobs_list, as.data.frame)))
# # A tibble: 2 x 42
# id score fields.date.cha~ fields.date.cre~ fields.date.clo~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.career_c~ fields.career_c~ fields.name fields.source.h~ fields.source.n~ fields.source.id fields.source.t~ fields.source.t~ fields.source.s~ fields.source.h~ fields.title fields.body
# <chr> <int> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int> <chr> <int> <chr> <chr> <chr> <chr>
# 1 3594~ 1 2020-04-18T00:3~ 2020-04-07T11:1~ 2020-04-17T00:0~ https://api.rel~ Mali -1.25 17.4 149 Mali mli Donor Relations~ 20966 Bamako https://api.rel~ ICCO COOPERATION 45059 Non-governmenta~ 274 ICCO COOPERATION https://www.icc~ REGIONAL MA~ "**VACANCY~
# 2 3594~ 1 2020-05-19T00:0~ 2020-05-04T15:2~ 2020-05-18T00:0~ <NA> <NA> NA NA NA <NA> <NA> Program/Project~ 6867 <NA> https://api.rel~ US Agency for I~ 1751 Government 271 USAID http://www.usai~ Support Rel~ "### **SOL~
# # ... with 18 more variables: fields.type.name <chr>, fields.type.id <int>, fields.experience.name <chr>, fields.experience.id <int>, fields.url <chr>, fields.url_alias <chr>, fields.how_to_apply <chr>, fields.id <int>, fields.status <chr>, fields.body.html <chr>, fields.how_to_apply.html <chr>, href <chr>, fields.source.longname <chr>, fields.source.spanish_name <chr>,
# # fields.theme.name <chr>, fields.theme.id <int>, fields.theme.name.1 <chr>, fields.theme.id.1 <int>
data.table::rbindlist(lapply(raw_jobs_list, as.data.frame), fill = TRUE)
# id score fields.date.changed fields.date.created fields.date.closing fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3 fields.career_categories.name fields.career_categories.id fields.name
# <char> <int> <char> <char> <char> <char> <char> <num> <num> <int> <char> <char> <char> <int> <char>
# 1: 3594134 1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countri... Mali -1.25 17.35 149 Mali mli Donor Relations/Grants Management 20966 Bamako
# 2: 3594129 1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00 <NA> <NA> NA NA NA <NA> <NA> Program/Project Management 6867 <NA>
# 27 variables not shown: [fields.source.href <char>, fields.source.name <char>, fields.source.id <int>, fields.source.type.name <char>, fields.source.type.id <int>, fields.source.shortname <char>, fields.source.homepage <char>, fields.title <char>, fields.body <char>, fields.type.name <char>, ...]
For data.table, I have set a few options to facilitate this. Notably, I'm currently using:
options(
  datatable.prettyprint.char = 36,
  datatable.print.topn = 10,
  datatable.print.class = TRUE,
  datatable.print.trunc.cols = TRUE
)
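At this point you have a data.frame that should contain all of the data (with NA wherever an element is missing a field). From here, if you don't like the nested naming convention (e.g., fields.date.changed), the columns can easily be renamed using patterns or conventional methods.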
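As one possible illustration of pattern-based renaming, the sketch below strips the leading "fields." prefix and replaces the remaining dots with underscores. The small data frame is a made-up subset of the columns shown earlier, and the target naming style is only an assumption about what you might prefer.

```r
# Made-up subset of the flattened columns, for illustration only
df <- data.frame(
  id = "3594134",
  fields.date.changed = "2020-04-18T00:35:00+00:00",
  fields.country.name = "Mali",
  fields.title = "REGIONAL MANAGER West Africa",
  stringsAsFactors = FALSE
)

# Drop the leading "fields." prefix, if present
new_names <- sub("^fields\\.", "", names(df))

# Replace the remaining literal dots with underscores
new_names <- gsub(".", "_", new_names, fixed = TRUE)

# Guard against collisions before assigning the new names
stopifnot(anyDuplicated(new_names) == 0)
names(df) <- new_names
names(df)
```

The collision check matters on the full data, which has both id and fields.id: stripping the prefix there would produce two columns named id, so you would need to keep the prefix on one of them or pass the names through make.unique first.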