r - Purrr-Fection：寻找利用 Purrr 的条件数据帧操作的优雅解决方案

Question

的背景

我有一个问题，可能有多种解决方案，但我相信有一个尚未发现的优雅解决方案利用 purrr。

示例代码

我有一个如下的大数据框，为此我在下面提供了一个示例：

library(tibble)
library(ggmap)
library(purrr)
library(dplyr)

# Define Example Data
df <- frame_data(
  ~Street,                ~City,        ~State,     ~Zip,  ~lon,      ~lat,
  "226 W 46th St",        "New York",   "New York", 10036, -73.9867,  40.75902,
  "5th Ave",              "New York",   "New York", 10022, NA,        NA,
  "75 Broadway",          "New York",   "New York", 10006, -74.01205, 40.70814,
  "350 5th Ave",          "New York",   "New York", 10118, -73.98566, 40.74871,
  "20 Sagamore Hill Rd",  "Oyster Bay", "New York", 11771, NA,        NA,
  "45 Rockefeller Plaza", "New York",   "New York", 10111, -73.97771, 40.75915
)

挑战

我想对lon和lat列当前所在的所有位置进行地理标记NA。有很多方法可以解决这个问题，其中一种如下所示：

# Safe Code is Great Code
safe_geocode <- safely(geocode)

# Identify Data to be Geotagged by Absence of lon and lat
data_to_be_geotagged <- df %>% filter(is.na(lon) | is.na(lat))

# GeoTag Addresses of Missing Data Points
fullAddress <- paste(data_to_be_geotagged$Street,
                     data_to_be_geotagged$City,
                     data_to_be_geotagged$State,
                     data_to_be_geotagged$Zip,
                     sep = ", ")

fullAddress %>% 
  map(safe_geocode) %>% 
  map("result") %>%
  plyr::ldply()

问题

虽然我可以使上述工作，甚至将新识别的坐标lon和lat坐标重新放入原始数据框中，但整个方案感觉很脏。我相信有一种优雅的方法可以利用管道和 purrr 来遍历数据框，并根据和的缺失有条件地对位置进行地理lon标记lat。

我遇到了许多兔子洞，包括在构建完整地址（以及和）purrr::pmap时尝试并行遍历多个列。尽管如此，我在构建任何可以称为优雅解决方案的东西方面都做得不够好。rowwise()by_row()

提供的任何见解将不胜感激。

score 7 · Accepted Answer

确实，您希望避免不必要的调用，geocode因为它很慢，而且如果您使用的是 Google，那么您每天只有 2500 个查询。因此，最好在同一个调用中创建两个列，这可以通过列表列来完成，使用do或自连接创建新版本的 data.frame。

1.带有列表栏

使用列表列，您可以创建一个新版本的lon和lat，ifelse如果有NAs，则进行地理编码，否则只需复制现有值。然后，摆脱旧版本的列并取消嵌套新的列：

library(dplyr)
library(ggmap)
library(tidyr)    # For `unnest`

       # Evaluate each row separately
df %>% rowwise() %>% 
    # Add a list column. If lon or lat are NA,
    mutate(data = ifelse(any(is.na(c(lon, lat))), 
                         # return a data.frame of the geocoded results,
                         list(geocode(paste(Street, City, State, Zip))), 
                         # else return a data.frame of existing columns.
                         list(data_frame(lon = lon, lat = lat)))) %>% 
    # Remove old columns
    select(-lon, -lat) %>% 
    # Unnest newly created ones from list column
    unnest(data)

## # A tibble: 6 × 6
##                 Street       City    State   Zip       lon      lat
##                  <chr>      <chr>    <chr> <dbl>     <dbl>    <dbl>
## 1        226 W 46th St   New York New York 10036 -73.98670 40.75902
## 2              5th Ave   New York New York 10022 -73.97491 40.76167
## 3          75 Broadway   New York New York 10006 -74.01205 40.70814
## 4          350 5th Ave   New York New York 10118 -73.98566 40.74871
## 5  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 6 45 Rockefeller Plaza   New York New York 10111 -73.97771 40.75915

2. 有`do`

do，另一方面，从旧的部分创建一个全新的 data.frame。它需要稍微笨拙的$符号，用.管道表示分组的 data.frame。使用ifandelse而不是ifelse可以避免将结果嵌套在列表中（无论如何，它们必须在上面）。

       # Evaluate each row separately
df %>% rowwise() %>% 
    # Make a new data.frame from the first four columns and the geocode results or existing lon/lat
    do(bind_cols(.[1:4], if(any(is.na(c(.$lon, .$lat)))){
        geocode(paste(.[1:4], collapse = ' '))
    } else {
        .[5:6]
    }))

它返回与第一个版本完全相同的东西。

3. 在一个子集上，用自连接重组

如果ifelse过于混乱，您可以只对一个子集进行地理编码，然后通过将行绑定到重新组合anti_join，即所有在其中df但不是子集的行.：

df %>% filter(is.na(lon) | is.na(lat)) %>% 
    select(1:4) %>% 
    bind_cols(geocode(paste(.$Street, .$City, .$State, .$Zip))) %>% 
    bind_rows(anti_join(df, ., by = c('Street', 'Zip')))

它返回相同的内容，但新的地理编码行位于顶部。相同的方法适用于列表列 or do，但由于不需要组合两组列，bind_cols因此可以解决问题。

4.在一个子集上`mutate_geocode`

ggmap实际上包括一个mutate_geocode函数，当传递一个 data.frame 和一列地址时，它将添加 lon 和 lat 列。它有一个问题：它不能接受超过地址的列名，因此需要一个包含整个地址的列。因此，虽然这个版本可能非常好，但它需要创建和删除一个包含整个地址的额外列，使其不简洁：

df %>% filter(is.na(lon) | is.na(lat)) %>% 
    select(1:4) %>% 
    mutate(address = paste(Street, City, State, Zip)) %>%    # make an address column
    mutate_geocode(address) %>% 
    select(-address) %>%    # get rid of address column
    bind_rows(anti_join(df, ., by = c('Street', 'Zip')))

##                 Street       City    State   Zip       lon      lat
## 1              5th Ave   New York New York 10022 -73.97491 40.76167
## 2  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 3 45 Rockefeller Plaza   New York New York 10111 -73.97771 40.75915
## 4          350 5th Ave   New York New York 10118 -73.98566 40.74871
## 5          75 Broadway   New York New York 10006 -74.01205 40.70814
## 6        226 W 46th St   New York New York 10036 -73.98670 40.75902

5.基础R

Base R 可以直接赋值给一个子集，这使得这里的成语简单得多，即使它需要很多子集：

df[is.na(df$lon) | is.na(df$lat), c('lon', 'lat')] <- geocode(paste(df$Street, df$City, df$State, df$Zip)[is.na(df$lon) | is.na(df$lat)])

结果与第一个版本相同。

所有版本只调用geocode两次。

请注意，虽然您可以用于purrr这项工作，但它并不比常规更适合dplyr。purrr擅长处理列表，虽然列表列是一种选择，但实际上并不需要对其进行操作。

score 3 · Accepted Answer

我不确定邻接purrr，但这里是使用管道的以下内容：

df <- frame_data(
  ~Street,                ~City,        ~State,     ~Zip,  ~lon,      ~lat,
  "226 W 46th St",        "New York",   "New York", 10036, -73.9867,  40.75902,
  "5th Ave",              "New York",   "New York", 10022, NA,        NA,
  "75 Broadway",          "New York",   "New York", 10006, -74.01205, 40.70814,
  "350 5th Ave",          "New York",   "New York", 10118, -73.98566, 40.74871,
  "20 Sagamore Hill Rd",  "Oyster Bay", "New York", 11771, NA,        NA,
  "45 Rockefeller Plaza", "New York",   "New York", 10111, -73.97771, 40.75915
)

df2<-df %>%
  filter(is.na(lon) | is.na(lat)) %>%
  group_by(Street, City, State) %>% #not really necessary but it suppresses a warning
  mutate(lon=ifelse(is.na(lon) | is.na(lat), 
    geocode(paste(Street, City,State, sep=" ")), 0)) %>%
  mutate(lat=ifelse(is.na(lon) | is.na(lat), 
    rev(geocode(paste(Street, City,State, sep=" "))), 0))

如果您想要上面示例代码中的部分输出：

as.data.frame(df2)[,5:6]
       lon       lat
1 40.77505 -73.96515
2 40.88259 -73.50538

或包括所有列：

as.data.frame(df2)
              Street       City    State   Zip      lon       lat
1             5th Ave   New York New York 10022 40.77505 -73.96515
2 20 Sagamore Hill Rd Oyster Bay New York 11771 40.88259 -73.50538

如果您想将原始数据与新数据结合起来，您可以执行以下操作：

as.data.frame(rbind(filter(df, !is.na(lon) | !is.na(lat)),df2 ))
                Street       City    State   Zip       lon       lat
1        226 W 46th St   New York New York 10036 -73.98670  40.75902
2          75 Broadway   New York New York 10006 -74.01205  40.70814
3          350 5th Ave   New York New York 10118 -73.98566  40.74871
4 45 Rockefeller Plaza   New York New York 10111 -73.97771  40.75915
5              5th Ave   New York New York 10022  40.77505 -73.96515
6  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.96515  40.77505

...或者您可以像下面这样简化它（保持原始顺序）：

df2<-df %>%
  #group_by(Street, City, State) %>% # unescape if you want to suppress warning
  mutate(lon=ifelse(is.na(lon) | is.na(lat), 
    geocode(paste(Street, City,State, sep=" ")), lon)) %>%
  mutate(lat=ifelse(is.na(lon) | is.na(lat), 
    rev(geocode(paste(Street, City,State, sep=" "))), lat))

as.data.frame(df2)
                Street       City    State   Zip       lon       lat
1        226 W 46th St   New York New York 10036 -73.98670  40.75902
2              5th Ave   New York New York 10022 -73.98670  40.75902
3          75 Broadway   New York New York 10006 -74.01205  40.70814
4          350 5th Ave   New York New York 10118 -73.98566  40.74871
5  20 Sagamore Hill Rd Oyster Bay New York 11771  40.75902 -73.98670
6 45 Rockefeller Plaza   New York New York 10111 -73.97771  40.75915

score 0 · Accepted Answer

使用 dplyr：

df %>% mutate( lon = case_when( is.na(lon) ~ geocode(paste(Street, City, State, Zip))[,1],
                                TRUE       ~ lon),

               lat = case_when( is.na(lat) ~ geocode(paste(Street, City, State, Zip))[,2],
                                TRUE       ~ lat )
               )

r - Purrr-Fection：寻找利用 Purrr 的条件数据帧操作的优雅解决方案

3 回答 3

1.带有列表栏

2. 有do

3. 在一个子集上，用自连接重组

4.在一个子集上mutate_geocode

5.基础R

Related

Reference

2. 有`do`

4.在一个子集上`mutate_geocode`