r - 成对距离计算嵌套数据框

Question

我正在寻找一种方法来以成对的方式计算点之间的分离距离，并将每个单独点的结果存储在随附的嵌套数据框中。

例如，我有这个数据框（来自 maps 包），其中包含有关我们城市的信息，包括它们的物理位置。我已丢弃其余信息并将坐标嵌套在嵌套数据框中。我打算distHaversine()从geosphere包中使用来计算这些距离。

library(tidyverse)

df <- maps::us.cities %>% 
  slice(1:20) %>% 
  group_by(name) %>% 
  nest(long, lat, .key = coords)

                   name            coords
                  <chr>           <list>
 1           Abilene TX <tibble [1 x 2]>
 2             Akron OH <tibble [1 x 2]>
 3           Alameda CA <tibble [1 x 2]>
 4            Albany GA <tibble [1 x 2]>
 5            Albany NY <tibble [1 x 2]>
 ...(With 15 more rows)

我已经研究过将 map 系列函数与 mutate 结合使用，但我遇到了困难。所需结果的形式如下：

                   name            coords            sep_dist
                  <chr>           <list>            <list>
 1           Abilene TX <tibble [1 x 2]> <tibble [19 x 2]>
 2             Akron OH <tibble [1 x 2]> <tibble [19 x 2]>
 3           Alameda CA <tibble [1 x 2]> <tibble [19 x 2]>
 4            Albany GA <tibble [1 x 2]> <tibble [19 x 2]>
 5            Albany NY <tibble [1 x 2]> <tibble [19 x 2]>
 ...(With 15 more rows)

sep_dist 小标题看起来像这样：

               location  distance
                  <chr>     <dbl> 
 1             Akron OH      1003
 2           Alameda CA       428
 3            Albany GA      3218
 4            Albany NY      3627
 5            Albany OR        97
 ...(With 14 more rows)                       -distances completely made up

其中 location 是与名称进行比较的点（在本例中为 Abilene）。

score 3 · Accepted Answer

geosphere提供比较所有距离的能力distm

可重现的数据

set.seed(1)
df <- data.frame(name=letters[1:4],
                 lon=runif(4)*10,
                 lat=runif(4)*10)

distm

library(geosphere)
ans <- as.data.frame(distm(df[,2:3], df[,2:3], fun=distHaversine))

         # a        b        c        d
# 1      0.0 784506.1 894320.6 877440.5
# 2 784506.1      0.0 226504.3 647666.7
# 3 894320.6 226504.3      0.0 486290.8
# 4 877440.5 647666.7 486290.8      0.0

整理成想要的格式

colnames(ans) <- df$name
library(dplyr)
library(tidyr)
desired <- ans %>%
             gather(pos1, distance) %>%
             mutate(pos2 = rep(df$name, nrow(df))) %>%
             filter(pos1!=pos2) %>%
             select(pos1, pos2, distance)

   # pos1 pos2 distance
# 1     a    b 784506.1
# 2     a    c 894320.6
# 3     a    d 877440.5
# 4     b    a 784506.1
# 5     b    c 226504.3
# 6     b    d 647666.7
# 7     c    a 894320.6
# 8     c    b 226504.3
# 9     c    d 486290.8
# 10    d    a 877440.5
# 11    d    b 647666.7
# 12    d    c 486290.8

score 2 · Accepted Answer

我们可以用位置名称和坐标的所有组合扩展一个“网格”，但删除具有相同位置名称的组合。之后，使用map2_dbl来应用该distHaversine功能。

library(tidyverse)
library(geosphere)

df2 <- df %>%
  # Create the grid
  mutate(name1 = name) %>%
  select(starts_with("name")) %>%
  complete(name, name1) %>%
  filter(name != name1) %>%
  left_join(df, by = "name") %>%
  left_join(df, by = c("name1" = "name")) %>%
  # Grid completed. Calcualte the distance by distHaversine
  mutate(distance = map2_dbl(coords.x, coords.y, distHaversine))

df2
# A tibble: 380 x 5
         name          name1         coords.x         coords.y  distance
        <chr>          <chr>           <list>           <list>     <dbl>
 1 Abilene TX       Akron OH <tibble [1 x 2]> <tibble [1 x 2]> 1881904.4
 2 Abilene TX     Alameda CA <tibble [1 x 2]> <tibble [1 x 2]> 2128576.9
 3 Abilene TX      Albany GA <tibble [1 x 2]> <tibble [1 x 2]> 1470577.2
 4 Abilene TX      Albany NY <tibble [1 x 2]> <tibble [1 x 2]> 2542025.1
 5 Abilene TX      Albany OR <tibble [1 x 2]> <tibble [1 x 2]> 2429367.3
 6 Abilene TX Albuquerque NM <tibble [1 x 2]> <tibble [1 x 2]>  702287.5
 7 Abilene TX  Alexandria LA <tibble [1 x 2]> <tibble [1 x 2]>  700093.2
 8 Abilene TX  Alexandria VA <tibble [1 x 2]> <tibble [1 x 2]> 2161594.6
 9 Abilene TX    Alhambra CA <tibble [1 x 2]> <tibble [1 x 2]> 1718967.5
10 Abilene TX Aliso Viejo CA <tibble [1 x 2]> <tibble [1 x 2]> 1681868.8
# ... with 370 more rows

要创建最终输出，我们可以group_by基于名称和nest所有其他所需的列。

df3 <- df2 %>%
  select(-starts_with("coord")) %>%
  group_by(name) %>%
  nest()

df3
# A tibble: 20 x 2
                   name              data
                  <chr>            <list>
 1           Abilene TX <tibble [19 x 2]>
 2             Akron OH <tibble [19 x 2]>
 3           Alameda CA <tibble [19 x 2]>
 4            Albany GA <tibble [19 x 2]>
 5            Albany NY <tibble [19 x 2]>
 6            Albany OR <tibble [19 x 2]>
 7       Albuquerque NM <tibble [19 x 2]>
 8        Alexandria LA <tibble [19 x 2]>
 9        Alexandria VA <tibble [19 x 2]>
10          Alhambra CA <tibble [19 x 2]>
11       Aliso Viejo CA <tibble [19 x 2]>
12             Allen TX <tibble [19 x 2]>
13         Allentown PA <tibble [19 x 2]>
14             Aloha OR <tibble [19 x 2]>
15          Altadena CA <tibble [19 x 2]>
16 Altamonte Springs FL <tibble [19 x 2]>
17           Altoona PA <tibble [19 x 2]>
18          Amarillo TX <tibble [19 x 2]>
19              Ames IA <tibble [19 x 2]>
20           Anaheim CA <tibble [19 x 2]>

而data现在的每个数据框都是这样的。

df3$data[[1]]
# A tibble: 19 x 2
                  name1  distance
                  <chr>     <dbl>
 1             Akron OH 1881904.4
 2           Alameda CA 2128576.9
 3            Albany GA 1470577.2
 4            Albany NY 2542025.1
 5            Albany OR 2429367.3
 6       Albuquerque NM  702287.5
 7        Alexandria LA  700093.2
 8        Alexandria VA 2161594.6
 9          Alhambra CA 1718967.5
10       Aliso Viejo CA 1681868.8
11             Allen TX  296560.4
12         Allentown PA 2342363.5
13             Aloha OR 2457938.8
14          Altadena CA 1719207.6
15 Altamonte Springs FL 1805480.9
16           Altoona PA 2102993.0
17          Amarillo TX  361520.0
18              Ames IA 1194234.7
19           Anaheim CA 1694698.9

r - 成对距离计算嵌套数据框

2 回答 2

Related

Reference