r - str_match based on a vector, with a count issue

Question

I don't have a reprex, but my data is stored in a CSV file:

https://transcode.geo.data.gouv.fr/services/5e2a1fbefa4268bc25628f27/feature-types/drac:site?format=CSV&projection=WGS84

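library(readr)
library(dplyr)    # %>%, group_by(), summarise()
library(tidyr)    # drop_na()
library(stringr)  # str_match()
bzh_sites <- read_csv("site.csv")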

I want to count rows based on a string match (NATURE column):

pattern<-c("allée|aqueduc|architecture|atelier|bas|carrière|caveau|chapelle|château|chemin|cimetière|coffre|dépôt|dolmen|eau|église|enceinte|enclos|éperon|espace|exploitation|fanum|ferme|funéraire|groupe|habitat|maison|manoir|menhir|monastère|motte|nécropole|occupation|organisation|parcellaire|pêcherie|prieuré|production|rue|sépulture|stèle|thermes|traitement|tumulus|villa")


test2 <- bzh_sites %>%
  drop_na(NATURE) %>%
  group_by(NATURE = str_match(NATURE, pattern)) %>%
  summarise(n = n())

This gives me:

NATURE  n
1   allée   176
2   aqueduc 73
3   architecture    68
4   atelier 200

And another test on the same data (NATURE), with a shorter pattern:

pattern <- c("allée|aqueduc|architecture|atelier")

test2 <- bzh_sites %>%
  drop_na(NATURE) %>%
  group_by(NATURE = str_match(NATURE, pattern)) %>%
  summarise(n = n())

This gives me:

NATURE    n
1   allée   178
2   aqueduc 74
3   architecture    79
4   atelier 248

I don't understand why the counts differ.

score 1 · Accepted Answer

I tried to find where the difference comes from for the first group, i.e. "allée". This is what I found:

library(stringr)

pattern1<-c("allée|aqueduc|architecture|atelier|bas|carrière|caveau|chapelle|château|chemin|cimetière|coffre|dépôt|dolmen|eau|église|enceinte|enclos|éperon|espace|exploitation|fanum|ferme|funéraire|groupe|habitat|maison|manoir|menhir|monastère|motte|nécropole|occupation|organisation|parcellaire|pêcherie|prieuré|production|rue|sépulture|stèle|thermes|traitement|tumulus|villa")

#Get indices where 'allée' is found using pattern1
ind1 <- which(str_match(bzh_sites$NATURE, pattern1 )[, 1] == 'allée')

pattern2 <- c("allée|aqueduc|architecture|atelier")
#Get indices where 'allée' is found using pattern2
ind2 <- which(str_match(bzh_sites$NATURE, pattern2)[, 1] == 'allée')

#Indices which are present in ind2 but absent in ind1
setdiff(ind2, ind1)
#[1]  3093 10400

#Get corresponding text
temp <- bzh_sites$NATURE[setdiff(ind2, ind1)]
temp
#[1] "dolmen allée couverte"           "coffre funéraire allée couverte"

Let's see what happens when we use pattern1 and pattern2 on temp:

str_match(temp, pattern1)
#         [,1]    
#[1,] "dolmen"
#[2,] "coffre"

str_match(temp, pattern2)

#       [,1]   
#[1,] "allée"
#[2,] "allée"

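As we can see, with pattern1 some values are classified into a different group, because another term of the pattern ("dolmen", "coffre") occurs earlier in the string than "allée". With pattern2 those terms are not in the pattern, so the same rows are counted under "allée". Hence the mismatch.

A similar explanation applies to the mismatches in the other groups.

str_match returns only the first match in each string. To get all matches of the pattern, we can use str_match_all: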
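The effect can be reproduced without the CSV file. The following is a minimal sketch on a toy vector; the names nature, short_pattern and long_pattern are made up for illustration, and the strings are modeled on the two rows found above.

```r
library(stringr)

# Toy data (hypothetical), modeled on the rows found with setdiff()
nature <- c("dolmen allée couverte",
            "coffre funéraire allée couverte",
            "allée couverte")

short_pattern <- "allée|aqueduc"
long_pattern  <- "allée|aqueduc|coffre|dolmen|funéraire"

# str_match() keeps the leftmost match in each string; the position of a
# term inside the alternation does not give it priority
first_short <- str_match(nature, short_pattern)[, 1]
first_long  <- str_match(nature, long_pattern)[, 1]

table(first_short)
table(first_long)
```

With the short pattern all three strings fall under "allée"; with the long pattern the first two are assigned to "dolmen" and "coffre", so the "allée" count drops.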

table(unlist(str_match_all(bzh_sites$NATURE, pattern1)))

#       allée      aqueduc architecture      atelier          bas 
#         178           76           79          252           62 
#    carrière       caveau     chapelle      château       chemin 
#          46           35          226          205          350 
#   cimetière       coffre        dépôt       dolmen          eau 
#         275          155          450          542          114 
#      église     enceinte       enclos       éperon       espace 
#         360          655          338          114          102 
#exploitation        fanum        ferme    funéraire       groupe 
#        1856           38          196         1256          295 
#     habitat       maison       manoir       menhir    monastère 
#        1154           65          161         1036           31 
#       motte    nécropole   occupation organisation  parcellaire 
#         566          312         5152           50          492 
#    pêcherie      prieuré   production          rue    sépulture 
#          69           66          334           44          152 
#       stèle      thermes   traitement      tumulus        villa 
#         651           50          119         1232          225
