I think the first answerer here has the right idea: if tokens split on whitespace, along with their character positions, are the output you want, then the best approach is string handling rather than tokenization and NLP.
If you also want to use tidy data principles and end up with a data frame, try something like this:
library(tidyverse)
df <- data_frame(id = 1,
                 doc = c("Patient:   [** Name **], [** Name **] Acct.#:    [** Medical_Record_Number **] "))

df %>%
  mutate(tokens = str_extract_all(doc, "([^\\s]+)"),
         locations = str_locate_all(doc, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-doc) %>%
  unnest(tokens, locations)
#> # A tibble: 11 x 4
#> id tokens start end
#> <dbl> <chr> <int> <int>
#> 1 1.00 Patient: 1 8
#> 2 1.00 [** 12 14
#> 3 1.00 Name 16 19
#> 4 1.00 **], 21 24
#> 5 1.00 [** 26 28
#> 6 1.00 Name 30 33
#> 7 1.00 **] 35 37
#> 8 1.00 Acct.#: 39 45
#> 9 1.00 [** 50 52
#> 10 1.00 Medical_Record_Number 54 74
#> 11 1.00 **] 76 78
This will work for multiple documents with an id column for each string, and it removes the literal whitespace from the output because of the way the regex is constructed.
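If it helps to see what str_extract_all() and str_locate_all() are doing under the hood, here is the same token-plus-position extraction sketched in plain Python (just a regex illustration, not part of the R answer):

```python
import re

doc = "Patient:   [** Name **], [** Name **] Acct.#:    [** Medical_Record_Number **] "

# Each match of [^\s]+ is a run of non-whitespace characters.
# Python's match.start() is 0-based and match.end() is exclusive, so we
# add 1 to the start to get R-style 1-based, inclusive positions.
tokens = [(m.group(), m.start() + 1, m.end()) for m in re.finditer(r"[^\s]+", doc)]

print(tokens[0])  # ('Patient:', 1, 8)
```

The eleven tuples this produces line up with the eleven rows of the tibble above.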
EDITED: In the comments, the original poster asked for an approach that tokenizes by sentence while keeping track of each word's position. The code below does that, in the sense that we get the start and end position of each token within each sentence. Could you use a combination of the sentenceID column and the start and end columns to find what you're looking for?
library(tidyverse)
library(tidytext)
james <- paste0(
"The question thus becomes a verbal one\n",
"again; and our knowledge of all these early stages of thought and feeling\n",
"is in any case so conjectural and imperfect that farther discussion would\n",
"not be worth while.\n",
"\n",
"Religion, therefore, as I now ask you arbitrarily to take it, shall mean\n",
"for us _the feelings, acts, and experiences of individual men in their\n",
"solitude, so far as they apprehend themselves to stand in relation to\n",
"whatever they may consider the divine_. Since the relation may be either\n",
"moral, physical, or ritual, it is evident that out of religion in the\n",
"sense in which we take it, theologies, philosophies, and ecclesiastical\n",
"organizations may secondarily grow.\n"
)
d <- data_frame(txt = james)
d %>%
  unnest_tokens(sentence, txt, token = "sentences") %>%
  mutate(sentenceID = row_number(),
         tokens = str_extract_all(sentence, "([^\\s]+)"),
         locations = str_locate_all(sentence, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-sentence) %>%
  unnest(tokens, locations)
#> # A tibble: 112 x 4
#> sentenceID tokens start end
#> <int> <chr> <int> <int>
#> 1 1 the 1 3
#> 2 1 question 5 12
#> 3 1 thus 14 17
#> 4 1 becomes 19 25
#> 5 1 a 27 27
#> 6 1 verbal 29 34
#> 7 1 one 36 38
#> 8 1 again; 40 45
#> 9 1 and 47 49
#> 10 1 our 51 53
#> # ... with 102 more rows
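The per-sentence bookkeeping above can be sketched in plain Python as well (again only an illustration, using an abridged version of the James passage; the naive split on ". " is a stand-in for the real sentence tokenizer that unnest_tokens() uses, and the lower() mirrors its default lowercasing):

```python
import re

text = ("The question thus becomes a verbal one again; and our knowledge "
        "of all these early stages of thought and feeling is in any case "
        "so conjectural and imperfect that farther discussion would not "
        "be worth while. Religion, therefore, as I now ask you "
        "arbitrarily to take it, shall mean something for us.")

rows = []
# Naive sentence split after a period -- a stand-in for a real sentence tokenizer.
for sentence_id, sentence in enumerate(re.split(r"(?<=\.)\s+", text), start=1):
    for m in re.finditer(r"[^\s]+", sentence.lower()):
        # 1-based, inclusive positions within the sentence, like str_locate_all().
        rows.append((sentence_id, m.group(), m.start() + 1, m.end()))

print(rows[0])  # (1, 'the', 1, 3)
```

The first rows match the tibble above: ('the', 1, 3), then ('question', 5, 12), and so on, with positions counted from the start of each sentence rather than the whole document.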
Notice that these are not exactly "tokenized" in the usual sense of unnest_tokens(); they still have closing punctuation such as commas and periods attached to each word. It looked like that was what you wanted, from your original question.