I think the first answerer here has the right idea: if tokens split on whitespace, along with their character positions, are the output you want, then the best approach is string handling rather than tokenization and NLP.
If you also want to use tidy data principles and end up with a data frame, try something like this:
library(tidyverse)
df <- data_frame(id = 1,
                 doc = c("Patient:   [** Name **], [** Name **] Acct.#:    [** Medical_Record_Number **] "))

df %>%
  mutate(tokens = str_extract_all(doc, "([^\\s]+)"),
         locations = str_locate_all(doc, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-doc) %>%
  unnest(tokens, locations)
#> # A tibble: 11 x 4
#> id tokens start end
#> <dbl> <chr> <int> <int>
#> 1 1.00 Patient: 1 8
#> 2 1.00 [** 12 14
#> 3 1.00 Name 16 19
#> 4 1.00 **], 21 24
#> 5 1.00 [** 26 28
#> 6 1.00 Name 30 33
#> 7 1.00 **] 35 37
#> 8 1.00 Acct.#: 39 45
#> 9 1.00 [** 50 52
#> 10 1.00 Medical_Record_Number 54 74
#> 11 1.00 **] 76 78
This will work for multiple documents with an id column for each string, and it removes the literal whitespace from the output because of the way the regex is constructed.
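If it helps to see what str_extract_all() and str_locate_all() are doing under the hood, here is the same token-plus-position extraction sketched in plain Python (just a regex illustration, not part of the R answer):

```python
import re

doc = "Patient:   [** Name **], [** Name **] Acct.#:    [** Medical_Record_Number **] "

# Each match of [^\s]+ is a run of non-whitespace characters.
# Python's match.start() is 0-based and match.end() is exclusive, so we
# add 1 to the start to get R-style 1-based, inclusive positions.
tokens = [(m.group(), m.start() + 1, m.end()) for m in re.finditer(r"[^\s]+", doc)]

print(tokens[0])  # ('Patient:', 1, 8)
```

The eleven tuples this produces line up with the eleven rows of the tibble above.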
EDITED: In the comments, the original poster asked for an approach that tokenizes by sentence while keeping track of each word's position. The code below does that, in the sense that we get the start and end position of each token within each sentence. Could you use a combination of the sentenceID column and the start and end columns to find what you're looking for?
library(tidyverse)
library(tidytext)
james <- paste0(
"The question thus becomes a verbal one\n",
"again; and our knowledge of all these early stages of thought and feeling\n",
"is in any case so conjectural and imperfect that farther discussion would\n",
"not be worth while.\n",
"\n",
"Religion, therefore, as I now ask you arbitrarily to take it, shall mean\n",
"for us _the feelings, acts, and experiences of individual men in their\n",
"solitude, so far as they apprehend themselves to stand in relation to\n",
"whatever they may consider the divine_. Since the relation may be either\n",
"moral, physical, or ritual, it is evident that out of religion in the\n",
"sense in which we take it, theologies, philosophies, and ecclesiastical\n",
"organizations may secondarily grow.\n"
)
d <- data_frame(txt = james)
d %>%
  unnest_tokens(sentence, txt, token = "sentences") %>%
  mutate(sentenceID = row_number(),
         tokens = str_extract_all(sentence, "([^\\s]+)"),
         locations = str_locate_all(sentence, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-sentence) %>%
  unnest(tokens, locations)
#> # A tibble: 112 x 4
#> sentenceID tokens start end
#> <int> <chr> <int> <int>
#> 1 1 the 1 3
#> 2 1 question 5 12
#> 3 1 thus 14 17
#> 4 1 becomes 19 25
#> 5 1 a 27 27
#> 6 1 verbal 29 34
#> 7 1 one 36 38
#> 8 1 again; 40 45
#> 9 1 and 47 49
#> 10 1 our 51 53
#> # ... with 102 more rows
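The per-sentence bookkeeping above can be sketched in plain Python as well (again only an illustration, using an abridged version of the James passage; the naive split on ". " is a stand-in for the real sentence tokenizer that unnest_tokens() uses, and the lower() mirrors its default lowercasing):

```python
import re

text = ("The question thus becomes a verbal one again; and our knowledge "
        "of all these early stages of thought and feeling is in any case "
        "so conjectural and imperfect that farther discussion would not "
        "be worth while. Religion, therefore, as I now ask you "
        "arbitrarily to take it, shall mean something for us.")

rows = []
# Naive sentence split after a period -- a stand-in for a real sentence tokenizer.
for sentence_id, sentence in enumerate(re.split(r"(?<=\.)\s+", text), start=1):
    for m in re.finditer(r"[^\s]+", sentence.lower()):
        # 1-based, inclusive positions within the sentence, like str_locate_all().
        rows.append((sentence_id, m.group(), m.start() + 1, m.end()))

print(rows[0])  # (1, 'the', 1, 3)
```

The first rows match the tibble above: ('the', 1, 3), then ('question', 5, 12), and so on, with positions counted from the start of each sentence rather than the whole document.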
Notice that these are not exactly "tokenized" in the usual sense of unnest_tokens(); they still have closing punctuation such as commas and periods attached to each word. It looked like that was what you wanted, from your original question.