r - R unnest with sentence start and end positions

Question

New to R. I am using tidytext::unnest_tokens to break a long text into individual sentences, as follows:

tidy_drugs <- drugstext.raw %>% unnest_tokens(sentence, Section, token="sentences")

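As a result, I get a data.frame in which every sentence has become a row.

I would like to get the start and end position of each sentence that was unnested from the long text.

Here is an example of the long text file. It comes from a drug label.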

<< *6.1 Clinical Trial Experience
  Because clinical trials are conducted under widely varying conditions, adverse reaction rates observed in clinical trials of a drug cannot be directly compared to rates in the clinical trials of another drug and may not reflect the rates observed in practice.
 The data below reflect exposure to ARDECRETRIS as monotherapy in 327 patients with classical Hodgkin lymphoma (HL) and systemic anaplastic large cell lymphoma (sALCL), including 160 patients in two uncontrolled single-arm trials (Studies 1 and 2) and 167 patients in one placebo-controlled randomized trial (Study 3).
 In Studies 1 and 2, the most common adverse reactions were neutropenia, fatigue, nausea, anemia, cough, and vomiting.*

The desired result is a data frame with three columns.

[image: data frame]

score 1 · Accepted Answer

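You can do this with str_locate from stringr. It is annoying mostly because newlines and special characters can break the regular expression you search with. Here we first remove newlines from the input text with str_replace_all, then unnest the tokens, making sure to keep the original text (drop = FALSE) and to prevent case changes (to_lower = FALSE). We then create a new regex column in which the special characters (here (, ) and .) are replaced with properly escaped versions, and use str_locate to add the start and end of each string.

I don't get the same numbers as you, but I copied the text from your code, which doesn't always preserve all characters, and your final end number is smaller than start anyway.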
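To see why the escaping step matters, here is a minimal sketch using a made-up string (the variable text is hypothetical and not taken from the question). An unescaped parenthesis is read as a regex group, so the match silently shifts by one character at each end instead of failing.

```r
library(stringr)

# Hypothetical snippet, used only for illustration
text <- "trials (Studies 1 and 2)"

# Unescaped: "(" and ")" form a group, so the brackets themselves are not matched
unescaped <- str_locate(text, "(Studies 1 and 2)")

# Escaped: "\\(" and "\\)" match the literal brackets
escaped <- str_locate(text, "\\(Studies 1 and 2\\)")

# Show which part of the text each pattern actually matched
str_sub(text, unescaped[1, "start"], unescaped[1, "end"])
str_sub(text, escaped[1, "start"], escaped[1, "end"])
```

The first call returns the text without the surrounding brackets, which is why unescaped sentences can give start and end values that are slightly off.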

library(tidyverse)
library(tidytext)

raw_text <- tibble(section = "6.1 Clinical Trial Experience
  Because clinical trials are conducted under widely varying conditions, adverse reaction rates observed in clinical trials of a drug cannot be directly compared to rates in the clinical trials of another drug and may not reflect the rates observed in practice.
                   The data below reflect exposure to ARDECRETRIS as monotherapy in 327 patients with classical Hodgkin lymphoma (HL) and systemic anaplastic large cell lymphoma (sALCL), including 160 patients in two uncontrolled single-arm trials (Studies 1 and 2) and 167 patients in one placebo-controlled randomized trial (Study 3).
                   In Studies 1 and 2, the most common adverse reactions were neutropenia, fatigue, nausea, anemia, cough, and vomiting."
)

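tidy_text <- raw_text %>%
  # Drop newlines so they cannot break the sentence patterns
  mutate(section = str_replace_all(section, "\\n", "")) %>%
  # One row per sentence; keep the full text (drop = FALSE) and the original case
  unnest_tokens(
    output = sentence,
    input = section,
    token = "sentences",
    drop = FALSE,
    to_lower = FALSE
    ) %>%
  # Escape the regex metacharacters that occur in this text: ( ) .
  mutate(
    regex = str_replace_all(sentence, "\\(", "\\\\("),
    regex = str_replace_all(regex, "\\)", "\\\\)"),
    regex = str_replace_all(regex, "\\.", "\\\\.")
  ) %>%
  # Locate each sentence in the full text; columns 1 and 2 are start and end
  mutate(
    start = str_locate(section, regex)[, 1],
    end = str_locate(section, regex)[, 2]
  ) %>%
  select(sentence, start, end) %>%
  print()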
#> # A tibble: 3 x 3
#>   sentence                                                     start   end
#>   <chr>                                                        <int> <int>
#> 1 6.1 Clinical Trial Experience  Because clinical trials are ~     1   290
#> 2 The data below reflect exposure to ARDECRETRIS as monothera~   310   626
#> 3 In Studies 1 and 2, the most common adverse reactions were ~   646   762

Created on 2018-02-23 by the reprex package (v0.2.0).
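As a possible alternative to escaping characters by hand, stringr's fixed() modifier makes str_locate compare the pattern literally, so other metacharacters such as [ ] $ ? need no special handling either. This is a sketch of a swapped-in technique, not the code from the answer above; the section and sentences values are invented for illustration.

```r
library(stringr)

# Hypothetical full text and its sentences, containing several regex metacharacters
section <- "Dose was 5 mg (twice daily). Rates varied [see Table 2]. Cost: $10?"
sentences <- c(
  "Dose was 5 mg (twice daily).",
  "Rates varied [see Table 2].",
  "Cost: $10?"
)

# fixed() turns off regex interpretation, so each sentence is searched for as-is
pos <- str_locate(section, fixed(sentences))

result <- data.frame(
  sentence = sentences,
  start = pos[, "start"],
  end = pos[, "end"]
)

# Sanity check: cutting the text at start..end gives back each sentence
all(substr(section, result$start, result$end) == sentences)
```

In the pipeline above this would amount to calling str_locate(section, fixed(sentence)) and dropping the regex column. Note that str_locate only reports the first match, so a sentence that occurs twice in the text would get the same position both times.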
