I'm new to R. I'm using tidytext::unnest_tokens as follows
to break a long text into individual sentences:
library(dplyr)
library(tidytext)
tidy_drugs <- drugstext.raw %>%
  unnest_tokens(sentence, Section, token = "sentences")
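As a result, I get a data.frame in which every sentence becomes its own row.
I would like to get the start and end position of each sentence that was unnested from the long text.
Here is a sample of the long text file. It comes from a drug label.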
<< *6.1 Clinical Trial Experience
Because clinical trials are conducted under widely varying conditions, adverse reaction rates observed in clinical trials of a drug cannot be directly compared to rates in the clinical trials of another drug and may not reflect the rates observed in practice.
The data below reflect exposure to ARDECRETRIS as monotherapy in 327 patients with classical Hodgkin lymphoma (HL) and systemic anaplastic large cell lymphoma (sALCL), including 160 patients in two uncontrolled single-arm trials (Studies 1 and 2) and 167 patients in one placebo-controlled randomized trial (Study 3).
In Studies 1 and 2, the most common adverse reactions were neutropenia, fatigue, nausea, anemia, cough, and vomiting.*
The desired result is a data frame with three columns
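
To make the target shape concrete, here is a minimal sketch of one possible way to produce such a data frame. It does not use unnest_tokens itself, which returns only the sentences; instead it assumes that the stringi package is available and that its ICU sentence boundaries are an acceptable stand-in for the tidytext split, which may lowercase the text and break sentences slightly differently. The helper name sentence_positions and the column names sentence, start, and end are hypothetical, chosen for illustration only.

```r
library(stringi)

# Hypothetical helper: find the sentences in a single string and return
# a three-column data frame (sentence, start, end).
# Positions are 1-based character offsets into `text`.
sentence_positions <- function(text) {
  # ICU sentence segments; each segment keeps its trailing whitespace
  loc <- stri_locate_all_boundaries(text, type = "sentence")[[1]]
  raw <- stri_sub(text, loc[, "start"], loc[, "end"])
  # Drop trailing whitespace so `end` points at the last visible character
  trimmed <- stri_trim_right(raw)
  data.frame(
    sentence = trimmed,
    start = loc[, "start"],
    end = loc[, "start"] + stri_length(trimmed) - 1L,
    stringsAsFactors = FALSE
  )
}

# Made-up sample text, not the drug label itself
txt <- "Rates vary widely. Data reflect exposure in 327 patients. Nausea was common."
pos <- sentence_positions(txt)
print(pos)
```

Under these assumptions, substring(txt, pos$start, pos$end) gives back the sentence column, which is a quick way to check the offsets. For a data frame such as drugstext.raw, the helper would have to be applied to each value of Section separately.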