r - Is there a faster way to join/concatenate two tokens in R?

Asked 2017-12-30T23:26:39.083

620 views

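I am working with EMR data. Many entities in medical records are split into two separate words (for example, CT Scan), and I plan to join such tokens into a single word with an underscore (CT_Scan). Is there a faster way to perform this task on a huge corpus? My current approach uses the "quanteda" package. Here is the code snippet: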

    library(quanteda)

    # Sample text
    mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
                 "New York City has raised taxes: an income tax and inheritance taxes.")

    # Tokenize into words, dropping punctuation
    mytoks <- tokens(mytexts, remove_punct = TRUE)

    # List of token sequences that need to be joined
    myseqs <- list(c("tax"), c("income", "tax"), c("capital", "gains", "tax"), c("inheritance", "tax"))

    # New tokens object with the matched sequences concatenated by "_"
    clean_toks <- tokens_compound(mytoks, myseqs)

This task runs on roughly 3 billion tokens, and the `tokens_compound` function takes a very long time (>12 hours). Is there a better way to solve this problem?
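For comparison, one possible alternative is to join the phrases in the raw text before tokenizing, using only base R regular expressions. This is a rough sketch and not part of quanteda: the helper `compound_phrases` and its arguments are hypothetical names, it assumes the phrases contain only letters and digits separated by single spaces (no regex metacharacters), and matching is case-sensitive. Whether it is actually faster on a corpus of this size would need to be benchmarked.

```r
# Hypothetical helper: replace each multi-word phrase in `texts` with a
# single underscore-joined token, using base R regular expressions only.
compound_phrases <- function(texts, phrases, concatenator = "_") {
  # Handle longer phrases first so they are joined before any shorter
  # phrase that overlaps with them.
  phrases <- phrases[order(nchar(phrases), decreasing = TRUE)]
  for (p in phrases) {
    words <- strsplit(p, " ", fixed = TRUE)[[1]]
    # Single-word entries have nothing to join.
    if (length(words) < 2) next
    # Match the words as whole words, separated by any run of whitespace.
    # Assumes the words contain no regex metacharacters.
    pattern <- paste0("\\b", paste(words, collapse = "\\s+"), "\\b")
    replacement <- paste(words, collapse = concatenator)
    texts <- gsub(pattern, replacement, texts, perl = TRUE)
  }
  texts
}

mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
             "New York City has raised taxes: an income tax and inheritance taxes.")
phrases <- c("tax", "income tax", "capital gains tax", "inheritance tax")

compounded <- compound_phrases(mytexts, phrases)
print(compounded)
```

The resulting character vector can then be tokenized as usual. Because each phrase costs one pass over the text, this sketch is likely to suit a short phrase list better than a very long one.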
