r - How to tokenize a PDF file in R by n-grams

Question

I want to tokenize a PDF document by n-grams in R. I tried following the instructions at https://www.tidytextmining.com/ngrams.html, but I can't get the unnest_tokens() function to work.

library(tm)
library(dplyr)
library(tidytext)
library(tidyverse)


filedoc <- "Document2019.pdf"
cname <- file.path(filedoc)
docs <- Corpus(URISource(cname), readerControl=list(reader=readPDF, language = "en")) 

docs_bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

I keep getting this error message: Error in UseMethod("unnest_tokens_") : no applicable method for 'unnest_tokens_' applied to an object of class "c('VCorpus', 'Corpus')"

Is there something I need to do before running the unnest_tokens function? Thank you.

score 0 · Accepted Answer

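I took @phiver's suggestion to use the tidy function, and I'm reposting it here as an answer so that this thread can be marked as answered and closed.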

"Use the tidy function before unnest_tokens. Tidytext uses the tidy function to convert tm objects into tibbles."
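
For anyone landing here later, a minimal sketch of that suggestion follows. It assumes the tm, dplyr and tidytext packages are installed. The two sample sentences are made up and only stand in for the PDF's content, so the snippet runs without a file; with the real document, docs would come from the Corpus(URISource(...)) call in the question.

```r
library(tm)
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the PDF: a small in-memory VCorpus.
# With the real file, build `docs` as in the question instead.
sample_text <- c(
  "The quick brown fox jumps over the lazy dog.",
  "Tokenizing by n-grams keeps neighbouring words together."
)
docs <- VCorpus(VectorSource(sample_text))

# unnest_tokens() works on data frames, not on tm corpora,
# so convert the corpus to a tibble first. The document text
# ends up in a column named `text`.
docs_tidy <- tidy(docs)

# Now the bigram tokenization from the question runs as intended.
docs_bigrams <- docs_tidy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

print(docs_bigrams$bigram)
```

The only change from the code in the question is the tidy() step between building the corpus and calling unnest_tokens().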

Thanks!
