15
  1. I have a number of PDF documents, which I have read into a corpus with library tm. How can I break the corpus up into sentences?

  2. It can be done by reading the files with readLines and then splitting them with sentSplit from package qdap [*]. That function requires a data frame. It would also mean abandoning the corpus and reading all the files individually.

  3. How can I pass the function sentSplit {qdap} over a corpus in tm? Or is there a better way?

Note: there used to be a function sentDetect in library openNLP, which is now Maxent_Sent_Token_Annotator; the same question applies: how can it be combined with a corpus [tm]?


7 Answers

16

I don't know of a way to reshape a corpus, but that would be a great piece of functionality to have.

I guess my approach would be something like this:

Using these packages:

# Load Packages
require(tm)
require(NLP)
require(openNLP)

I would set up my convert-text-to-sentences function like this:

convert_text_to_sentences <- function(text, lang = "en") {
  # Function to compute sentence annotations using the Apache OpenNLP Maxent sentence detector employing the default model for language 'en'. 
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)

  # Convert text to class String from package NLP
  text <- as.String(text)

  # Sentence boundaries in text
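  # (if ggplot2 is also attached, call NLP::annotate explicitly, since ggplot2 masks annotate)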
  sentence.boundaries <- annotate(text, sentence_token_annotator)

  # Extract sentences
  sentences <- text[sentence.boundaries]

  # return sentences
  return(sentences)
}
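
As a quick standalone check (my own hypothetical snippet, not part of the original answer), the function can be called directly on a character string:

convert_text_to_sentences("The TARDIS lands. The Doctor steps out.")
# should return something like:
# [1] "The TARDIS lands."      "The Doctor steps out."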

And here is my hack at a reshape-corpus function (note: you will lose the meta attributes here unless you modify this function to copy them over appropriately):

reshape_corpus <- function(current.corpus, FUN, ...) {
  # Extract the text from each document in the corpus and put into a list
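  # (Content() is the tm 0.5-era accessor used here; in tm >= 0.6 the equivalent is NLP::content)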
  text <- lapply(current.corpus, Content)

  # Basically convert the text
  docs <- lapply(text, FUN, ...)
  docs <- as.vector(unlist(docs))

  # Create a new corpus structure and return it
  new.corpus <- Corpus(VectorSource(docs))
  return(new.corpus)
}

It works like this:

## create a corpus
dat <- data.frame(doc1 = "Doctor Who is a British science fiction television programme produced by the BBC. The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor. He explores the universe in his TARDIS (acronym: Time and Relative Dimension in Space), a sentient time-travelling space ship. Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired. Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.",
                  doc2 = "The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive (2005–10) awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.[3][4] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor. In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody \"for evolving with technology and the times like nothing else in the known television universe.\"[5]",
                  doc3 = "The programme is listed in Guinness World Records as the longest-running science fiction television show in the world[6] and as the \"most successful\" science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.[7] During its original run, it was recognised for its imaginative stories, creative low-budget special effects, and pioneering use of electronic music (originally produced by the BBC Radiophonic Workshop).",
                  stringsAsFactors = FALSE)

current.corpus <- Corpus(VectorSource(dat))
# A corpus with 3 text documents

## reshape the corpus into sentences (modify this function if you want to keep meta data)
reshape_corpus(current.corpus, convert_text_to_sentences)
# A corpus with 10 text documents

My sessionInfo output:

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
  [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] NLP_0.1-0     openNLP_0.2-1 tm_0.5-9.1   

loaded via a namespace (and not attached):
  [1] openNLPdata_1.5.3-1 parallel_3.0.1      rJava_0.9-4         slam_0.1-29         tools_3.0.1  
Answered 2013-09-13T15:59:51.753
5

There have been some fundamental changes to openNLP. The bad news is that it looks very different from what it used to. The good news is that it is more flexible, and the functionality you enjoyed before is still there; you just have to find it.

This will get you what you're after:

?Maxent_Sent_Token_Annotator

Just work through the examples there and you'll see the function you're looking for.
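
For reference, here is a minimal sketch along the lines of the help-page examples (my own paraphrase, so treat the details as an assumption rather than the documented example verbatim):

library(NLP)
library(openNLP)

s <- as.String("This is sentence one. And this is sentence two.")
sent_annotator <- Maxent_Sent_Token_Annotator()  # defaults to language "en"
boundaries <- annotate(s, sent_annotator)        # NLP::annotate
s[boundaries]  # indexing a String by annotations yields the sentences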

Answered 2013-09-10T13:30:28.777
1

Here is a function built on this Python solution, which allows some flexibility in that the lists of prefixes, suffixes, etc. can be modified to suit your particular text. It is definitely not perfect, but it could be useful with the right text.

caps = "([A-Z])"
prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)\\."
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
starters = "(Mr|Mrs|Ms|Dr|He\\s|She\\s|It\\s|They\\s|Their\\s|Our\\s|We\\s|But\\s|However\\s|That\\s|This\\s|Wherever)"
websites = "\\.(com|edu|gov|io|me|net|org)"
digits = "([0-9])"

split_into_sentences <- function(text){
  text = gsub("\n|\r\n"," ", text)
  text = gsub(prefixes, "\\1<prd>", text)
  text = gsub(websites, "<prd>\\1", text)
  text = gsub('www\\.', "www<prd>", text)
  text = gsub("Ph.D.","Ph<prd>D<prd>", text)
  text = gsub(paste0("\\s", caps, "\\. "), " \\1<prd> ", text)
  text = gsub(paste0(acronyms, " ", starters), "\\1<stop> \\2", text)
  text = gsub(paste0(caps, "\\.", caps, "\\.", caps, "\\."), "\\1<prd>\\2<prd>\\3<prd>", text)
  text = gsub(paste0(caps, "\\.", caps, "\\."), "\\1<prd>\\2<prd>", text)
  text = gsub(paste0(" ", suffixes, "\\. ", starters), " \\1<stop> \\2", text)
  text = gsub(paste0(" ", suffixes, "\\."), " \\1<prd>", text)
  text = gsub(paste0(" ", caps, "\\."), " \\1<prd>",text)
  text = gsub(paste0(digits, "\\.", digits), "\\1<prd>\\2", text)
  text = gsub("...", "<prd><prd><prd>", text, fixed = TRUE)
  text = gsub('\\.”', '”.', text)
  text = gsub('\\."', '\".', text)
  text = gsub('\\!"', '"!', text)
  text = gsub('\\?"', '"?', text)
  text = gsub('\\.', '.<stop>', text)
  text = gsub('\\?', '?<stop>', text)
  text = gsub('\\!', '!<stop>', text)
  text = gsub('<prd>', '.', text)
  sentence = strsplit(text, "<stop>\\s*")
  return(sentence)
}

test_text <- 'Dr. John Johnson, Ph.D. worked for X.Y.Z. Inc. for 4.5 years. He earned $2.5 million when it sold! Now he works at www.website.com.'
sentences <- split_into_sentences(test_text)
names(sentences) <- 'sentence'
df_sentences <- dplyr::bind_rows(sentences) 

df_sentences
# A tibble: 3 x 1
sentence                                                     
<chr>                                                        
1 Dr. John Johnson, Ph.D. worked for X.Y.Z. Inc. for 4.5 years.
2 He earned $2.5 million when it sold!                         
3 Now he works at www.website.com.  
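
For instance (a hypothetical tweak, not part of the original answer), handling military ranks is just a matter of extending the prefix pattern before calling the function, since the function reads these patterns from the global environment:

prefixes <- "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt|Gen|Col|Sgt)\\."
split_into_sentences("Gen. Smith arrived at noon. He left at one.")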
Answered 2018-07-11T01:06:01.280
1

Just convert your corpus into a data frame and use regular expressions to detect the sentences.

Here is a function that uses regular expressions to detect the sentences in a paragraph and returns each individual sentence.

chunk_into_sentences <- function(text) {
  break_points <- c(1, as.numeric(gregexpr('[[:alnum:] ][.!?]', text)[[1]]) + 1)
  sentences <- NULL
  for(i in 1:length(break_points)) {
    res <- substr(text, break_points[i], break_points[i+1])
    # strip the leading punctuation and space carried over from the previous sentence
    if(i > 1) { sentences[i] <- sub('^[.!?] *', '', res) } else { sentences[i] <- res }
  }
  sentences <- sentences[!is.na(sentences)]  # drop the NA produced by the final break point
  return(sentences)
}

...used here with a single paragraph inside a corpus from the tm package.

text <- paste('Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.')
mycorpus <- VCorpus(VectorSource(text))
corpus_frame <- data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=F)
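
(As an aside, with newer tm versions the same extraction can be written more directly; the snippet below is an assumption based on tm >= 0.6, where documents expose their text via NLP::content:)

corpus_frame <- data.frame(text = sapply(mycorpus, NLP::content), stringsAsFactors = FALSE)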

Use it like this:

chunk_into_sentences(corpus_frame$text)

Which gives us:

[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry."                                                                                                                                     
[2] "Lorem Ipsum has been the industry standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."                                       
[3] "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."                                                                                       
[4] "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

And now with a larger corpus:

text1 <- "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."
text2 <- "It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like)."
text3 <- "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc."
text_list <- list(text1, text2, text3)
my_big_corpus <- VCorpus(VectorSource(text_list))

Use it like this:

lapply(my_big_corpus, chunk_into_sentences)

Which gives us:

$`1`
[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry."                                                                                                                                     
[2] "Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."                                      
[3] "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."                                                                                       
[4] "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

$`2`
[1] "It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout."                                                             
[2] "The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English."     
[3] "Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy."

$`3`
[1] "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable."
[2] "If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text."                                                                     
[3] "All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet."                                                       
[4] "It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable."                                                       
[5] "The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc." 
Answered 2017-04-03T20:14:59.990
0

With qdap version 1.1.0 you can accomplish this as follows (I've used @Tony Breyal's current.corpus dataset from above):

library(qdap)
with(sentSplit(tm_corpus2df(current.corpus), "text"), df2tm_corpus(tot, text))

You could also use:

tm_map(current.corpus, sent_detect)


## inspect(tm_map(current.corpus, sent_detect))

## A corpus with 3 text documents
## 
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
##   create_date creator 
## Available variables in the data frame are:
##   MetaID 
## 
## $doc1
## [1] Doctor Who is a British science fiction television programme produced by the BBC.                                                                     
## [2] The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor.                                            
## [3] He explores the universe in his TARDIS, a sentient time-travelling space ship.                                                                        
## [4] Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired.                                    
## [5] Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.
## 
## $doc2
## [1] The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.
## [2] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor.                                                                                                                                                                                                       
## [3] In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody for evolving with technology and the times like nothing else in the known television universe.                                                                                                                                   
## 
## $doc3
## [1] The programme is listed in Guinness World Records as the longest-running science fiction television show in the world and as the most successful science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.
## [2] During its original run, it was recognised for its imaginative stor
Answered 2014-02-26T22:02:23.373
0

I implemented the following code to solve the same problem using the tokenizers package:

# Iterate a list or vector of strings (textList below is your collection of
# document texts) and split each one into sentences
sentences = purrr::map(.x = textList, function(x) {
  return(tokenizers::tokenize_sentences(x))
})

# The code above will return a list of character vectors so unlist
# to give you a character vector of all the sentences
sentences = unlist(sentences)

# Create a corpus from the sentences
corpus = VCorpus(VectorSource(sentences))
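
For example (a made-up two-document input, purely to illustrate the shape of the result):

library(tm)

textList <- c("First sentence here. And a second one.",
              "Another document. It also has two sentences.")

sentences <- unlist(purrr::map(textList, tokenizers::tokenize_sentences))
corpus <- VCorpus(VectorSource(sentences))
length(corpus)  # 4 documents, one per sentence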
Answered 2018-03-12T16:14:53.533
-2

The error is related to the ggplot2 package: ggplot2 also provides an annotate function, which masks NLP::annotate and produces this error. Detach the ggplot2 package and try again; hopefully it will work.
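
Alternatively (a sketch of the usual workaround, not something from this answer), you can keep ggplot2 attached and simply qualify the call:

sentence.boundaries <- NLP::annotate(text, sentence_token_annotator)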

Answered 2016-07-13T14:32:02.523