I have a dataset with more than 2,300,000 observations. One variable holds a free-text description, and the sentences can be long, so across all those observations you can imagine how many words there are. I want to produce an output (a data frame) with every word that appears in this variable, sorted from most frequent to least frequent. However, I don't want to count certain words such as "and", "street", "the", and so on.
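To make the goal concrete, here is a toy stand-in for my data (the real column is called cdescription; these rows are invented):

library(tibble)

# invented sample rows standing in for the real 2,300,000 descriptions
df <- tibble(cdescription = c("casa en el barrio San Antonio",
                              "apartamento cerca de la calle principal",
                              "casa amplia en la calle del parque"))
# desired output: one row per distinct word with its count, sorted
# descending, with words like "en", "el", "la", "del" left out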
I have tried two pieces of code:
library(tm)
library(dplyr)

descri1tm <- df$cdescription %>%
  # Transforming into a corpus #
  VectorSource() %>%
  Corpus() %>%
  # Cleaning the corpus #
  tm_map(content_transformer(tolower)) %>%  # lowercase
  tm_map(stripWhitespace) %>%               # extra whitespace
  tm_map(removeNumbers) %>%                 # numbers
  tm_map(removePunctuation) %>%             # punctuation
  tm_map(removeWords,
         c(stopwords("spanish"),
           "cale", "barrio", "y", "al", "en", "la", "el", "entre", "del")) %>%  # words we don't care about
  # Transforming into a matrix #
  TermDocumentMatrix() %>%
  as.matrix() %>%
  as.data.frame() %>%
  mutate(name = row.names(.)) %>%
  arrange(desc(`1`))  # sorts by the counts in document 1 only
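Looking at it again, arrange(desc(`1`)) only sorts by the first document's column, and as.matrix() on 2,300,000 documents is presumably what eats all the memory. Here is a sketch of what I think I actually need, summing each term over all documents directly on the sparse matrix (this assumes the TermDocumentMatrix is stored as tdm before any conversion; slam is the package tm uses for its sparse matrices):

library(slam)

# assumes `tdm` is the TermDocumentMatrix built in the pipeline above;
# row_sums() works on tm's sparse representation, so the giant
# terms-by-documents matrix never has to be expanded with as.matrix()
freqs <- slam::row_sums(tdm)

word_freq <- data.frame(word = names(freqs), n = as.integer(freqs))
word_freq <- word_freq[order(-word_freq$n), ]
head(word_freq, 10)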
library(tidytext)

# Creating the data frame #
tidytext <- tibble(line = 1:nrow(df), Description = df$cdescription)

# Frequency analysis #
tidytext <- tidytext %>%
  unnest_tokens(word, Description) %>%    # one row per word occurrence
  anti_join(stop_words, by = "word") %>%  # drop stopwords (English lexicons)
  count(word, sort = TRUE)                # count and sort descending

head(tidytext, 10)
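One thing I realised while writing this question: tidytext's stop_words table only contains English lexicons, so it would not remove Spanish words like "el" or "entre" anyway. Here is a sketch of what I was trying to achieve with a custom stopword table (reusing tm's Spanish list plus my own words):

# custom stopword table: tm's Spanish stopwords plus my own unwanted words
my_stops <- tibble(word = c(tm::stopwords("spanish"),
                            "cale", "barrio", "entre"))

df %>%
  select(Description = cdescription) %>%
  unnest_tokens(word, Description) %>%
  anti_join(my_stops, by = "word") %>%
  count(word, sort = TRUE)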
I don't think this approach is up to the size of my data: R ran for 24 hours without producing a result... So I tried this instead (found here):
allwords <- df %>% stringr::str_glue_data("{rownames(.)} cdescription: {cdescription}")
# function to count words in a string #
countwords <- function(strings) {
  # collapse runs of two or more spaces into one
  wr <- gsub(pattern = " {2,}", replacement = " ", x = strings)
  # remove line breaks
  wn <- gsub(pattern = "\n", replacement = " ", x = wr)
  # remove punctuation
  ws <- gsub(pattern = "[[:punct:]]", replacement = "", x = wn)
  # split each string into words (note: this returns a list)
  wsp <- strsplit(ws, " ")
  # tabulate the words and sort by decreasing frequency
  wst <- data.frame(sort(table(wsp, exclude = ""), decreasing = TRUE))
  wst
}
all_words <- countwords(allwords)
With this one there are two problems: I can't see how to exclude certain words, and I keep getting the following error message:
Error in table(wsp, exclude = "") : all arguments must have the same length
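The smallest reproduction of that error I could come up with is below. My guess is it happens because strsplit() returns a list, and table() then treats each list element as a separate cross-classifying factor, which fails when the descriptions have different word counts. Flattening with unlist() seems to avoid the error, though I'm not sure it's the right fix, and it still doesn't let me exclude words:

wsp <- strsplit(c("one two three", "four five"), " ")

# table(wsp, exclude = "")
# Error in table(wsp, exclude = "") :
#   all arguments must have the same length

# flattening the list first gives plain word counts instead
table(unlist(wsp), exclude = "")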
Does anyone have an idea? Please be kind, it's my very first time working with a dataset this size, and data science is not my specialty at all!