I ran into a strange problem when trying to apply tf-idf to my corpus.
Here is my code:
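library(text2vec)  # itoken, create_vocabulary, create_dtm, TfIdf
library(stringr)   # str_to_lower, str_replace_all, %>%
library(tm)        # assumption: stopwords() with a "SMART" list comes from tm

prep_fun <- function(x) {
  x %>%
    # make text lower case
    str_to_lower() %>%
    # remove tags
    str_replace_all("<.*?>", " ") %>%
    # replace everything except ASCII letters, digits and punctuation
    str_replace_all("[^a-zA-Z0-9[:punct:]]", " ") %>%
    # remove URLs
    str_replace_all("(f|ht)tp(s?)://(.*)[.][a-z]+", " ") %>%
    # remove parentheses, section signs and separator punctuation
    str_replace_all("\\(", " ") %>%
    str_replace_all("\\)", " ") %>%
    str_replace_all("§", " ") %>%
    str_replace_all(" \\. ", " ") %>%
    str_replace_all("[\\.;:,+-] ", " ") %>%
    str_replace_all("/", " ") %>%
    # remove standalone numbers
    str_replace_all("\\s*(?<!\\B|-)\\d+(?!\\B|-)\\s*", " ") %>%
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}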
stem_tokenizer = function(x) {
  # split into words, then stem each document's tokens
  word_tokenizer(x) %>%
    lapply(function(tokens) SnowballC::wordStem(tokens, language = "en"))
}
it = itoken(as.character(train$text),
            preprocessor = prep_fun,
            progressbar = TRUE,
            tokenizer = stem_tokenizer,
            ids = train$id)
v = create_vocabulary(it, stopwords = c(stopwords("english"), stopwords("SMART"))) %>%
  prune_vocabulary(term_count_min = 3)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tfidf = TfIdf$new()
dtm_train_tfidf = fit_transform(dtm, tfidf)
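When I run it, it fails at the fit_transform step with the following message:
'names' attribute [90214] must be the same length as the vector [10]
Has anyone run into this problem?
Thanks!
Update: I did the same thing with the movie review dataset: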
it <- itoken(movie_review$review, prep_fun, stem_tokenizer, ids = movie_review$id)
v = create_vocabulary(it, stopwords = c(stopwords("english"), stopwords("SMART"))) %>% prune_vocabulary(term_count_min = 3)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tfidf = TfIdf$new()
dtm_train_tfidf = fit_transform(dtm, tfidf)
Error in .local(x, na.rm, dims, ...) : 'names' attribute [5000] must be the same length as the vector [10]