algorithm - Using stemming in K-Means clustering

Question

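I am trying to implement the K-Means algorithm and am confused about the vector part.

This is what I have done so far:

For each document, I compute the tf-idf of every word in it and store the values in an STL map. I then use cosine similarity in the algorithm, working with the actual (unstemmed) words.

Where should stemming come in?

Should I stem first and compute the tf-idf of the stems?

Should I use only the stemmed words in the algorithm?

Won't stemming degrade the results?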

score 1 · Accepted Answer

Usually, stemming is done before actually computing the tf-idf for each stem.
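A minimal sketch of that ordering, in Python for brevity (the question itself uses a C++ STL map, for which a dict is the rough analogue). `toy_stem` is a hypothetical, deliberately crude suffix stripper that stands in for a real stemmer such as Porter's; the function names and the plain `tf * log(N / df)` weighting are assumptions for illustration, not something taken from the question.

```python
import math
from collections import Counter


def toy_stem(word):
    """Hypothetical crude suffix stripper; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tfidf_vectors(docs, stem=toy_stem):
    """Stem first, then build one {stem: tf-idf weight} dict per document."""
    stemmed = [[stem(w) for w in doc.lower().split()] for doc in docs]

    # Document frequency is counted on stems, not on the raw words.
    df = Counter()
    for tokens in stemmed:
        df.update(set(tokens))

    n_docs = len(docs)
    vectors = []
    for tokens in stemmed:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["cats jumping", "cat jumped", "dogs barking"]
with_stems = tfidf_vectors(docs)
raw_words = tfidf_vectors(docs, stem=lambda w: w)  # no stemming, for comparison
```

With these three toy documents, the first two share no raw words but share both stems, so their cosine similarity is zero on raw words and close to one on stems. The rest of K-Means is unchanged; only the vocabulary the vectors are built on differs.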

For your last two questions, it depends on what you're doing. Try different methods (stemming, raw words, lemmatization) and see which one yields the best results.

For clustering, take a set of annotated documents, run each method on it, and build a confusion matrix for each method. Comparing those matrices will help you determine the best method for your problem.
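One way to sketch that comparison, assuming you already have an annotated label and a cluster id for every document. The labels and cluster assignments below are made up purely to show the mechanics, and `purity` is just one possible single-number summary of the matrix, not something the answer prescribes.

```python
from collections import Counter


def confusion_matrix(true_labels, cluster_ids):
    """Count documents for each (annotated label, cluster id) pair."""
    return Counter(zip(true_labels, cluster_ids))


def purity(true_labels, cluster_ids):
    """Fraction of documents that belong to their cluster's majority label."""
    best = {}
    for (_label, cluster), count in confusion_matrix(true_labels, cluster_ids).items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(true_labels)


# Made-up annotations and cluster assignments, for illustration only.
labels = ["sport", "sport", "tech", "tech"]
clusters_method_a = [0, 0, 1, 1]  # e.g. output of K-Means on stemmed words
clusters_method_b = [0, 1, 1, 1]  # e.g. output of K-Means on raw words

for name, clusters in [("A", clusters_method_a), ("B", clusters_method_b)]:
    print(name, dict(confusion_matrix(labels, clusters)), purity(labels, clusters))
```

Cluster ids are arbitrary, so the matrix is read per cluster (which label dominates it) rather than along a diagonal as in classification.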

score 1 · Accepted Answer

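It depends on what your clustering goal is.
In a project we once worked on, we needed to provide a match score between two strings in which variations of the same word could appear. We stemmed first and then counted the number of words that matched between the strings. If that kind of matching makes sense for your problem, stemming first is probably a good idea.
Of course, you lose information when you stem, but you gain the ability to reduce some noise.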
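A rough sketch of that kind of match score, under the assumption that "matching words" means distinct stems shared by both strings. `toy_stem` is a hypothetical, deliberately crude suffix stripper standing in for a real stemmer, and `match_score` is an illustrative name, not the project's actual code.

```python
def toy_stem(word):
    """Hypothetical crude suffix stripper; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def match_score(a, b, stem=toy_stem):
    """Number of distinct stems that appear in both strings."""
    stems_a = {stem(w) for w in a.lower().split()}
    stems_b = {stem(w) for w in b.lower().split()}
    return len(stems_a & stems_b)


def identity(word):
    """No stemming, for comparison."""
    return word


print(match_score("walked dogs", "walking dog"))            # stems on both sides
print(match_score("walked dogs", "walking dog", identity))  # raw words only
print(match_score("news", "new"))                           # conflated by the toy stemmer
```

The first two calls show the noise reduction: the strings share no raw words but share both stems. The last call shows the information loss: this toy stemmer maps "news" and "new" to the same stem, so two unrelated words count as a match.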
