java - HashingTF does not give unique indices

Question

I am implementing latent semantic analysis (LSA) using Eclipse Mars, Java 8, and Spark (spark-assembly-1.6.1-hadoop2.4.0.jar). I pass the documents in as tokens, then compute the SVD, and so on:

HashingTF hf = new HashingTF(hashingTFSize);
JavaRDD<Vector> ArticlesAsV = hf.transform(articles.map(x -> x.tokens));
JavaRDD<Vector> ArticlesTFIDF = idf.fit(ArticlesAsV).transform(ArticlesAsV);
RowMatrix matTFIDF = new RowMatrix(ArticlesTFIDF.rdd());
double rCond = 1.0E-9d;
int k = 50;
SingularValueDecomposition<RowMatrix, Matrix> svd = matTFIDF.computeSVD(k, true, rCond);

Everything works perfectly except for one thing: when I try to get the index of a term from the HashingTF

int index = hf.indexOf(term);

I find that many terms share the same index. This is what I get:

0: term
1: all
1: next
2: tt
3:
7: document
9: such
9: matrix
11: document
11: about
11: each
12: function
12: chance
14: this
14: provides

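This means that when I use an index to look up the vector of a term, I may get the vector of a different term that has the same index. I did this after lemmatization and stop-word removal and still get the same problem. Is there something I missed, or is a component (for example MLlib) faulty and in need of an update? How can I keep a unique index for each term?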

score 2 · Accepted Answer

Spark's HashingTF class uses the hashing trick.

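A raw feature is mapped to an index (term) by applying a hash function. Term frequencies are then calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collisions, we can increase the target feature dimension, i.e. the number of buckets of the hash table. The default feature dimension is 2^20 = 1,048,576.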

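So groups of terms can share the same index. This is expected behavior of the hashing trick, not a bug or an outdated component.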
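To make the collision effect concrete, here is a minimal plain-Java sketch that needs no Spark. It assumes the bucket index is simply the term's hash code reduced modulo the number of features, which is my understanding of how the 1.6 `mllib` HashingTF computes `indexOf`; the class `HashingTrickSketch` and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HashingTrickSketch {
    // Assumed indexing rule (hypothetical stand-in for HashingTF.indexOf):
    // the term's hash code reduced to a non-negative bucket number.
    static int indexOf(String term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Groups terms by bucket index so that shared buckets become visible.
    static Map<Integer, List<String>> buckets(List<String> terms, int numFeatures) {
        Map<Integer, List<String>> byIndex = new TreeMap<>();
        for (String term : terms) {
            byIndex.computeIfAbsent(indexOf(term, numFeatures), k -> new ArrayList<>()).add(term);
        }
        return byIndex;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("term", "all", "next", "document", "matrix", "function");
        // With only 4 buckets, 6 distinct terms must share at least one index.
        System.out.println(buckets(terms, 4));
        // With 2^20 buckets collisions become rare, but they are never ruled out.
        System.out.println(buckets(terms, 1 << 20));
    }
}
```

Indices as small as 0 to 14 with this many shared buckets suggest a very small `hashingTFSize`. Raising it makes collisions less likely, but no size can guarantee uniqueness, because distinct strings can have identical hash codes.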

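Regarding the comments below: if you need every term to be kept, you can use CountVectorizer instead of HashingTF. CountVectorizer can also be used to obtain term-frequency vectors. To use CountVectorizer followed by IDF, you must work with a DataFrame instead of a JavaRDD, because CountVectorizer is only supported in the ml package.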

Here is an example of a DataFrame with the columns id and words:

id | words
---|----------  
0  | Array("word1", "word2", "word3")  
1  | Array("word1", "word2", "word2", "word3", "word1")

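So if you convert your articles JavaRDD into a DataFrame with the columns id and words, where each row is the bag of words from one sentence or document, you can compute TF-IDF with code like this: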

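import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.sql.DataFrame;

// df is the DataFrame with the columns id and words described above.
CountVectorizerModel cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setVocabSize(100000) // Maximum size of the vocabulary.
  .setMinDF(2) // Minimum number of different documents a term must appear in to be included in the vocabulary.
  .fit(df);

// Term-frequency vectors, built from the same DataFrame the model was fitted on.
DataFrame featurizedData = cvModel.transform(df);

IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(featurizedData);
// Adds the "features" column holding the TF-IDF vectors.
DataFrame rescaledData = idfModel.transform(featurizedData);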
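
Unlike HashingTF, a fitted CountVectorizerModel keeps its vocabulary: `cvModel.vocabulary()` returns the terms as a `String[]`, and a term's position in that array is its index in the `rawFeatures` vectors. A minimal sketch of turning such an array into a term-to-index lookup follows; the hard-coded array stands in for `cvModel.vocabulary()`, and the class name `VocabularyIndexSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class VocabularyIndexSketch {
    // Builds a term -> index lookup; a term's position in the array is its index.
    static Map<String, Integer> termToIndex(String[] vocabulary) {
        Map<String, Integer> lookup = new HashMap<>();
        for (int i = 0; i < vocabulary.length; i++) {
            lookup.put(vocabulary[i], i);
        }
        return lookup;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary; with Spark it would come from cvModel.vocabulary().
        String[] vocabulary = {"word1", "word2", "word3"};
        Map<String, Integer> lookup = termToIndex(vocabulary);
        System.out.println("index of word2: " + lookup.get("word2"));
    }
}
```

Terms excluded by `setMinDF` or cut off by `setVocabSize` are not in the vocabulary, so a lookup for them returns null rather than the index of some other term.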
