scala - Applying the LSH method using a sparse matrix instead of a dense matrix

Question

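I am trying to apply LSH (https://github.com/soundcloud/cosine-lsh-join-spark) to compute the cosine similarity of some vectors. My real data has 2M rows (documents) and 30K features belonging to them. The matrix is also very sparse. As an example, say my data looks like this: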

D1 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D2 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
D3 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1
D4 ...

In the related code, the features are put into a dense vector, as follows:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.IndexedRow
    import org.apache.spark.storage.StorageLevel

    val input = "text.txt"
    val numPartitions = 8 // assumed value; not given in the question
    val conf = new SparkConf()
      .setAppName("LSH-Cosine")
      .setMaster("local[4]")
    val storageLevel = StorageLevel.MEMORY_AND_DISK
    val sc = new SparkContext(conf)

    // read in an example data set of word embeddings
    val data = sc.textFile(input, numPartitions).map {
      line =>
        val split = line.split(" ")
        val word = split.head
        val features = split.tail.map(_.toDouble)
        (word, features)
    }

    // create a unique id for each word by zipping with the RDD index
    val indexed = data.zipWithIndex.persist(storageLevel)

    // create indexed row matrix where every row represents one word
    val rows = indexed.map {
      case ((word, features), index) =>
        IndexedRow(index, Vectors.dense(features))
    }

What I want to do is use a sparse matrix instead of a dense one. How should I adjust "Vectors.dense(features)"?

score 0 · Accepted Answer

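The equivalent factory method for sparse vectors is Vectors.sparse, which takes the vector size, an array of indices, and a matching array of the non-zero values at those indices. The method signatures in the cosine-lsh-join-spark library are written against the generic Vector class, so the library appears to accept either sparse or dense vectors.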
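A minimal sketch of that change, assuming the input file keeps the dense text layout from the question (a label followed by one number per feature). The object name SparseRowSketch and the helper toSparseVector are hypothetical names chosen for illustration; only Vectors.sparse comes from Spark MLlib.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object SparseRowSketch {
  // Hypothetical helper: keep only the non-zero entries of one dense row.
  // zipWithIndex preserves order, so the indices come out strictly increasing,
  // which is what Vectors.sparse expects.
  def toSparseVector(features: Array[Double]): Vector = {
    val nonZero = features.zipWithIndex.filter { case (value, _) => value != 0.0 }
    val indices = nonZero.map { case (_, index) => index }
    val values = nonZero.map { case (value, _) => value }
    Vectors.sparse(features.length, indices, values)
  }

  def main(args: Array[String]): Unit = {
    // Row D1 from the example data in the question
    val line = "D1 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
    val features = line.split(" ").tail.map(_.toDouble)
    println(toSparseVector(features))
  }
}
```

In the question's code this would mean writing IndexedRow(index, SparseRowSketch.toSparseVector(features)) in place of IndexedRow(index, Vectors.dense(features)). With 30K mostly-zero features per row, it may be worth storing the input as index:value pairs so that the dense array is never built during parsing, but that depends on how the data is produced. Whether the library keeps the vectors sparse internally is not something the answer confirms.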
