
This is quite long, and I apologize for that in advance.

I have been trying to implement the MinHash LSH algorithm discussed in Chapter 3 using Spark (Java). I am using a toy problem like this:

+--------+------+------+------+------+
|element | doc0 | doc1 | doc2 | doc3 |
+--------+------+------+------+------+
|   d    |   1  |  0   |  1   |  1   |
|   c    |   0  |  1   |  0   |  1   |
|   a    |   1  |  0   |  0   |  1   |
|   b    |   0  |  0   |  1   |  0   |
|   e    |   0  |  0   |  1   |  0   |  
+--------+------+------+------+------+

The goal is to identify, among these four documents (doc0, doc1, doc2, and doc3), which documents are similar to each other. Obviously, the only possible candidate pair is doc0 and doc3.

Using Spark's support, generating the following "characteristic matrix" is as far as I have gotten at this point:

+----+---------+-------------------------+
|key |value    |vector                   |
+----+---------+-------------------------+
|key0|[a, d]   |(5,[0,2],[1.0,1.0])      |
|key1|[c]      |(5,[1],[1.0])            |
|key2|[b, d, e]|(5,[0,3,4],[1.0,1.0,1.0])|
|key3|[a, c, d]|(5,[0,1,2],[1.0,1.0,1.0])|
+----+---------+-------------------------+

Here are the code snippets:

import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

CountVectorizer vectorizer = new CountVectorizer()
    .setInputCol("value")
    .setOutputCol("vector")
    .setBinary(false);
Dataset<Row> matrixDoc = vectorizer.fit(df).transform(df);

MinHashLSH mh = new MinHashLSH()
    .setNumHashTables(5)
    .setInputCol("vector")
    .setOutputCol("hashes");

MinHashLSHModel model = mh.fit(matrixDoc);

Now, there seem to be two main calls on the MinHashLSHModel that one can use: model.approxSimilarityJoin(...) and model.approxNearestNeighbors(...). Examples of using these two calls are here: https://spark.apache.org/docs/latest/ml-features.html#lsh-algorithms

On the other hand, model.approxSimilarityJoin(...) requires joining two datasets, and I have only one dataset of four documents; I want to find out which of these four are similar to each other, so I don't have a second dataset to join. Just to try it out, I joined my only dataset with itself. Based on the result, it seems like model.approxSimilarityJoin(...) just did a pairwise Jaccard calculation, and I don't see any impact from changing the number of hash functions, etc., which left me wondering where exactly the MinHash signatures were calculated and where the band/row partition happened...
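For what it's worth, the self-join observation can be checked independently of Spark. Here is a minimal plain-Java sketch of what a self-join with deduplication amounts to on this toy data: generate each unordered pair once (i < j) and compute the exact Jaccard similarity, with the element sets read off the characteristic matrix above. The class and variable names are my own, not anything from Spark.

```java
import java.util.*;

public class JaccardSelfJoin {
    // Exact Jaccard similarity |A ∩ B| / |A ∪ B|.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Element sets read off the characteristic matrix above.
        List<Set<String>> docs = Arrays.asList(
            new HashSet<>(Arrays.asList("a", "d")),       // doc0
            new HashSet<>(Arrays.asList("c")),            // doc1
            new HashSet<>(Arrays.asList("b", "d", "e")),  // doc2
            new HashSet<>(Arrays.asList("a", "c", "d"))); // doc3

        // "Self-join" without duplicate pairs: iterate i < j only.
        for (int i = 0; i < docs.size(); i++) {
            for (int j = i + 1; j < docs.size(); j++) {
                System.out.printf("doc%d/doc%d: %.3f%n", i, j,
                    jaccard(docs.get(i), docs.get(j)));
            }
        }
    }
}
```

doc0/doc3 comes out on top at 2/3. Note that Spark's approxSimilarityJoin reports Jaccard *distance* in its distance column, so the same pair would appear there as roughly 1 − 2/3 ≈ 0.333.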

The other call, model.approxNearestNeighbors(...), requires a query point, and the model then identifies the nearest neighbor(s) to that point. Obviously, this is not what I want either: I have four toy documents and no extra reference point.

Running out of ideas, I went ahead and implemented my own version of the algorithm using the Spark APIs, but with little support from MinHashLSHModel, which really made me feel bad. I am thinking I must have missed something...
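For reference, here is a self-contained plain-Java sketch of the two pieces asked about above: signature computation over a family of hash functions of the form (a·x + b) mod p, and the band/row partition that buckets candidate pairs. The hash coefficients, band size, and element numbering (a=0 … e=4) are arbitrary toy choices of mine, not what Spark uses internally.

```java
import java.util.*;

public class MinHashBanding {
    // Toy universe: a=0, b=1, c=2, d=3, e=4.
    static final int P = 5;              // prime >= universe size
    static final int[] A = {1, 2, 3, 4}; // arbitrary hash coefficients
    static final int[] B = {1, 0, 2, 3};

    // MinHash signature: per hash function, min of h(x) over the doc's elements.
    static int[] signature(Set<Integer> doc) {
        int[] sig = new int[A.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int i = 0; i < A.length; i++)
            for (int x : doc)
                sig[i] = Math.min(sig[i], (A[i] * x + B[i]) % P);
        return sig;
    }

    // Band/row partition: docs whose signatures agree on an entire band
    // land in the same bucket and become candidate pairs.
    static Set<String> candidatePairs(int[][] sigs, int rowsPerBand) {
        Set<String> pairs = new TreeSet<>();
        for (int start = 0; start < A.length; start += rowsPerBand) {
            Map<String, List<Integer>> buckets = new HashMap<>();
            for (int d = 0; d < sigs.length; d++) {
                String key = Arrays.toString(
                    Arrays.copyOfRange(sigs[d], start, start + rowsPerBand));
                buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(d);
            }
            for (List<Integer> bucket : buckets.values())
                for (int i = 0; i < bucket.size(); i++)
                    for (int j = i + 1; j < bucket.size(); j++)
                        pairs.add("doc" + bucket.get(i) + "/doc" + bucket.get(j));
        }
        return pairs;
    }

    public static void main(String[] args) {
        int[][] sigs = {
            signature(new HashSet<>(Arrays.asList(0, 3))),    // doc0 = {a, d}
            signature(new HashSet<>(Arrays.asList(2))),       // doc1 = {c}
            signature(new HashSet<>(Arrays.asList(1, 3, 4))), // doc2 = {b, d, e}
            signature(new HashSet<>(Arrays.asList(0, 2, 3)))  // doc3 = {a, c, d}
        };
        // 2 rows per band, 2 bands.
        System.out.println(candidatePairs(sigs, 2)); // → [doc0/doc3]
    }
}
```

With these particular coefficients, doc0 and doc3 end up with identical signatures and are the only pair sharing a bucket in any band, matching the expected candidate pair. Changing the number of hash functions changes the signature length and hence the banding behavior, which is the knob the question was looking for.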

I would love to hear any thoughts; I really wish to solve this mystery.

Thank you guys in advance!


1 Answer

  • The MinHash signature calculation for model.approxSimilarityJoin(...) happens inside model.transform(...), which is called on each input dataset before joining them: the hash signatures are computed first, and the pairwise Jaccard distance calculation is done afterwards. So the impact of changing the number of hash functions can be seen there.
  • In model.approxNearestNeighbors(...), the same impact can be seen when the model is created with minHash.fit(...) and transform(...) is called on the input dataset.
Answered 2018-02-23T06:20:07.583