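I am new to Spark ML. Spark ML has a MinHash implementation for Jaccard distance; see the documentation at https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance. In the example code there, the input data being compared comes from vectors, and I have no questions about that example. But when I use text documents as input and convert them to vectors with Word2Vec, I get a Jaccard distance of 0. I don't know what is wrong with my code; there is something I am not understanding. Thanks in advance for your help.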
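import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;

SparkSession spark = SparkSession.builder().appName("TestMinHashLSH").config("spark.master", "local").getOrCreate();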
List<Row> data1 = Arrays.asList(RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" "))));
List<Row> data2 = Arrays.asList(RowFactory.create(Arrays.asList("Hi I heard about Scala".split(" "))),
RowFactory.create(Arrays.asList("I wish python could also use case classes".split(" "))));
StructType schema4word = new StructType(new StructField[] {
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty()) });
Dataset<Row> documentDF1 = spark.createDataFrame(data1, schema4word);
// Learn a mapping from words to Vectors.
Word2Vec word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(30).setMinCount(0);
Word2VecModel w2vModel1 = word2Vec.fit(documentDF1);
Dataset<Row> result1 = w2vModel1.transform(documentDF1);
List<Row> myDataList1 = new ArrayList<>();
int id = 0;
for (Row row : result1.collectAsList()) {
List<String> text = row.getList(0);
Vector vector = (Vector) row.get(1);
myDataList1.add(RowFactory.create(id++, vector));
}
StructType schema1 = new StructType(
new StructField[] { new StructField("id", DataTypes.IntegerType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()) });
Dataset<Row> df1 = spark.createDataFrame(myDataList1, schema1);
Dataset<Row> documentDF2 = spark.createDataFrame(data2, schema4word);
Word2VecModel w2vModel2 = word2Vec.fit(documentDF2);
Dataset<Row> result2 = w2vModel2.transform(documentDF2);
List<Row> myDataList2 = new ArrayList<>();
id = 10;
for (Row row : result2.collectAsList()) {
List<String> text = row.getList(0);
Vector vector = (Vector) row.get(1);
System.out.println("Text: " + text + " => \nVector: " + vector + "\n");
myDataList2.add(RowFactory.create(id++, vector));
}
Dataset<Row> df2 = spark.createDataFrame(myDataList2, schema1);
MinHashLSH mh = new MinHashLSH().setNumHashTables(5).setInputCol("features").setOutputCol("hashes");
MinHashLSHModel model = mh.fit(df1);
// Feature Transformation
System.out.println("The hashed dataset where hashed values are stored in the column 'hashes':");
model.transform(df1).show();
// Compute the locality sensitive hashes for the input rows, then perform an
// approximate similarity join.
// We could avoid computing hashes by passing in the already-transformed
// dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
System.out.println("Approximately joining df1 and df2 on Jaccard distance smaller than 0.6:");
model.approxSimilarityJoin(df1, df2, 1.6, "JaccardDistance")
.select(col("datasetA.id").alias("id1"), col("datasetB.id").alias("id2"), col("JaccardDistance"))
.show();
// $example off$
spark.stop();
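From Word2Vec I get different vectors for different documents. When comparing two different documents, I expect some non-zero value for JaccardDistance. Instead I get all 0s. Below is the output I get when I run the program: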
Text: [Hi, I, heard, about, Scala] => Vector: [0.005808539432473481,-0.001387741044163704,0.007890049391426146,... ,04969391227]
Text: [I, wish, python, could, also, use, case, classes] => Vector: [-0.0022146602132124826,0.0032128597667906433,-0.00658524181926623,...,-3.716901264851913E-4]
Approximately joining df1 and df2 on Jaccard distance smaller than 0.6:
+---+---+---------------+
|id1|id2|JaccardDistance|
+---+---+---------------+
|  1| 11|            0.0|
|  0| 10|            0.0|
|  2| 11|            0.0|
|  0| 11|            0.0|
|  1| 10|            0.0|
|  2| 10|            0.0|
+---+---+---------------+
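
A note on how I currently read the linked documentation: it describes the MinHash input as a binary vector, where the vector indices are the set elements and a non-zero value marks an element as present. If that reading is right, only the positions of the non-zero entries matter, not their values. The plain-Java sketch below illustrates this without Spark. The class and method names (`JaccardSketch`, `nonZeroIndices`, `jaccardDistance`) are hypothetical, and the short arrays are made-up stand-ins for the 30-dimensional Word2Vec vectors; this is not Spark's implementation.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardSketch {
    // Treat a vector as a set: the set elements are the indices of its non-zero entries.
    static Set<Integer> nonZeroIndices(double[] v) {
        Set<Integer> indices = new HashSet<>();
        for (int i = 0; i < v.length; i++) {
            if (v[i] != 0.0) {
                indices.add(i);
            }
        }
        return indices;
    }

    // Jaccard distance = 1 - |A ∩ B| / |A ∪ B| over the non-zero index sets.
    static double jaccardDistance(double[] a, double[] b) {
        Set<Integer> setA = nonZeroIndices(a);
        Set<Integer> setB = nonZeroIndices(b);
        Set<Integer> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<Integer> union = new HashSet<>(setA);
        union.addAll(setB);
        if (union.isEmpty()) {
            return 0.0; // assumption of this sketch: two empty sets count as identical
        }
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Dense vectors (every entry non-zero), like averaged word embeddings:
        // both index sets are {0, 1, 2}, so the distance is 0 whatever the values are.
        double[] dense1 = {0.0058, -0.0013, 0.0078};
        double[] dense2 = {-0.0022, 0.0032, -0.0065};
        System.out.println(jaccardDistance(dense1, dense2)); // prints 0.0

        // Binary bag-of-words style vectors: index sets {0, 1, 4} and {0, 3, 4}.
        double[] sparse1 = {1, 1, 0, 0, 1};
        double[] sparse2 = {1, 0, 0, 1, 1};
        System.out.println(jaccardDistance(sparse1, sparse2)); // prints 0.5
    }
}
```

Under this reading, any two dense vectors of the same size would share the same index set, which would be consistent with the all-zero JaccardDistance column above. A representation whose non-zero positions differ between documents (for example term-count vectors) would behave differently, though I have not verified that against my data.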