I want to build a distance matrix from the values in a PySpark DataFrame. What I currently have is:
+----+-------------+
| id | list        |
+----+-------------+
| 1  | [a, b, ...] |
+----+-------------+
| 2  | [c, d, ...] |
+----+-------------+
| 3  | [e, f, ...] |
+----+-------------+
I want to use my own distance function and do something like this:
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        dist = calculate_distance(features[i], features[j])
        add_row_to_distance_df([ids[i], ids[j], dist])
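To make the intent concrete, here is a small local (non-Spark) sketch of that loop. The sample `ids` and `features` values and the Euclidean `calculate_distance` are placeholders chosen only for illustration; the real distance function would go in its place.

```python
from itertools import combinations
import math

# Hypothetical sample data standing in for the "id" and "list" columns.
ids = [1, 2, 3]
features = [[1.0, 2.0], [4.0, 6.0], [1.0, 3.0]]


def calculate_distance(u, v):
    """Placeholder distance (Euclidean); swap in the custom function."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# combinations(range(n), 2) yields each index pair (i, j) with i < j exactly
# once, which matches the nested loop over i and j = i + 1 .. n - 1.
distance_rows = [
    (ids[i], ids[j], calculate_distance(features[i], features[j]))
    for i, j in combinations(range(len(ids)), 2)
]

for row in distance_rows:
    print(row)
```

This only works when all rows fit in driver memory, which is why I am looking for a DataFrame-based equivalent.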
Edit: the expected output is:
+-----+-----+-----------------------------+
| id1 | id2 | dist                        |
+-----+-----+-----------------------------+
| 1   | 2   | d([a, b, ...], [c, d, ...]) |
+-----+-----+-----------------------------+
| 1   | 3   | d([a, b, ...], [e, f, ...]) |
+-----+-----+-----------------------------+
| 2   | 3   | d([c, d, ...], [e, f, ...]) |
+-----+-----+-----------------------------+
How can I do this?