string - Clustering string data with ELKI

Question

I need to cluster a large number of strings using ELKI based on the Edit Distance / Levenshtein Distance. Since the data set is too large, I'd like to avoid file based precomputed distance matrices. How can I

(a) load string data in ELKI from a file (only "Labels")?

(b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?)

Some code snippets or example input files would be helpful.

score 1 · Accepted Answer

It's actually pretty straightforward:

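A) Write a Parser that is sufficient for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably by subclassing AbstractStreamingParser, that produces a relation of the desired data type. You can probably just use String. If you want to be a bit more general, TokenSequence might be a more appropriate concept for these distances; strings are just the simplest case.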
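As a rough illustration of the parsing step, here is a minimal sketch in plain Java that turns a text source into one string object per line. The class name `StringLineReader` and the rule of skipping blank lines are assumptions made for this example; the ELKI-specific plumbing of an AbstractStreamingParser subclass is deliberately left out, since it depends on the ELKI version in use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of ELKI): one string object per input line. */
final class StringLineReader {
  private StringLineReader() {
  }

  /**
   * Reads all lines from the given reader, trims them, and skips blank lines.
   * Skipping blank lines is an assumption of this sketch, not an ELKI rule.
   */
  static List<String> readAll(Reader in) {
    List<String> result = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String trimmed = line.trim();
        if (!trimmed.isEmpty()) {
          result.add(trimmed);
        }
      }
    } catch (IOException e) {
      // Rethrow unchecked so callers do not need a throws clause.
      throw new UncheckedIOException(e);
    }
    return result;
  }
}
```

Passing a `java.io.FileReader` for the input file would give the list of strings that a custom parser would then expose as a relation of type String.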

B) Implement a DistanceFunction based on this data type rather than on DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction is probably the easiest thing to do.
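The core of such a distance function is the Levenshtein computation itself. The following is a minimal sketch in plain Java using the standard two-row dynamic programming scheme; the class name `LevenshteinSketch` is made up for this example. In ELKI, this logic would presumably go into the distance method of the AbstractPrimitiveDistanceFunction subclass, whose exact signature depends on the ELKI version and is therefore not shown.

```java
/** Hypothetical standalone sketch of the Levenshtein distance (not ELKI code). */
final class LevenshteinSketch {
  private LevenshteinSketch() {
  }

  /** Minimum number of single-character insertions, deletions and substitutions. */
  static int distance(String a, String b) {
    // Keep the shorter string in b so the two rows use as little memory as possible.
    if (a.length() < b.length()) {
      String tmp = a;
      a = b;
      b = tmp;
    }
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    // Distance from the empty prefix of a to each prefix of b.
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // deleting the first i characters of a
      for (int j = 1; j <= b.length(); j++) {
        int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        int insertion = curr[j - 1] + 1;
        int deletion = prev[j] + 1;
        curr[j] = Math.min(substitution, Math.min(insertion, deletion));
      }
      // Swap rows: the current row becomes the previous row of the next iteration.
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }
}
```

For example, `LevenshteinSketch.distance("kitten", "sitting")` evaluates to 3. Because the distance is computed on demand from the two strings, no precomputed distance matrix is needed.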

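For performance reasons, you may also want to look into indexing algorithms to efficiently retrieve, e.g., the k most similar strings. I'm not sure which index structures exist for string edit distances and Levenshtein distance.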

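A colleague has a student who apparently has a working token edit distance, but I haven't seen or reviewed the code yet. As he is processing log files, he is probably using a token-based approach instead of a character-based one.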
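To make the token-based idea concrete, here is a minimal sketch of an edit distance over token sequences instead of characters. It is not the code mentioned above, which has not been published here; the class name `TokenEditDistanceSketch` and the whitespace tokenizer are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: edit distance where the unit of editing is a whole token. */
final class TokenEditDistanceSketch {
  private TokenEditDistanceSketch() {
  }

  /** Splits a line on runs of whitespace (an assumption; log formats may need more). */
  static List<String> tokenize(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty()) {
      return Collections.<String>emptyList();
    }
    return Arrays.asList(trimmed.split("\\s+"));
  }

  /** Minimum number of token insertions, deletions and substitutions. */
  static int distance(List<String> a, List<String> b) {
    int[] prev = new int[b.size() + 1];
    int[] curr = new int[b.size() + 1];
    for (int j = 0; j <= b.size(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.size(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.size(); j++) {
        // Tokens are compared as whole strings, not character by character.
        int substitution = prev[j - 1] + (a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1);
        curr[j] = Math.min(substitution, Math.min(curr[j - 1], prev[j]) + 1);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.size()];
  }
}
```

With this sketch, the log-like lines "GET /index.html 200" and "GET /about.html 200" differ by one token substitution, so their distance is 1, whereas a character-level distance would count every differing character in the path.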
