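I want to cluster tweets by latitude and longitude, and I am using the OPTICS algorithm (a Java implementation) because it seems to be the best choice for density-based clustering. The algorithm takes an input file containing the points to consider; each line of the file is a vector. My dataset contains the latitude and longitude of each tweet. Can I use latitude and longitude directly to extract clusters, or do I need to convert them into some other form before clustering with OPTICS?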

Thanks in advance.

Sample input file:

37.3456227 -121.8847222
37.3904943 -121.8854337
37.2589827 -121.8847222
37.3558627 -121.8505679
37.3189149 -121.9416226
37.3052272 -121.9871217
37.3716914 -121.8619539
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002
37.2876164 -121.9857002

OPTICS algorithm code snippet:

/**
     * Run the OPTICS algorithm
     * 
     * @param inputFile
     *            an input file path containing a list of vectors of double
     *            values
     * @param minPts
     *            the minimum number of points (see DBScan article)
     * @param epsilon
     *            the epsilon distance (see DBScan article)
     * @param separator
     *            the string that is used to separate double values on each line
     *            of the input file (default: single space)
     * @return a list of clusters (some of them may be empty)
     * @throws IOException
     *             if an error occurs while reading the input file
     */
    public List<DoubleArrayOPTICS> computerClusterOrdering(String inputFile,
            int minPts, double epsilon, String separator)
            throws NumberFormatException, IOException {

        // record the start time
        timeExtractClusterOrdering = 0;
        long startTimestampClusterOrdering = System.currentTimeMillis();

        // Structure to store the vectors from the file
        List<DoubleArray> points = new ArrayList<DoubleArray>();

        // read the vectors from the input file
        BufferedReader reader = new BufferedReader(new FileReader(inputFile));
        String line;
        // for each line until the end of the file
        while (((line = reader.readLine()) != null)) {
            // trim first so that whitespace-only lines are skipped as well
            line = line.trim();
            // if the line is a comment, is empty or is a
            // kind of metadata
            if (line.isEmpty() || line.charAt(0) == '#'
                    || line.charAt(0) == '%' || line.charAt(0) == '@') {
                continue;
            }
            // split the line by the separator
            String[] lineSplited = line.split(separator);
            // create a vector of double
            double[] vector = new double[lineSplited.length];
            // for each value of the current line
            for (int i = 0; i < lineSplited.length; i++) {
                // convert to double
                double value = Double.parseDouble(lineSplited[i]);
                // add the value to the current vector
                vector[i] = value;
            }
            // add the vector to the list of vectors
            points.add(new DoubleArrayOPTICS(vector));
        }
        // close the file
        reader.close();

        // build kd-tree
        kdtree = new KDTree();
        kdtree.buildtree(points);

        // For debugging, you can print the KD-Tree by uncommenting the
        // following line:
        // System.out.println(kdtree.toString());

        // Variable to store the order of points generated by OPTICS
        clusterOrdering = new ArrayList<DoubleArrayOPTICS>();

        // For each point in the dataset
        for (DoubleArray point : points) {
            // if the node is already visited, we skip it
            DoubleArrayOPTICS pointDBS = (DoubleArrayOPTICS) point;
            if (pointDBS.visited == false) {
                // process this point
                expandClusterOrder(pointDBS, clusterOrdering, epsilon, minPts);
            }
        }

        // check memory usage
        MemoryLogger.getInstance().checkMemory();

        // record end time
        timeExtractClusterOrdering = System.currentTimeMillis()
                - startTimestampClusterOrdering;

        kdtree = null;

        // return the cluster ordering
        return clusterOrdering;
    }

1 Answer

The standard implementation of OPTICS in the ELKI framework works very well with great-circle distance. So yes, this is possible.
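
To make "great-circle distance" concrete for coordinates like the ones in the question, here is a minimal Java sketch of the haversine formula. It is not ELKI's code and not part of the implementation shown in the question: the class name `GreatCircle`, the method `haversineMeters`, and the spherical-Earth radius constant are assumptions chosen purely for illustration.

```java
/** Hypothetical helper for illustration; not part of ELKI or of the code in the question. */
final class GreatCircle {

    // Mean Earth radius in meters; assumes a spherical Earth (an approximation).
    static final double EARTH_RADIUS_M = 6_371_008.8;

    private GreatCircle() {
        // static utility class, no instances
    }

    /**
     * Great-circle distance between two points given in decimal degrees,
     * computed with the haversine formula.
     *
     * @return the distance in meters
     */
    static double haversineMeters(double lat1, double lon1,
                                  double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);

        double sinHalfPhi = Math.sin(dPhi / 2);
        double sinHalfLambda = Math.sin(dLambda / 2);

        // a is the squared half-chord length between the two points
        double a = sinHalfPhi * sinHalfPhi
                + Math.cos(phi1) * Math.cos(phi2) * sinHalfLambda * sinHalfLambda;
        // guard against rounding pushing a slightly above 1
        a = Math.min(1.0, a);

        // c is the angular distance in radians
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_M * c;
    }
}
```

With a distance like this, epsilon is expressed in meters rather than in degrees. Plain Euclidean distance on raw latitude/longitude treats one degree of longitude as equal to one degree of latitude, which at about 37° N overstates east-west distances by roughly a quarter.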

For details, see for example this answer:

How to index with ELKI - OPTICS clustering

That implementation also supports indexing and is very fast.

Answered 2015-12-08T07:40:29.827