2

你们中有人知道是否可以在不使用 VBA 的情况下在 Excel 中实现聚类算法,如 k-means、dbscan?

如果可能的话,你能给我一点帮助吗?(一个小例子会帮助我)

非常感谢提前

克劳德

4

2 回答 2

1

你为什么会?Excel 不适用于此。

聚类算法通常受益于使用索引结构,以一种智能的方式组织内存中的数据。例如 R*-trees、kd-tree 等。

这有很大的不同。没有索引的 DBSCAN 是复杂的O(n^2),而有索引结构的只是O(n log n)复杂的。

我猜你可能可以在 VBA(它不再是真正的 Excel,而是 Visual Basic)中做到这一点。但是将现有代码用于 R*-trees 等更有意义。

于 2012-09-18T17:03:21.807 回答
1

检查这个

k 均值算法

k-Means 算法是以下步骤的迭代,直到达到稳定性,即单个记录的集群分配不再变化。

确定质心的坐标。(最初,质心是随机的、唯一的点,然后将集群成员的平均坐标分配给质心)。确定每个记录到每个质心的欧几里得距离。用最近的质心对记录进行分组。编码

首先,我创建了一个私有类型来表示我们的记录和质心,并创建了两个类级别的数组来保存它们以及一个类级别的变量来保存正在执行分析的表。

Private Type Records
    Dimension() As Double
    Distance() As Double
    Cluster As Integer
End Type

Dim Table As Range
Dim Record() As Records
Dim Centroid() As Records
User Interface

下面的方法,Run()可以作为起点,挂接到按钮等。

Sub Run()
'Run k-Means
   If Not kMeansSelection Then
        Call MsgBox("Error: " & Err.Description, vbExclamation, "kMeans Error")
    End If
End Sub

接下来,创建一种方法,提示用户选择要分析的表并输入数据应分组到的所需集群数量。该函数不需要任何参数并返回一个布尔值,指示是否遇到任何错误。

Function kMeansSelection() As Boolean

'Get user table selection
   On Error Resume Next
    Set Table = Application.InputBox(Prompt:= _
                                     "Please select the range to analyse.", _
                                     title:="Specify Range", Type:=8)

    If Table Is Nothing Then Exit Function        'Cancelled

    'Check table dimensions
   If Table.Rows.Count < 4 Or Table.columns.Count < 2 Then
        Err.Raise Number:=vbObjectError + 1000, Source:="k-Means Cluster Analysis", Description:="Table has insufficent rows or columns."
    End If

    'Get number of clusters
   Dim numClusters As Integer
    numClusters = Application.InputBox("Specify Number of Clusters", "k Means Cluster Analysis", Type:=1)

    If Not numClusters > 0 Or numClusters = False Then
        Exit Function        'Cancelled
   End If
    If Err.Number = 0 Then
        If kMeans(Table, numClusters) Then
            outputClusters
        End If
    End If

kMeansSelection_Error:
    kMeansSelection = (Err.Number = 0)
End Function

如果选择了一个表并且适当地定义了多个集群,则调用 kMeans (Table, numClusters) 方法,并将表和集群数量作为参数。

如果该kMeans (Table, numClusters)方法执行没有错误,则会调用最终方法 outputClusters(),该方法会在活动工作簿中创建一个新工作表并输出分析结果。

将记录分配给集群

这是对记录进行实际分析并进行集群分配的地方。首先,该方法是用 Function 声明的kMeans(Table As Range, Clusters As Integer) As Boolean。该函数有两个参数,表被分析为 Excel Range 对象和 Clusters,一个整数表示要创建的集群的数量。

Function kMeans(Table As Range, Clusters As Integer) As Boolean
'Table - Range of data to group. Records (Rows) are grouped according to attributes/dimensions(columns)
'Clusters - Number of clusters to reduce records into.

    On Error Resume Next

    'Script Performance Variables
   Dim PassCounter As Integer

    'Initialize Data Arrays
   ReDim Record(2 To Table.Rows.Count)
    Dim r As Integer        'record
   Dim d As Integer        'dimension index
   Dim d2 As Integer        'dimension index
   Dim c As Integer        'centroid index
   Dim c2 As Integer        'centroid index
   Dim di As Integer        'distance

    Dim x As Double        'Variable Distance Placeholder
   Dim y As Double        'Variable Distance Placeholder

On error Resume Next用于将错误传递给调用方法,并声明了许多数组索引变量。声明 x 和 y 以供以后在数学运算中使用。

第一步是将Record()数组的大小调整为表中的行数。(2 到 Table.Rows.Count)用于假设(并且要求)表的第一行包含列标题。

然后,对于数组中的每条记录Record(),Record 类型的Dimension()数组大小调整为列数(再次假设第一列包含行标题),Distance()数组大小调整为簇数。然后内部循环将行中列的值分配给Dimension()数组。

For r = LBound(Record) To UBound(Record) '初始化维度值数组 ReDim Record(r).Dimension(2 To Table.columns.Count) '初始化距离数组 ReDim Record(r).Distance(1 To Clusters) For d = LBound(Record(r).Dimension) To UBound(Record(r).Dimension) Record(r).Dimension(d) = Table.Rows(r).Cells(d).Value Next d Next r

以同样的方式,必须初始化初始质心。我已将前几条记录的坐标指定为初始质心坐标,检查每个新质心是否具有唯一坐标。如果不是,则脚本简单地移动到下一条记录,直到为质心找到一组唯一的坐标。

欧几里得距离这里用于计算质心唯一性的方法与后面用于计算单个记录与质心之间距离的方法几乎完全相同。在这里,通过测量它们的尺寸与 0 的距离来检查质心的唯一性。

    'Initialize Initial Centroid Arrays
   ReDim Centroid(1 To Clusters)
    Dim uniqueCentroid As Boolean

    For c = LBound(Centroid) To UBound(Centroid)
        'Initialize Centroid Dimension Depth
       ReDim Centroid(c).Dimension(2 To Table.columns.Count)

        'Initialize record index to next record
       r = LBound(Record) + c - 2

        Do        ' Loop to ensure new centroid is unique
           r = r + 1        'Increment record index throughout loop to find unique record to use as a centroid

            'Assign record dimensions to centroid
           For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension)
                Centroid(c).Dimension(d) = Record(r).Dimension(d)
            Next d

            uniqueCentroid = True

            For c2 = LBound(Centroid) To c - 1

                'Loop Through Record Dimensions and check if all are the same
               x = 0
                y = 0
                For d2 = LBound(Centroid(c).Dimension) To _
                    UBound(Centroid(c).Dimension)
                    x = x + Centroid(c).Dimension(d2) ^ 2
                    y = y + Centroid(c2).Dimension(d2) ^ 2
                Next d2

                uniqueCentroid = Not Sqr(x) = Sqr(y)
                If Not uniqueCentroid Then Exit For
            Next c2

        Loop Until uniqueCentroid

    Next c
The next step is to calculate each records distance from each centroid and assign the record to the nearest cluster.

Dim lowestDistance As Double– 最低距离变量保存迄今为止在记录和质心之间测量的最短距离,以便根据后续测量进行评估。 Dim lastCluster As Integer– lastCluster 变量保存在进行任何新分配之前分配记录的集群,并用于评估是否已实现稳定性。 Dim ClustersStable As Boolean– 重复聚类分配和质心重新计算阶段,直到ClustersStable = true.

Dim minimumDistance As Double Dim lastCluster As Integer Dim ClustersStable As Boolean

Do 'While Clusters are not Stable

PassCounter = PassCounter + 1
ClustersStable = True        'Until Proved otherwise

'Loop Through Records

For r = LBound(Record) To UBound(Record)

    lastCluster = Record(r).Cluster
    lowestDistance = 0        'Reset lowest distance

    'Loop through record distances to centroids
   For c = LBound(Centroid) To UBound(Centroid)

        '======================================================
       '           Calculate Euclidean Distance
       '======================================================
       ' d(p,q) = Sqr((q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2)
       '------------------------------------------------------
       ' X = (q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2
       ' d(p,q) = X

        x = 0
        y = 0
        'Loop Through Record Dimensions
       For d = LBound(Record(r).Dimension) To _
            UBound(Record(r).Dimension)
            y = Record(r).Dimension(d) - Centroid(c).Dimension(d)
            y = y ^ 2
            x = x + y
        Next d

        x = Sqr(x)        'Get square root

        'If distance to centroid is lowest (or first pass) assign record to centroid cluster.
       If c = LBound(Centroid) Or x < lowestDistance Then
            lowestDistance = x
            'Assign distance to centroid to record
           Record(r).Distance(c) = lowestDistance
            'Assign record to centroid
           Record(r).Cluster = c
        End If
    Next c

    'Only change if true
   If ClustersStable Then ClustersStable = Record(r).Cluster = lastCluster

Next r

一旦将每个记录分配给一个集群,集群的质心就会重新定位到集群的平均坐标。在质心移动后,重新评估最接近质心的每个记录并迭代该过程直到达到稳定性(即集群分配不再改变)。

'Move Centroids to calculated cluster average
       For c = LBound(Centroid) To UBound(Centroid)        'For every cluster

            'Loop through cluster dimensions
           For d = LBound(Centroid(c).Dimension) To _
                UBound(Centroid(c).Dimension)

                Centroid(c).Cluster = 0        'Reset nunber of records in cluster
               Centroid(c).Dimension(d) = 0        'Reset centroid dimensions

                'Loop Through Records
               For r = LBound(Record) To UBound(Record)

                    'If Record is in Cluster then
                   If Record(r).Cluster = c Then
                        'Use to calculate avg dimension for records in cluster

                        'Add to number of records in cluster
                       Centroid(c).Cluster = Centroid(c).Cluster + 1
                        'Add record dimension to cluster dimension for later division
                       Centroid(c).Dimension(d) = Centroid(c).Dimension(d) + _
                                                   Record(r).Dimension(d)

                    End If

                Next r

                'Assign Average Dimension Distance
               Centroid(c).Dimension(d) = Centroid(c).Dimension(d) / _
                                           Centroid(c).Cluster
            Next d
        Next c

    Loop Until ClustersStable

    kMeans = (Err.Number = 0)
End Function

显示结果

outputClusters()方法在两个表中输出结果。第一个表包含每个记录名称和分配的簇号,第二个表包含质心坐标。

Function outputClusters() As Boolean

    Dim c As Integer        'Centroid Index
   Dim r As Integer        'Row Index
   Dim d As Integer        'Dimension Index

    Dim oSheet As Worksheet
    On Error Resume Next

    Set oSheet = addWorksheet("Cluster Analysis", ActiveWorkbook)

    'Loop Through Records
   Dim rowNumber As Integer
    rowNumber = 1

    'Output Headings
   With oSheet.Rows(rowNumber)
        With .Cells(1)
            .Value = "Row Title"
            .Font.Bold = True
            .HorizontalAlignment = xlCenter
        End With
        With .Cells(2)
            .Value = "Centroid"
            .Font.Bold = True
            .HorizontalAlignment = xlCenter
        End With
    End With

    'Print by Row
   rowNumber = rowNumber + 1        'Blank Row
   For r = LBound(Record) To UBound(Record)
        oSheet.Rows(rowNumber).Cells(1).Value = Table.Rows(r).Cells(1).Value
        oSheet.Rows(rowNumber).Cells(2).Value = Record(r).Cluster
        rowNumber = rowNumber + 1
    Next r

    'Print Centroids - Headings
   rowNumber = rowNumber + 1
    For d = LBound(Centroid(LBound(Centroid)).Dimension) To UBound(Centroid(LBound(Centroid)).Dimension)
        With oSheet.Rows(rowNumber).Cells(d)
            .Value = Table.Rows(1).Cells(d).Value
            .Font.Bold = True
            .HorizontalAlignment = xlCenter
        End With
    Next d

    'Print Centroids
   rowNumber = rowNumber + 1
    For c = LBound(Centroid) To UBound(Centroid)
        With oSheet.Rows(rowNumber).Cells(1)
            .Value = "Centroid " & c
            .Font.Bold = True
        End With
        'Loop through cluster dimensions
       For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension)
            oSheet.Rows(rowNumber).Cells(d).Value = Centroid(c).Dimension(d)
        Next d
        rowNumber = rowNumber + 1
    Next c

    oSheet.columns.AutoFit        '//AutoFit columns to contents

outputClusters_Error:
    outputClusters = (Err.Number = 0)
End Function

这种类型的输出不太可能有很大用处,但它用于演示在您自己的解决方案中可以访问记录集群分配或集群记录的方式。

outputClusters()函数使用另一个自定义方法:addWorksheet(),它将工作表添加到具有指定名称的指定/活动工作簿。如果已存在同名的工作表,则 outputClusters() 函数会添加/递增附加到工作表名称的数字。该WorksheetExists()功能还包括在以下内容中:

Function addWorksheet(Name As String, Optional Workbook As Workbook) As Worksheet
    On Error Resume Next
    '// If a Workbook wasn't specified, use the active workbook
   If Workbook Is Nothing Then Set Workbook = ActiveWorkbook

    Dim Num As Integer
    '// If a worksheet(s) exist with the same name, add/increment a number after the name
   While WorksheetExists(Name, Workbook)
        Num = Num + 1
        If InStr(Name, " (") > 0 Then Name = Left(Name, InStr(Name, " ("))
        Name = Name & " (" & Num & ")"
    Wend

    '//Add a sheet to the workbook
   Set addWorksheet = Workbook.Worksheets.Add

    '//Name the sheet
   addWorksheet.Name = Name
End Function

Public Function WorksheetExists(WorkSheetName As String, Workbook As Workbook) As Boolean
    On Error Resume Next
    WorksheetExists = (Workbook.Sheets(WorkSheetName).Name <> "")
    On Error GoTo 0
End Function
于 2017-02-25T17:06:33.990 回答