11

I am looking to implement a VBA trie-building algorithm that is able to process a substantial English lexicon (~50,000 words) in a relatively short amount of time (less than 15-20 seconds). Since I am a C++ programmer by practice (and this is my first time doing any substantial VBA work), I built a quick proof-of-concept program that was able to complete the task on my computer in about half a second. When it came time to test the VBA port however, it took almost two minutes to do the same -- an unacceptably long amount of time for my purposes. The VBA code is below:

Node Class Module:

Public letter As String
Public next_nodes As New Collection
Public is_word As Boolean

Main Module:

Dim tree As Node

Sub build_trie()
    Set tree = New Node
    Dim file, a, b, c As Integer
    Dim current As Node
    Dim wordlist As Collection
    Set wordlist = New Collection
    file = FreeFile
    Open "C:\corncob_caps.txt" For Input As file
    Do While Not EOF(file)
        Dim line As String
        Line Input #file, line
        wordlist.add line
    Loop
    For a = 1 To wordlist.Count
        Set current = tree
        For b = 1 To Len(wordlist.Item(a))
            Dim match As Boolean
            match = False
            Dim char As String
            char = Mid(wordlist.Item(a), b, 1)
            For c = 1 To current.next_nodes.Count
                If char = current.next_nodes.Item(c).letter Then
                    Set current = current.next_nodes.Item(c)
                    match = True
                    Exit For
                End If
            Next c
            If Not match Then
                Dim new_node As Node
                Set new_node = New Node
                new_node.letter = char
                current.next_nodes.add new_node
                Set current = new_node
            End If
        Next b
        current.is_word = True
    Next a
End Sub

My question then is simply, can this algorithm be sped up? I saw from some sources that VBA Collections are not as efficient as Dictionarys and so I attempted a Dictionary-based implementation instead but it took an equal amount of time to complete with even worse memory usage (500+ MB of RAM used by Excel on my computer). As I say I am extremely new to VBA so my knowledge of both its syntax as well as its overall features/limitations is very limited -- which is why I don't believe that this algorithm is as efficient as it could possibly be; any tips/suggestions would be greatly appreciated.

Thanks in advance

NB: The lexicon file referred to by the code, "corncob_caps.txt", is available here (download the "all CAPS" file)

4

3 回答 3

17

这里有许多小问题和一些更大的机会。你确实说过这是你的第一个 vba 作品,所以如果我告诉你你已经知道的事情,请原谅我

小事第一:
Dim file, a, b, c As Integer将文件、a 和 b 声明为变体。 Integer是 16 位符号,因此可能存在溢出风险,请Long改用。

DIM'ing inside 循环会适得其反:与 C++ 不同,它们不是循环范围的。

真正的机会是:

For Each在可以迭代的地方使用集合:它比索引更快。

在我的硬件上,您的原始代码运行了大约 160 秒。这段代码大约需要 2.5 秒(加上将 word 文件加载到集合中的时间,大约 4 秒)

Sub build_trie()
    Dim t1 As Long
    Dim wd As Variant
    Dim nd As Node

    Set tree = New Node
    ' Dim file, a, b, c As Integer  : declares file, a, b as variant
    Dim file As Integer, a As Long, b As Long, c As Long     ' Integer is 16 bit signed

    Dim current As Node
    Dim wordlist As Collection
    Set wordlist = New Collection
    file = FreeFile
    Open "C:\corncob_caps.txt" For Input As file

    ' no point in doing inside loop, they are not scoped to the loop
    Dim line As String
    Dim match As Boolean
    Dim char As String
    Dim new_node As Node

    Do While Not EOF(file)
        'Dim line As String
        Line Input #file, line
        wordlist.Add line
    Loop


    t1 = GetTickCount
    For Each wd In wordlist ' for each is faster
    'For a = 1 To wordlist.Count
        Set current = tree
        For b = 1 To Len(wd)
            'Dim match As Boolean
            match = False
            'Dim char As String
            char = Mid$(wd, b, 1)
            For Each nd In current.next_nodes
            'For c = 1 To current.next_nodes.Count
                If char = nd.letter Then
                'If char = current.next_nodes.Item(c).letter Then
                    Set current = nd
                    'Set current = current.next_nodes.Item(c)
                    match = True
                    Exit For
                End If
            Next nd
            If Not match Then
                'Dim new_node As Node
                Set new_node = New Node
                new_node.letter = char
                current.next_nodes.Add new_node
                Set current = new_node
            End If
        Next b
        current.is_word = True
    Next wd

    Debug.Print "Time = " & GetTickCount - t1 & " ms"
End Sub

编辑:

将单词列表加载到动态数组中会将加载时间减少到亚秒。请注意 Redim Preserve 价格昂贵,因此请分块进行

    Dim i As Long, sz As Long
    sz = 10000
    Dim wordlist() As String
    ReDim wordlist(0 To sz)

    file = FreeFile
    Open "C:\corncob_caps.txt" For Input As file

    i = 0
    Do While Not EOF(file)
        'Dim line As String
        Line Input #file, line
        wordlist(i) = line
        i = i + 1
        If i > sz Then
            sz = sz + 10000
            ReDim Preserve wordlist(0 To sz)
        End If
        'wordlist.Add line
    Loop
    ReDim Preserve wordlist(0 To i - 1)

然后像循环一样

    For i = 0 To UBound(wordlist)
        wd = wordlist(i)
于 2011-10-07T22:54:22.503 回答
3

我没有使用 VBA 练习,但是 IIRC,使用 For Each 迭代 Collection 应该比使用数字快一点:

Dim i As Variant
For Each i In current.next_nodes
    If i.letter = char Then
        Set current = i
        match = True
        Exit For
    End If
Next node

您也没有使用 Collection 的全部功能。它是一个键值映射,而不仅仅是一个可调整大小的数组。如果将字母用作键,则可能会获得更好的性能,尽管查找不存在的键会引发错误,因此您必须使用丑陋的错误解决方法来检查每个节点。b 循环的内部如下所示:

Dim char As String
char = Mid(wordlist.Item(a), b, 1)
Dim node As Node
On Error Resume Next
Set node = Nothing
Set node = current.next_nodes.Item(char)
On Error Goto 0
If node Is Nothing Then
    Set node = New Node
    current.next_nodes.add node, char
Endif
Set current = node

这样你就不需要类 Node 上的 letter 变量了。

我没有测试这个。我希望一切都好...

编辑:修复了 For Each 循环。


您可以做的另一件事可能会更慢但使用更少的内存是使用数组而不是集合,并根据每个添加的元素调整大小。数组不能在类上公开,所以你必须在类中添加方法来处理它:

Public letter As String
Private next_nodes() As Node
Public is_word As Boolean

Public Sub addNode(new_node As Node)
    Dim current_size As Integer
    On Error Resume Next
    current_size = UBound(next_nodes) 'ubound throws an error if the array is not yet allocated
    On Error GoTo 0
    ReDim next_nodes(0 To current_size) As Node
    Set next_nodes(current_size) = new_node
End Sub

Public Function getNode(letter As String) As Node
    Dim n As Variant
    On Error Resume Next
    For Each n In next_nodes
        If n.letter = letter Then
            Set getNode = n
            Exit Function
        End If
    Next
End Function

编辑:最后的优化策略是使用 Asc 函数获取 Integer char 值并将其存储而不是 String。

于 2011-10-07T21:36:11.827 回答
-1

您确实需要对其进行分析,但是如果您认为 Collections 很慢,也许您可​​以尝试使用动态数组

于 2011-10-07T21:37:40.047 回答