I built a web scraper in Excel VBA that does the following:

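  1. Reads one link at a time from a list of links in a sheet named "CIK_Links".
  2. Goes to the link and reads its response text. If the response text contains a hyperlink whose innerHTML includes "(List all Funds and Classes/Contracts for", it saves that link to a variable and creates another MSXML2.ServerXMLhttp.6.0 object.
  3. Once that object is created, it finds the third table in the response text, loops through it to find specific elements of that table, and writes those values to a sheet named "Parsed_Tables" in Excel.
  4. The code then moves to the next link on the "CIK_Links" sheet and repeats steps 1-3. Note: the sheet holds about 640,000 links, but I only run a few thousand at a time. And yes, I have tried running 10, 20, or 100 at a time, and the problem persists.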
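
For comparison, step 2 (finding the "List all Funds" link) could be sketched in Python using only the standard library. This is a rough sketch, not a drop-in replacement: the names TrustLinkFinder and find_trust_link are made up for illustration, and it assumes the href in the page is relative to https://www.sec.gov/, which is what the "about:/" replacement in the VBA suggests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Search text taken from the VBA code; BASE_URL is an assumption based on
# the "about:/" -> "https://www.sec.gov/" replacement done there.
SEARCH_TERM = "(List all Funds and Classes/Contracts for"
BASE_URL = "https://www.sec.gov/"


class TrustLinkFinder(HTMLParser):
    """Remembers the first <a href=...> whose text contains SEARCH_TERM."""

    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>
        self.found = None   # absolute URL of the first matching link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self.found is None and SEARCH_TERM in "".join(self._text):
                # Resolve a relative href against the site root
                self.found = urljoin(BASE_URL, self._href)
            self._href = None


def find_trust_link(page_html):
    """Return the absolute trust-table URL, or None if the page has none."""
    finder = TrustLinkFinder()
    finder.feed(page_html)
    finder.close()
    return finder.found
```

Calling find_trust_link on the response text of a page returns None when no trust table link exists, so the second request can simply be skipped for that page.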

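The problem is that as soon as I hit run I get the message "Excel is not responding", yet the code keeps running in the background. The code works and is quite fast considering what I am asking of it, but I clearly need to optimize it further to keep it from overloading Excel. Finding a way to avoid writing the parsed HTML to Excel on every iteration would help, but I don't know how to write the data in the format I need without doing so. An array solution would be great, but the data in the array would need a lot of manipulation, and possibly subsetting/slicing, before it is written to Excel. I need help because I have exhausted what I know, and I did a lot of research while building this application. I am even open to other technologies such as Python and the BeautifulSoup library; I just don't know how to output the table data to a CSV file in the format I need. Thanks in advance!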
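
On the CSV side, one possible shape is to collect everything in memory and write once at the end. The sketch below is only an illustration under stated assumptions: it takes each table row as a list of cell strings, reuses the same "C00" / "S00" / "000" markers and cell positions as the VBA, and assumes the desired format is one flat record per share class with the CIK and series values filled down. The function names and the header labels are invented for the example.

```python
import csv
import io

# Column labels are assumptions; they follow columns A-G of "Parsed_Tables".
HEADER = ["CIK", "Trust", "Series", "Fund", "Class", "Class name", "Ticker"]


def flatten_rows(table_rows):
    """Turn staggered trust-table rows into one flat record per share class."""
    cik = trust = series = fund = ""
    records = []
    for cells in table_rows:
        cells = [c.strip() for c in cells]
        joined = " ".join(cells)
        if "C00" in joined and len(cells) > 3:
            # Class row: id and name at indexes 2 and 3, optional ticker at 4
            ticker = cells[4] if len(cells) > 4 else ""
            records.append([cik, trust, series, fund, cells[2], cells[3], ticker])
        elif "S00" in joined and len(cells) > 2:
            # Series row: remember it for the class rows that follow
            series, fund = cells[1], cells[2]
        elif len(cells) > 1 and "000" in cells[0]:
            # CIK row: remember the trust for everything below it
            cik, trust = cells[0], cells[1]
    return records


def write_csv(records, out):
    """Write the header and all records to an open text stream in one pass."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    writer.writerows(records)


# Made-up sample rows, only to show the shape of the data
sample = [
    ["0000001234", "Example Trust", "", "", ""],
    ["", "S000000001", "Example Fund", "", ""],
    ["", "", "C000000001", "Class A", "EXMPX"],
]
buffer = io.StringIO()
write_csv(flatten_rows(sample), buffer)
```

For a real run the stream would be a file opened with open(path, "w", newline=""), and the records from every page would be appended to one list before the single write.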

Here is the file: TrustTable_Parse.xlsb

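Disclaimer: I have a B.S. in mathematics and taught myself to code in VBA, SQL, and R by implementing many of my own projects in each language. The point is, if my code looks weird or you think I am doing things inefficiently, it is because I haven't been coding for many years and I don't know any better, lol.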


Option Explicit

Sub Final_Parse_TrustTables()

Dim HTML As New HTMLDocument
Dim http As Object
Dim links As Object
Dim Url, Trst As String
Dim link As HTMLHtmlElement
Dim i As Long

Dim http2 As Object
Dim HTML2 As New HTMLDocument
Dim tbl As Object
Dim ele As HTMLHtmlElement

Dim wb As Workbook
Dim ws, ws_2 As Worksheet

    'Sets ScreenUpdating to False, _
     turns off event triggers, etc.

 Set wb = ThisWorkbook

 Set ws = wb.Sheets("CIK_Links")

 'Creates this object to see if Trust table exists
 Set http = CreateObject("MSXML2.ServerXMLhttp.6.0")

  'Loops through the list of URLs _
  in the 'CIK_Links' sheet
  For i = 2 To 3000

   'List of URLs
    Url = ws.Range("C1").Cells(i, 1).Value2

    'Gets webpage to check _
    if Trust table exists
    On Error Resume Next
    http.Open "GET", Url, False
    http.send

    'Runs code If the website sent a valid response to our request _
    for FIRST http object
    If Err.Number = 0 Then

     If http.Status = 200 Then

      'If the website sent a valid response to our request _
      for SECOND http object "http2"
      If Err.Number = 0 Then

       If http2.Status = 200 Then

        HTML.body.innerHTML = http.responseText

        Set links = HTML.getElementsByTagName("a")

        'Determines if there is a trust table and if so _
        then it creates the http2 object and gets the _
        trust table responsetext 
        Trst = "(List all Funds and Classes/Contracts for"
        For Each link In links
            'Link is returned in responsetext with "about:/" at _
            the beginning instead of https://www.sec.gov/, so I _
            used this to replace the "about:/"
            If InStr(link.innerHTML, Trst) > 0 Then
                link = Replace(link, "about:/", "https://www.sec.gov/")
                Debug.Print link

        'Creates this object to go to trust table webpage
        Set http2 = CreateObject("MSXML2.ServerXMLhttp.6.0")

        'Gets webpage to parse _
        trust table
        On Error Resume Next
        http2.Open "GET", link, False
        http2.send

            HTML2.body.innerHTML = http2.responseText

                'If there exists a Trust, then this refers to the _
                3rd table on the trust table webpage; _
                note ("table")(3)
                On Error Resume Next
                Set tbl = HTML2.getElementsByTagName("table")(3)

                Set ws_2 = wb.Sheets("Parsed_Tables")

                With ws_2

                    For Each ele In tbl.getElementsByTagName("tr")
                    'First finds rows with Class/Con numbers
                    If InStr(ele.innerText, "C00") Then
                     'Pulls Class/Con Numbers, note children(2)
                       'output to col E sheet
                        .Cells(Rows.Count, "E"). _
                        End(xlUp).Offset(1, 0).Value2 = ele.Children(2).innerText

                      'Outputs Share Class, children(3)
                        'Output to col F sheet
                        .Cells(Rows.Count, "F"). _
                        End(xlUp).Offset(1, 0).Value2 = ele.Children(3).innerText

                      'Not all Funds have a Ticker, _
                       so this keeps the module from _
                       asking for object to be set
                      On Error Resume Next
                      'Outputs Ticker to Excel
                         'Reads the last value in Col F and offsets Ticker _
                         to show directly in the adjacent cell in Col G
                         .Cells(Rows.Count, "F"). _
                         End(xlUp).Offset(0, 1).Value2 = ele.Children(4).innerText

                    'Pulls SIC number
                    ElseIf InStr(ele.innerText, "S00") Then
                        'Offsets from col F to be placed in col C
                        .Cells(Rows.Count, "F"). _
                        End(xlUp).Offset(1, -3).Value2 = ele.Children(1).innerText

                      'Pulls Fund Name
                        'Offsets from col F to col D
                        .Cells(Rows.Count, "F"). _
                        End(xlUp).Offset(1, -2).Value2 = ele.Children(2).innerText

                    'Pulls CIK number
                    ElseIf InStr(ele.Children(0).innerText, "000") Then
                        'Offset from col F to col A
                        .Cells(Rows.Count, "F"). _
                        End(xlUp).Offset(1, -5).Value2 = ele.Children(0).innerText

                      'Pulls Trust Name
                        'Offsets from col F to col B
                        .Cells(Rows.Count, "F"). _
                        End(xlUp).Offset(1, -4).Value2 = ele.Children(1).innerText

                    End If

                    'Counts the number of iterations of the loop _
                     and shows it in the status bar (lower left corner)
                    Application.StatusBar = "Current Iteration: " & i

                    Next ele

               End With

            End If

        Next link

        End If

      Else

        MsgBox "Error loading webpage", vbExclamation, "Alert!!!"
        Exit Sub

      End If
      On Error GoTo 0

     End If

    Else

     MsgBox "Error loading webpage", vbExclamation, "Alert!!!"
     Exit Sub

    End If

On Error GoTo 0

 If i Mod 1000 = 0 Then
  Application.Wait (Now + TimeValue("0:00:03"))
 End If

Next i

    'sets everything back to normal after running code 

End Sub

Here is a sample of the links listed in the CIK_Links sheet:


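I don't think your code would run unless at least one On Error Resume Next were hiding some runtime errors. For example, you test If http2.Status = 200 Then before the http2 object has been instantiated.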

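Below is an approach that can definitely be improved, but it uses a class to hold the xmlhttp object and to provide the methods that retrieve the required info. The layout of your desired table makes parsing the actual webpage particularly complicated. You may wish to stick with that layout; I have chosen to use the table structure as it is on the page. Perhaps this at least gives you a framework. You can add your own optimization sub calls to it.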


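It would be worth seeing whether the size of a single oversized results array, one that can hold all results, can be estimated up front. That would replace the array of arrays and let the write-out happen in one go. If I have time I will make that amendment.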

Class clsHTTP

Option Explicit

Private http As Object
Const SEARCH_TERM As String = "(List all Funds and Classes/Contracts"

Private Sub Class_Initialize()
    Set http = CreateObject("MSXML2.XMLHTTP")
End Sub

Public Function GetString(ByVal Url As String, Optional ByVal search As Boolean = False) As String
    Dim sResponse As String
    searchTermFound = False
    With http
        .Open "GET", Url, False
        'Old If-Modified-Since date to avoid being served a cached page
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
        'Only flag the search term when the caller asks for the check
        If search Then searchTermFound = (InStr(sResponse, SEARCH_TERM) > 0)
        GetString = sResponse
    End With
End Function

Public Function GetLink(ByVal html As HTMLDocument) As String
    Dim i As Long, nodeList As Object
    Set nodeList = html.querySelectorAll("a")
    GetLink = vbNullString
    For i = 0 To nodeList.Length - 1
        If InStr(nodeList.item(i).innerText, SEARCH_TERM) > 0 Then
            'The parsed href starts with "about:/", so swap in the site root
            GetLink = Replace$(nodeList.item(i).href, "about:/", "https://www.sec.gov/")
            Exit For
        End If
    Next i
End Function

Public Function GetInfo(ByVal html As HTMLDocument) As Variant
    Dim CIK As String, table As HTMLTable, tables As Object, tRows As Object
    Dim arr(), tr As Object, td As Object, r As Long, c As Long

    Set tables = html.querySelectorAll("table")

    If tables.Length > 3 Then
        CIK = "'" & html.querySelector(".search").innerText
        Set table = tables.item(3)
        Set tRows = table.getElementsByTagName("tr")
        ReDim arr(1 To tRows.Length, 1 To 6)
        Dim numColumns As Long, numBlanks As Long

        For Each tr In tRows
            numColumns = tr.getElementsByTagName("td").Length
            r = r + 1: c = 2: numBlanks = 0
            If r > 4 Then   'skip the first four rows of the table
                arr(r - 4, 1) = CIK
                For Each td In tr.getElementsByTagName("td")
                    If td.innerText = vbNullString Then numBlanks = numBlanks + 1
                    arr(r - 4, c) = td.innerText
                    c = c + 1
                Next td
                'Stop at the first completely blank row
                If numBlanks = numColumns Then Exit For
            End If
        Next tr
    Else
        'No table at index 3: return a placeholder whose UBound is 1, _
         which the caller skips
        ReDim arr(1, 1)
        GetInfo = arr
        Exit Function
    End If

    arr = Application.Transpose(arr)
    ReDim Preserve arr(1 To 6, 1 To r - 4)
    arr = Application.Transpose(arr)
    GetInfo = arr
End Function

Standard module 1

Option Explicit
Public searchTermFound As Boolean

Public Sub GetInfo()
    Dim wsLinks As Worksheet, links(), link As Long, http As clsHTTP
    Dim lastRow As Long, html As HTMLDocument, newURL As String
    Set wsLinks = ThisWorkbook.Worksheets("CIK_Links")
    Set http = New clsHTTP
    Set html = New HTMLDocument
    With wsLinks
        lastRow = GetLastRow(wsLinks, 3)
        If lastRow = 2 Then
            'A single-cell range returns a scalar, so build the 2D array by hand
            ReDim links(1 To 1, 1 To 1)
            links(1, 1) = .Range("C2").Value
        Else
            links = .Range("C2:C" & lastRow).Value
        End If
    End With
    Dim results(), arr(), i As Long, j As Long
    ReDim results(1 To UBound(links, 1))
    For link = LBound(links, 1) To UBound(links, 1)

        If InStr(links(link, 1), "https://www.sec.gov") > 0 Then

            html.body.innerHTML = http.GetString(links(link, 1), True)

            If searchTermFound Then

                newURL = http.GetLink(html)
                html.body.innerHTML = http.GetString(newURL, False)
                arr = http.GetInfo(html)

                If UBound(arr, 1) > 1 Then
                    i = i + 1
                    results(i) = arr
                End If
            End If
        End If
    Next link

    Dim wsOut As Worksheet
    Set wsOut = ThisWorkbook.Worksheets("Parsed_Tables")

    For j = 1 To i
        arr = results(j)
        With wsOut
             'Write each block below the last used row in column A
             .Cells(GetLastRow(wsOut, 1) + 1, 1).Resize(UBound(arr, 1), UBound(arr, 2)) = arr
        End With
    Next j
End Sub

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
    End With
End Function
