我使用 Excel VBA 构建了一个网络爬虫,它执行以下操作:
- 从名为“CIK_Links”的工作表中的链接列表中一次读取一个链接。
- 它转到链接,读取其响应文本,如果在该响应文本中找到一个超链接,其 innerHTML 为“(所有基金和类别/合同的列表”),那么它将该链接保存到一个变量中并创建另一个 MSXML2.ServerXMLhttp .6.0 对象。
- 创建对象后,它会在响应文本中找到第三个表,循环并找到该表的特定元素,然后将这些值输出到 Excel 中名为“Parsed_Tables”的工作表中。
- 然后代码转到“CIK_Links”表上的下一个链接并重复步骤 1-3。注意:表格中有大约 640,000 个链接,但我一次只运行几千个链接。是的,我曾尝试一次运行 10、20、100 次,但问题仍然存在。
我遇到的问题是,一旦我点击运行,我就会收到消息“Excel 没有响应”,但代码仍然在后台运行。考虑到我要求它执行的操作,该代码运行良好并且速度非常快,但显然我需要进一步优化它以防止它使 Excel 过载。找到某种方法来避免在每次迭代时将解析的 HTML 写入 Excel 会很有帮助,但是,我不知道如何在不这样做的情况下以我需要的格式写入数据。数组解决方案会很棒,但是在将数组中的数据写入 Excel 之前,必须对其进行大量操作,甚至可能对数组进行子集化/切片。我需要帮助,因为我已经用尽了我所有的知识,并且在构建这个应用程序的过程中我做了很多研究。我什至愿意使用其他技术,如 python 和 beautifulsoup 库,我只是不知道如何以我需要的格式将表数据输出到 csv 文件。提前致谢!
这是文件: TrustTable_Parse.xlsb
免责声明:我拥有数学学士学位,并且通过在每种语言中实现我自己的许多项目,自学了如何使用 VBA、SQL 和 R 进行编码。重点是,如果我的代码看起来很奇怪,或者你认为我做事效率低下,那是因为我已经多年没有编码了,而且我不知道更好,哈哈。
下面是我的代码:
Option Explicit
Sub Final_Parse_TrustTables()
Dim HTML As New HTMLDocument
Dim http As Object
Dim links As Object
Dim Url, Trst As String
Dim link As HTMLHtmlElement
Dim i As Long
Dim http2 As Object
Dim HTML2 As New HTMLDocument
Dim tbl As Object
Dim ele As HTMLHtmlElement
Dim wb As Workbook
Dim ws, ws_2 As Worksheet
'sets ScreenUpdating to false _
turns off event triggers, ect.
OptimizeCode_Begin
Set wb = ThisWorkbook
Set ws = wb.Sheets("CIK_Links")
'Creates this object to see if Trust table exists
Set http = CreateObject("MSXML2.ServerXMLhttp.6.0")
'Loops through the list of URL's _
in the 'CIK_Links' Sheet
For i = 2 To 3000
'List of URL's
Url = ws.Range("C1").Cells(i, 1).Value2
'Gets webpage to check _
if Trust table exists
On Error Resume Next
http.Open "GET", Url, False
http.send
'Runs code If the website sent a valid response to our request _
for FIRST http object
If Err.Number = 0 Then
If http.Status = 200 Then
'If the website sent a valid response to our request _
for SECOND http object "http2"
If Err.Number = 0 Then
If http2.Status = 200 Then
HTML.body.innerHTML = http.responseText
Set links = HTML.getElementsByTagName("a")
'Determines if there is a trust table and if so _
then it creates the http2 object and gets the _
trust table responsetext
Trst = "(List all Funds and Classes/Contracts for"
For Each link In links
'Link is returned in responsetext with "about:/" at _
the beginning instead of https://www.sec.gov/, so I _
used this to replace the "about:/"
If InStr(link.innerHTML, Trst) > 0 Then
link = Replace(link, "about:/", "https://www.sec.gov/")
Debug.Print link
'Creates this object to go to trust table webpage
Set http2 = CreateObject("MSXML2.ServerXMLhttp.6.0")
'Gets webpage to parse _
trust table
On Error Resume Next
http2.Open "GET", link, False
http2.send
HTML2.body.innerHTML = http2.responseText
'If there exists a Trust, then this refers to the _
3rd table on the trust table webpage; _
note ("table")(3)
On Error Resume Next
Set tbl = HTML2.getElementsByTagName("table")(3)
Set ws_2 = wb.Sheets("Parsed_Tables")
With ws_2
For Each ele In tbl.getElementsByTagName("tr")
'First finds rows with Class/Con numbers
If InStr(ele.innerText, "C00") Then
'Pulls Class/Con Numbers, note children(2)
'output to col E sheet
.Cells(Rows.Count, "E"). _
End(xlUp).Offset(1, 0).Value2 = ele.Children(2).innerText
'Outputs Share Class, children(3)
'Output to col F sheet
.Cells(Rows.Count, "F"). _
End(xlUp).Offset(1, 0).Value2 = ele.Children(3).innerText
'Not not all Funds have Ticker _
so this keeps the module from _
asking for object to be set
On Error Resume Next
'Outputs Ticker to excel
'Reads the last value in Col F and offsets Ticker to _
to show directly in adjacent cel in Col G
.Cells(Rows.Count, "F"). _
End(xlUp).Offset(0, 1).Value2 = ele.Children(4).innerText
'Pulls SIC number
ElseIf InStr(ele.innerText, "S00") Then
'Offsets from col F to be placed in col C
.Cells(Rows.Count, "F"). _
End(xlUp).Offset(1, -3).Value2 = ele.Children(1).innerText
'Pulls Fund Name
'Offsets from col F to col D
.Cells(Rows.Count, "F"). _
End(xlUp).Offset(1, -2).Value2 = ele.Children(2).innerText
'Pulls CIK number
ElseIf InStr(ele.Children(0).innerText, "000") Then
'Offset from col F to col A
.Cells(Rows.Count, "F"). _
End(xlUp).Offset(1, -5).Value2 = ele.Children(0).innerText
'Pulls Trust Name
'Offsets from col F to col B
.Cells(Rows.Count, "F"). _
End(xlUp).Offset(1, -4).Value2 = ele.Children(1).innerText
End If
'Counts the number of iterations of the loop _
and places it in the lower left corner of the _
workbook
Application.StatusBar = "Current Iteration: " & i
Next
End With
End If
Next
End If
Else
MsgBox "Error loading webpage", vbExclamation, "Alert!!!"
Exit Sub
End If
On Error GoTo 0
End If
Else
MsgBox "Error loading webpage", vbExclamation, "Alert!!!"
Exit Sub
End If
On Error GoTo 0
If i Mod 1000 = 0 Then
ActiveWorkbook.Save
Application.Wait (Now + TimeValue("0:00:03"))
End If
Next i
'sets everything back to normal after running code
OptimizeCode_End
End Sub
以下是 CIK_Links 表中列出的链接示例:
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=3&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=11&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=13&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=14&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=17&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=18&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2110&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2135&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2145&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2663&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2664&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2691&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2768&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=3521&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=3794&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=4123&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=4405&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=4568&owner=include&count=02