我正在尝试抓取以下网站:https ://echa.europa.eu/brief-profile/-/briefprofile/100.047.293#ScientificProperties
到目前为止,我有以下代码返回每个h4
标签的内部文本。
Sub getContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim SubTag As MSHTML.IHTMLElementCollection
Dim SubName As MSHTML.IHTMLElement
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set SubTag = HTMLDoc.getElementsByTagName("dt")
For Each SubName In SubTag
Debug.Print SubName.innerText
Next SubName
End Sub
虽然这会返回很多有用的信息,但也会返回带有标签名称的元素dt
某些结果是不想要的(突出显示),但我不确定如何限制此列表。探索 HTML 似乎没有任何标签/ID 来区分这些。这也让我想知道我是否以最好的方式提取信息?
同样,如果对于列表中每个未突出显示的项目,我希望在页面上捕获与它们关联的值,例如
C Physical state at 20°C and 1013 hPa Solid (100%) [1]
C Form Crystalline (100%) [1]
C Odour Other (100%) [1]
C Substance type Organic (100%) [1]
And so on...
此信息的标签名称是“dd”,但我不确定如何同时返回两个结果。我希望在即时窗口中,我可以获得物理和化学属性的列表,并且每个值的右侧也会返回值。
尝试这个我有以下代码导致不匹配错误,但我不明白我做错了什么。
Sub getContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim SubTag As MSHTML.IHTMLElementCollection
Dim SubName As MSHTML.IHTMLElement
Dim SubInfo As MSHTML.IHTMLElement
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set SubTag = HTMLDoc.getElementsByTagName("dt")
Set SubInfo = SubTag.tags("dd")
For Each SubName In SubTag
Debug.Print SubName.innerText, SubInfo.innerText
Next SubName
End Sub
我很欣赏它的长篇文章,但如果有人可以评论我做错了什么,那就太好了。
更新:
以下代码更好地在即时窗口中实现所需的数据。
Sub GetContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim SubSectList As MSHTML.IHTMLElement
Dim SubSects As MSHTML.IHTMLElementCollection
Dim SubSect As MSHTML.IHTMLElement
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set SubSectList = HTMLDoc.getElementsByClassName("col-xs-12 col-lg-10 MainContent")(1)
Set SubSects = SubSectList.getElementsByTagName("dt")
'Debug.Print SubSects.Length
For Each SubSect In SubSects
Debug.Print SubSect.innerText & " : "; SubSect.NextSibling.innerText
Next SubSect
End Sub