1

我正在尝试抓取以下网站:https ://echa.europa.eu/brief-profile/-/briefprofile/100.047.293#ScientificProperties

到目前为止,我有以下代码返回每个h4标签的内部文本。

Sub getContents()
        
            Dim XMLReq As New MSXML2.XMLHTTP60
            Dim HTMLDoc As New MSHTML.HTMLDocument
            
            Dim SubTag As MSHTML.IHTMLElementCollection
            Dim SubName As MSHTML.IHTMLElement
            
            XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
            XMLReq.send
            
            If XMLReq.Status <> 200 Then
            
                MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
                Exit Sub
            End If
            
            HTMLDoc.body.innerHTML = XMLReq.responseText
            
            Set SubTag = HTMLDoc.getElementsByTagName("dt")
            
            For Each SubName In SubTag
            Debug.Print SubName.innerText
            Next SubName
            
        End Sub

虽然这会返回很多有用的信息,但也会返回带有标签名称的元素dt某些结果是不想要的(突出显示),但我不确定如何限制此列表。探索 HTML 似乎没有任何标签/ID 来区分这些。这也让我想知道我是否以最好的方式提取信息? 在此处输入图像描述

同样,如果对于列表中每个未突出显示的项目,我希望在页面上捕获与它们关联的值,例如

C Physical state at 20°C and 1013 hPa Solid (100%) [1]
C Form Crystalline (100%) [1]
C Odour Other (100%) [1]
C Substance type Organic (100%) [1]
    
    And so on...

此信息的标签名称是“dd”,但我不确定如何同时返回两个结果。我希望在即时窗口中,我可以获得物理和化学属性的列表,并且每个值的右侧也会返回值。

尝试这个我有以下代码导致不匹配错误,但我不明白我做错了什么。

Sub getContents()
        
            Dim XMLReq As New MSXML2.XMLHTTP60
            Dim HTMLDoc As New MSHTML.HTMLDocument
            
            Dim SubTag As MSHTML.IHTMLElementCollection
            Dim SubName As MSHTML.IHTMLElement
            Dim SubInfo As MSHTML.IHTMLElement
            
            XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
            XMLReq.send
            
            If XMLReq.Status <> 200 Then
            
                MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
                Exit Sub
            End If
            
            HTMLDoc.body.innerHTML = XMLReq.responseText
            
            Set SubTag = HTMLDoc.getElementsByTagName("dt")
            Set SubInfo = SubTag.tags("dd")
            
            For Each SubName In SubTag
            Debug.Print SubName.innerText, SubInfo.innerText
            Next SubName
            
        End Sub

我很欣赏它的长篇文章,但如果有人可以评论我做错了什么,那就太好了。

更新:

以下代码更好地在即时窗口中实现所需的数据。

Sub GetContents()
    
        Dim XMLReq As New MSXML2.XMLHTTP60
        Dim HTMLDoc As New MSHTML.HTMLDocument
        
        Dim SubSectList As MSHTML.IHTMLElement
        Dim SubSects As MSHTML.IHTMLElementCollection
        Dim SubSect As MSHTML.IHTMLElement
    
        XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
        XMLReq.send
        
        If XMLReq.Status <> 200 Then
        
            MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
            Exit Sub
        End If
        
        HTMLDoc.body.innerHTML = XMLReq.responseText
        
        Set SubSectList = HTMLDoc.getElementsByClassName("col-xs-12 col-lg-10 MainContent")(1)
        Set SubSects = SubSectList.getElementsByTagName("dt")

        'Debug.Print SubSects.Length
        
        For Each SubSect In SubSects
        Debug.Print SubSect.innerText & " : "; SubSect.NextSibling.innerText
       Next SubSect
 
    End Sub
4

1 回答 1

1

我认为您想限制为 EndpointContent 类元素的子元素的 dt;然后你可以链接 nextSibling 移动到相邻的 dd

Option Explicit

Public Sub GetContents()
    Dim XMLReq As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
        
    XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
    XMLReq.send
                  
    HTMLDoc.body.innerHTML = XMLReq.responseText

    Dim i As Long
    
    With HTMLDoc.querySelectorAll(".EndpointContent dt")
        For i = 0 To .Length - 1
            Debug.Print .Item(i).innerText & " : " & .Item(i).NextSibling.NextSibling.innerText
            Debug.Print
        Next
    End With
End Sub
于 2021-05-13T02:10:01.053 回答