html - 无法将一些损坏的链接变成合格的链接

Question

我正在尝试在 vba 中创建一个脚本，该脚本将在任何给定网站contact中查找任何链接或contact us链接，以便提供合格的/可用链接。我当前的脚本确实解析了联系链接，但大多数时候它们没有资格在以后重用，这意味着损坏的链接。

到目前为止我已经尝试过：

Sub FetchCustomizedLink()
    Dim Http As New XMLHTTP60, Html As New HTMLDocument
    Dim link As Variant, links As Variant, targetlink$

    links = Array( _
        "http://www.innovaprint.com.sg/", _
        "https://www.plexure.com.sg/", _
        "http://www.mount-zion.biz/", _
        "https://stackoverflow.com/" _
    )

    For Each link In links
        targetlink = None

        With Http
            .Open "GET", link, False
            .setRequestHeader "User-Agent", "Mozilla/5.0"
            On Error Resume Next
            .send
            On Error GoTo 0
            Html.body.innerHTML = .responseText
        End With

        With Html.querySelectorAll("a[href]")
            For I = 0 To .Length - 1
                If InStr(1, .item(I).innerText, "contact", 1) > 0 Then
                    targetlink = .item(I).getAttribute("href")
                    Exit For
                End If
            Next I
        End With
        Debug.Print targetlink
    Next link
End Sub

我得到的输出：

about:/contact.html
https://www.plexure.com.sg/contact
about:contactus.html
https://stackoverflow.com/company/contact

我希望得到的输出：

http://www.innovaprint.com.sg/contact.html
https://www.plexure.com.sg/contact
http://www.mount-zion.biz/contactus.html
https://stackoverflow.com/company/contact

如何将损坏的链接变成合格的链接？

score 1 · Accepted Answer

最后我做到了。我必须在函数中使用InStr()函数Left()来挑选出基本 url，然后使用Replace()函数和Like运算符来构建合格的contact链接。

Sub FetchCustomizedLink()
    Dim Http As New XMLHTTP60, Html As New HTMLDocument
    Dim link As Variant, links As Variant, targetlink$
    Dim base$, refinedportion$, refinedlink$

    links = Array( _
        "http://www.innovaprint.com.sg/", _
        "https://www.plexure.com.sg/", _
        "http://www.mount-zion.biz/", _
        "https://stackoverflow.com/", _
        "https://www.yellowpages.com/" _
    )

    For Each link In links
        targetlink = None

        With Http
            .Open "GET", link, False
            .setRequestHeader "User-Agent", "Mozilla/5.0"
            On Error Resume Next
            .send
            On Error GoTo 0
            Html.body.innerHTML = .responseText
        End With

        With Html.querySelectorAll("a[href]")
            For I = 0 To .Length - 1
                If InStr(1, .item(I).innerText, "contact", 1) > 0 Then
                    targetlink = .item(I).getAttribute("href")
                    Exit For
                End If
            Next I
        End With

        If InStr(link, "http:") > 0 Then
            base = Left(link, InStr(8, link, "/") - 1)
        ElseIf InStr(link, "https:") > 0 Then
            base = Left(link, InStr(9, link, "/") - 1)
        End If

        refinedportion = Replace(targetlink, "about:", "")

        If refinedportion Like "[/]*" Then
            refinedlink = base & refinedportion
        ElseIf refinedportion Like "[h]*" Then
            refinedlink = refinedportion
        Else
            refinedlink = base & "/" & refinedportion
        End If
        Debug.Print refinedlink
    Next link
End Sub

它产生什么：

http://www.innovaprint.com.sg/contact.html
https://www.plexure.com.sg/contact
http://www.mount-zion.biz/contactus.html
https://stackoverflow.com/company/contact
https://www.yellowpages.com/about/contact-us

html - 无法将一些损坏的链接变成合格的链接

1 回答 1

Related

Reference