
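How can someone using Visual Basic (version 6, in my case) strip all HTML tags and get plain text? I was able to do this with HTML Purifier, but that is PHP. Is there a function, class, or script in VB6 that would let me do this? I need to process pages larger than 5MB, and PHP really isn't efficient at that.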

So, again, how do I convert this:

<!DOCTYPE html>
<html>
<head>
<title>Title</title>
</head>
<body>
<p>Paragraph 1</p>
<div>Section</div>
Hello!
</body>
</html>

into, let's say, this:

Paragraph 1
Section
Hello!

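I thought about building an API system to do this, but found that it isn't reliable.

PS: I'm doing this because I'm building a crawler for my search engine, and I only have experience with VB and PHP.

Thanks in advance.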


4 Answers


I know this thread is old, but I wrote this today. It isn't elegant, but it works well.

    Public Function RemoveHTML(HTMLstring As String) As String

        ' Nothing to strip if the input contains no tags.
        If Not HTMLstring.Contains("<") Then Return HTMLstring

        ' Start in recording mode so that text before the first tag is kept.
        Dim DoRec As Boolean = True
        ' StringBuilder avoids repeated string copies on multi-megabyte pages.
        Dim textOut As New System.Text.StringBuilder()

        Dim SkipMe As Boolean = False
        Dim SkipMeTag As String = ""

        For l As Integer = 1 To HTMLstring.Length
            Dim tmp As String = Mid(HTMLstring, l, 1)

            ' Enable skip-me mode (for large blocks of non-readable code). Tag names are matched case-insensitively.
            If tmp = "<" AndAlso LCase(Mid(HTMLstring, l + 1, 6)) = "script" Then SkipMe = True : SkipMeTag = "script" : DoRec = False
            If tmp = "<" AndAlso LCase(Mid(HTMLstring, l + 1, 5)) = "style" Then SkipMe = True : SkipMeTag = "style" : DoRec = False

            ' If we're already in skip-me mode, figure out whether it's time to exit it.
            If SkipMe = True Then
                If tmp = "<" AndAlso LCase(Mid(HTMLstring, l + 1, Len(SkipMeTag) + 1)) = "/" & SkipMeTag Then
                    SkipMe = False
                    tmp = ""
                    ' Jump past "/script" or "/style"; the closing ">" is handled on the next pass.
                    l = l + Len(SkipMeTag) + 1
                    DoRec = False
                End If
            End If

            ' If we aren't in skip-me mode, move on to parsing the HTML content (pulling text out from between tags).
            If SkipMe = False Then
                If tmp = ">" Then DoRec = True : textOut.Append(" ") : tmp = ""
                If tmp = "<" Then DoRec = False : tmp = ""

                If DoRec = True Then
                    textOut.Append(tmp)
                End If
            End If

        Next

        Return textOut.ToString()
    End Function
answered 2015-06-24T02:14:07.717

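@Matth3w's code is great, but it is not compatible with VB6 (Visual Basic 6).

I have downgraded his code to VB6 and added some useful extras.

1) If your HTML text contains Unicode (UTF-8) characters, add the Microsoft Forms 2.0 Object Library and use its TextBox control for both input and output.

2) Add 2 text boxes and 1 command button.

3) Set the text box properties: MultiLine = True, Font = Tahoma (or anything other than MS Sans Serif), ScrollBars = 3.

4) Paste the following code into the code window: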

Private Sub Command1_Click()
    TextBox2.Text = RemoveHTML(TextBox1.Text)
End Sub

Public Function RemoveHTML(ByVal HTMLstring As String) As String

    ' DoRec starts as False, so nothing is recorded until the first ">" has been seen.
    Dim DoRec As Boolean
    Dim textOut As String

    Dim SkipMe As Boolean
    Dim SkipMeTag As String

    Dim tmp As String
    Dim l As Long

    ' Turn paragraph ends and line breaks into newlines, and common entities into plain characters.
    ' vbTextCompare matches case-insensitively without lowercasing the page text itself.
    HTMLstring = Replace(HTMLstring, "</p>", vbCrLf, , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "<br>", vbCrLf, , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "<br/>", vbCrLf, , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&zwnj;", " ", , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&nbsp;", " ", , , vbTextCompare)

    HTMLstring = Replace(HTMLstring, "&sect;", "-", , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&ndash;", "-", , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&mdash;", "-", , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&rlm;", "", , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&ldquo;", ChrW(34), , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&rdquo;", ChrW(34), , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&lsquo;", ChrW(34), , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&rsquo;", ChrW(34), , , vbTextCompare)

    HTMLstring = Replace(HTMLstring, "&laquo;", ChrW(34), , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&raquo;", ChrW(34), , , vbTextCompare)

    For l = 1 To Len(HTMLstring)
        tmp = Mid(HTMLstring, l, 1)

        ' Enable skip-me mode (for large blocks of non-readable code). Tag names are matched case-insensitively.
        If tmp = "<" And LCase(Mid(HTMLstring, l + 1, 6)) = "script" Then SkipMe = True: SkipMeTag = "script": DoRec = False
        If tmp = "<" And LCase(Mid(HTMLstring, l + 1, 5)) = "style" Then SkipMe = True: SkipMeTag = "style": DoRec = False

        ' If we're already in skip-me mode, figure out whether it's time to exit it.
        If SkipMe = True Then
            If tmp = "<" And LCase(Mid(HTMLstring, l + 1, Len(SkipMeTag) + 1)) = "/" & SkipMeTag Then
                SkipMe = False
                tmp = ""
                ' Jump past "/script" or "/style"; the closing ">" is handled on the next pass.
                l = l + Len(SkipMeTag) + 1
                DoRec = False
            End If
        End If

        ' If we aren't in skip-me mode, move on to parsing the HTML content (pulling text out from between tags).
        If SkipMe = False Then
            If tmp = ">" Then DoRec = True: textOut = textOut & " ": tmp = ""
            If tmp = "<" Then DoRec = False: tmp = ""

            If DoRec = True Then
                textOut = textOut & tmp
            End If
        End If

    Next

    RemoveHTML = textOut
End Function

(Persian support) To change the old-style Persian ي (U+064A) to the modern Persian ی (U+06CC), you can add the following line:

HTMLstring = Replace(HTMLstring, ChrW(1610), ChrW(1740))

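Update: this function has an important bug. If it does not find any HTML tags in your variable, it returns an empty string (recording only starts after the first ">")! To be safe, use the following check:

If Len(RemoveHTML(variable)) > 0 Then variable = RemoveHTML(variable)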
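
As an alternative to guarding every call, the scanner itself can start in recording mode. Here is a minimal sketch of that idea, written in Python purely for illustration (it is not a drop-in replacement for the VB6 code, and the names `strip_tags_scan` and `_at` are hypothetical). It assumes the same simple model as above: everything between "<" and ">" is a tag, and script/style blocks are skipped whole.

```python
def _at(html: str, pos: int, word: str) -> bool:
    """True if `word` occurs at `pos`, ignoring case."""
    return html[pos:pos + len(word)].lower() == word


def strip_tags_scan(html: str) -> str:
    """Character scanner in the spirit of RemoveHTML; recording starts ON."""
    out = []
    recording = True   # tag-free input and leading text pass through unchanged
    skip_tag = ""      # "script" or "style" while inside such a block
    i = 0
    while i < len(html):
        ch = html[i]
        if skip_tag:
            closing = "</" + skip_tag
            if _at(html, i, closing):
                skip_tag = ""
                i += len(closing)   # continue at whatever follows the tag name
            else:
                i += 1
            continue
        if ch == "<":
            for name in ("script", "style"):
                if _at(html, i + 1, name):
                    skip_tag = name
            recording = False
        elif ch == ">":
            recording = True
            out.append(" ")         # same separator the answer's code emits
        elif recording:
            out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(repr(strip_tags_scan("plain text")))
    print(repr(strip_tags_scan("<p>Hi</p>")))
```

With this change a string without any tags comes back as-is, so the extra length check is no longer needed.
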
answered 2017-01-04T07:58:21.610

I have a snippet in C#... but you can port it to VB very easily :)

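using System.Text.RegularExpressions;

/// <summary>
/// Remove HTML from string with Regex.
/// </summary>
public static string StripTagsRegex(string source)
{
    // Singleline lets "." match newlines, so tags that span several lines are removed too.
    return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
}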
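
For comparison, the same regex idea as a minimal Python sketch (Python is used here only for illustration, and `strip_tags_regex` is a hypothetical name). It also shows a limitation of this approach: only the tags are removed, so the contents of script and style blocks stay in the output.

```python
import re

# Non-greedy match from "<" to the nearest ">"; DOTALL lets "." cross newlines.
TAG_PATTERN = re.compile(r"<.*?>", re.DOTALL)


def strip_tags_regex(source: str) -> str:
    """Remove anything that looks like a tag; text between tags is untouched."""
    return TAG_PATTERN.sub("", source)


if __name__ == "__main__":
    print(strip_tags_regex("<p>Paragraph 1</p>"))
    # The script body is text, not a tag, so it survives:
    print(strip_tags_regex("<script>x=1</script>ok"))
```

If script or style content must be dropped as well, it has to be removed in a separate step before the tags are stripped.
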
answered 2013-07-19T20:41:13.480

Considering how badly flawed most of the HTML you find is, I have found it much easier to tidy it up first, using the technique described in "HTML Parsing?".

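The cleaned-up HTML is then suitable for parsing with any of several techniques, from loading it into an XML DOM, to using a SAX parser, to hand-coded parsing, to regular expressions (if you insist on making life hard for yourself and for any maintainers who come after you).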
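
To give an idea of the SAX route, here is a minimal sketch using Python's standard `xml.sax` module rather than a VB6 SAX parser. It assumes the input has already been tidied into well-formed XML; the class name `TextCollector`, the function `sax_text`, and the exclusion list are all hypothetical choices for illustration.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects character data, ignoring the contents of selected elements."""

    IGNORED = {"script", "style", "title"}  # illustrative exclusion list

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.ignore_depth = 0  # > 0 while inside an ignored element

    def startElement(self, name, attrs):
        if self.ignore_depth or name.lower() in self.IGNORED:
            self.ignore_depth += 1

    def endElement(self, name):
        if self.ignore_depth:
            self.ignore_depth -= 1
        else:
            self.chunks.append(" ")  # keep text of neighbouring elements apart

    def characters(self, content):
        # The parser may deliver text in several pieces, so just accumulate.
        if not self.ignore_depth:
            self.chunks.append(content)


def sax_text(xhtml: str) -> str:
    handler = TextCollector()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(handler.chunks).split())


if __name__ == "__main__":
    sample = ("<html><head><title>Title</title></head>"
              "<body><p>Paragraph 1</p><div>Section</div>\nHello!\n</body></html>")
    print(sax_text(sample))
```

Because the handler only sees one event at a time, this style does not need the whole document in memory, which matters for multi-megabyte pages.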

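If your documents are fairly small, the DOM is the easiest approach. Once the cleaned-up HTML has been loaded as XML, you can simply walk the node tree, extracting any non-empty text property. It is easy to use an exclusion list of nodeName or baseName values for the tags you want to ignore.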
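
The tree walk described above could look roughly like this. It is a sketch in Python using the standard `xml.dom.minidom` module as a stand-in for an XML DOM in VB6, and it assumes the input is already well-formed; `EXCLUDED`, `collect_text` and `dom_text` are hypothetical names.

```python
from xml.dom import minidom

# Illustrative exclusion list, compared against each element's nodeName.
EXCLUDED = {"script", "style", "title"}


def collect_text(node, out):
    """Depth-first walk that appends every non-empty text node to `out`."""
    if node.nodeType == node.TEXT_NODE:
        text = node.data.strip()
        if text:
            out.append(text)
        return
    if node.nodeType == node.ELEMENT_NODE and node.nodeName.lower() in EXCLUDED:
        return  # skip this element and everything below it
    for child in node.childNodes:
        collect_text(child, out)


def dom_text(xhtml: str) -> str:
    document = minidom.parseString(xhtml)
    out = []
    collect_text(document, out)
    return "\n".join(out)


if __name__ == "__main__":
    sample = ("<html><head><title>Title</title></head>"
              "<body><p>Paragraph 1</p><div>Section</div>\nHello!\n</body></html>")
    print(dom_text(sample))
```

The whole document is held in memory as a tree here, which is why this approach suits small documents better than large ones.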

answered 2013-07-20T04:59:25.433