html - How do I strip all HTML tags and entities in Visual Basic and get clean text?

Question

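How can someone using Visual Basic (version 6 in my case) strip all HTML tags and get plain text? I managed to do this with HTML Purifier, but that is in PHP. Is there a function, class, or script for VB6 that would let me do this? I need to process pages larger than 5 MB, and PHP really isn't efficient enough for that.

So, once more, how do I convert this: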

<!DOCTYPE html>
<html>
<head>
<title>Title</title>
</head>
<body>
<p>Paragraph 1</p>
<div>Section</div>
Hello!
</body>
</html>

into, say, this:

Paragraph 1
Section
Hello!

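I thought about building an API system to do this, but found it unreliable.

PS: I'm doing this because I'm building a crawler for my search engine, and I only have experience with VB and PHP.

Thanks in advance.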

score 2 · Accepted Answer

I know this thread is old, but I wrote this today. It isn't elegant, but it works well.

    Public Function RemoveHTML(HTMLstring As String) As String

        If Not HTMLstring.Contains("<") Then Return HTMLstring

        Dim DoRec As Boolean = False
        Dim textOut As String = ""

        Dim SkipMe As Boolean = False
        Dim SkipMeTag As String = ""

        For l = 1 To HTMLstring.Length
            Dim tmp As String = Mid(HTMLstring, l, 1)

            ' Enable skip-me mode (for large blocks of non-readable code)
            If tmp = "<" And Mid(HTMLstring, l + 1, 6) = "script" Then SkipMe = True : SkipMeTag = "script" : DoRec = False
            If tmp = "<" And Mid(HTMLstring, l + 1, 5) = "style" Then SkipMe = True : SkipMeTag = "style" : DoRec = False

            ' If we're already in skip-me mode, then figure out if it's time to exit it.
            If SkipMe = True Then
                If tmp = "<" And Mid(HTMLstring, l + 1, Len(SkipMeTag) + 1) = "/" + SkipMeTag Then
                    SkipMe = False
                    tmp = ""
                    l = l + Len(SkipMeTag) + 1
                    DoRec = False
                End If
            End If

            ' If we aren't in skip-me mode, move on to handle parsing of the HTML content (pulling text out from in between tags)
            If SkipMe = False Then
                If tmp = ">" Then DoRec = True : textOut &= " " : tmp = ""
                If tmp = "<" Then DoRec = False : tmp = ""

                If DoRec = True Then
                    textOut &= tmp
                End If
            End If

        Next

        Return textOut
    End Function

score 2 · Accepted Answer

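@Matth3w's code is great, but it isn't compatible with VB6 (Visual Basic 6).

I've downgraded his code to VB6 and added some useful extras.

1) If your HTML text contains Unicode (UTF-8) characters, add the Microsoft Forms 2.0 Object Library and use its TextBox for both input and output.

2) Add two text boxes and one command button.

3) Set the text box properties: MultiLine = True, Font = Tahoma (or anything other than MS Sans Serif), ScrollBars = 3.

4) Paste the following code into the code window: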

Private Sub Command1_Click()
    TextBox2.Text = RemoveHTML(TextBox1.Text)
End Sub

' ByVal keeps the replacements below from modifying the caller's variable.
Public Function RemoveHTML(ByVal HTMLstring As String) As String

    Dim DoRec As Boolean
    Dim textOut As String

    Dim SkipMe As Boolean
    Dim SkipMeTag As String

    Dim tmp As String
    Dim l As Long

    ' Turn paragraph ends and line breaks into newlines and map common entities
    ' to plain characters. vbTextCompare matches case-insensitively without
    ' lower-casing the whole document, so the output keeps its original case.
    HTMLstring = Replace(HTMLstring, "</p>", vbCrLf, , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "<br>", vbCrLf, , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "<br/>", vbCrLf, , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&zwnj;", " ", , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&nbsp;", " ", , , vbTextCompare)

    HTMLstring = Replace(HTMLstring, "&sect;", "-", , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&ndash;", "-", , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&mdash;", "-", , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&rlm;", "", , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&ldquo;", ChrW(34), , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&rdquo;", ChrW(34), , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&lsquo;", ChrW(34), , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&rsquo;", ChrW(34), , , vbTextCompare)

    HTMLstring = Replace(HTMLstring, "&laquo;", ChrW(34), , , vbTextCompare)
    HTMLstring = Replace(HTMLstring, "&raquo;", ChrW(34), , , vbTextCompare)

    For l = 1 To Len(HTMLstring)
        tmp = Mid(HTMLstring, l, 1)

        ' Enable skip-me mode (for large blocks of non-readable code)
        If tmp = "<" And LCase(Mid(HTMLstring, l + 1, 6)) = "script" Then SkipMe = True: SkipMeTag = "script": DoRec = False
        If tmp = "<" And LCase(Mid(HTMLstring, l + 1, 5)) = "style" Then SkipMe = True: SkipMeTag = "style": DoRec = False

        ' If we're already in skip-me mode, then figure out if it's time to exit it.
        If SkipMe = True Then
            If tmp = "<" And LCase(Mid(HTMLstring, l + 1, Len(SkipMeTag) + 1)) = "/" + SkipMeTag Then
                SkipMe = False
                tmp = ""
                l = l + Len(SkipMeTag) + 1
                DoRec = False
            End If
        End If

        ' If we aren't in skip-me mode, move on to handle parsing of the HTML content (pulling text out from in between tags)
        If SkipMe = False Then
            If tmp = ">" Then DoRec = True: textOut = textOut & " ": tmp = ""
            If tmp = "<" Then DoRec = False: tmp = ""

            If DoRec = True Then
                textOut = textOut & tmp
            End If
        End If

    Next

    RemoveHTML = textOut
End Function

(Persian support) To change the old Persian ي to the new Persian ی, you can add the following line:

HTMLstring = Replace(HTMLstring, ChrW(1610), ChrW(1740))

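Update: this function has an important bug. If it doesn't find any HTML tags in your variable, it returns an empty string! To be safe, use the following condition: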

If Len(RemoveHTML(variable)) > 0 Then variable = RemoveHTML(variable)
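
The empty result comes from DoRec starting out False and only becoming True at the first ">", so a string with no tags is never recorded. As an alternative to calling the function twice, here is a minimal sketch of a guard wrapper. RemoveHTMLSafe is a hypothetical name that is not part of the answer above, and it assumes RemoveHTML is defined in the same project:

```vb
' Hypothetical wrapper (not part of the answer above): return the input
' unchanged when it contains no "<", otherwise defer to RemoveHTML.
Public Function RemoveHTMLSafe(ByVal HTMLstring As String) As String
    If InStr(1, HTMLstring, "<") = 0 Then
        ' No tag opener at all, so there is nothing to strip.
        RemoveHTMLSafe = HTMLstring
    Else
        RemoveHTMLSafe = RemoveHTML(HTMLstring)
    End If
End Function
```

This only covers the no-tags case. Text that appears before the first tag of a tagged document is still dropped by RemoveHTML itself.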

score 1 · Accepted Answer

I have a C# snippet... but you can easily port it to VB :)

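// Requires: using System.Text.RegularExpressions;

/// <summary>
/// Remove HTML from string with Regex.
/// </summary>
public static string StripTagsRegex(string source)
{
    // Singleline lets "." match newlines, so tags that span lines are removed too.
    return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
}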
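
For VB6, one possible port is sketched below. It assumes the VBScript regular expression engine is available (the late-bound "VBScript.RegExp" object), and it uses the pattern "<[^>]*>" rather than "<.*?>" because "." does not match line breaks in that engine. Treat it as a sketch, not as the answerer's own code:

```vb
' Sketch of a VB6 port of StripTagsRegex, assuming the VBScript regular
' expression engine is installed on the machine.
Public Function StripTagsRegex(ByVal source As String) As String
    Dim re As Object

    ' Late binding avoids adding a project reference to
    ' "Microsoft VBScript Regular Expressions 5.5".
    Set re = CreateObject("VBScript.RegExp")

    re.Global = True          ' replace every match, not only the first
    re.Pattern = "<[^>]*>"    ' "<", any run of non-">" characters, then ">"

    StripTagsRegex = re.Replace(source, "")
End Function
```

Like the C# original, this removes only the tags themselves. The contents of script and style blocks stay in the output, and entities such as &nbsp; are not decoded.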

score 1 · Accepted Answer

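Considering how badly flawed most HTML you find in the wild is, I find it much easier to tidy it up first, using the technique described in "HTML Parsing?".

The cleaned-up HTML is then suitable for parsing with any of several techniques, from loading it into an XML DOM, to using a SAX parser, to hand-coded parsing, to regular expressions (if you insist on making life hard for yourself and for any maintainers who come after you).

If your documents are fairly small, the DOM is the easiest approach. Once the cleaned HTML is loaded as XML, you can simply walk the node tree and extract any non-empty text properties. It's easy to use an exclusion list of nodeName or baseName values for the tags you want to ignore.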
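
The DOM walk described above might look roughly like this in VB6. This is a sketch under assumptions: the input has already been tidied into well-formed XHTML, MSXML is installed (late-bound "MSXML2.DOMDocument"), and ExtractText, CollectText, and the particular exclusion list are hypothetical choices made for the example:

```vb
' Sketch: collect readable text from tidied, well-formed XHTML with MSXML.
' ExtractText and CollectText are hypothetical names chosen for this example.
Public Function ExtractText(ByVal xhtml As String) As String
    Dim doc As Object
    Dim result As String

    Set doc = CreateObject("MSXML2.DOMDocument")
    doc.async = False             ' load synchronously
    doc.validateOnParse = False   ' skip DTD validation
    doc.resolveExternals = False  ' do not fetch an external DTD

    ' loadXML returns False when the markup is not well-formed XML.
    If doc.loadXML(xhtml) Then
        CollectText doc.documentElement, result
    End If

    ExtractText = result
End Function

Private Sub CollectText(ByVal node As Object, ByRef result As String)
    Const NODE_TEXT As Long = 3
    Dim child As Object
    Dim piece As String

    ' Exclusion list: skip subtrees whose content is not readable body text.
    Select Case LCase$(node.nodeName)
        Case "script", "style", "head"
            Exit Sub
    End Select

    If node.nodeType = NODE_TEXT Then
        ' Turn line breaks and tabs into spaces, then trim the ends.
        piece = Replace(Replace(Replace(node.nodeValue, vbCr, " "), vbLf, " "), vbTab, " ")
        piece = Trim$(piece)
        If Len(piece) > 0 Then result = result & piece & vbCrLf
    End If

    ' Recurse into the children in document order.
    For Each child In node.childNodes
        CollectText child, result
    Next child
End Sub
```

An empty return value here means either that there was no text or that loadXML rejected the input, so real code would probably want to inspect doc.parseError as well. Building the result by repeated concatenation is fine for small documents, which matches the "fairly small" caveat above.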
