html - Strip HTML from a string

Question

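I have tried many things, but nothing seems to work properly. I have an Access DB and am writing code in VBA. I have a string of HTML source code, and I want to strip all HTML code and tags from it so that I am left with a plain-text string containing no HTML or tags. What is the best way to do this?

Thanks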

score 8 · Accepted Answer

One approach that is as resilient as possible to bad markup:

with createobject("htmlfile")
    .open
    .write "<p>foo <i>bar</i> <u class='farp'>argle </zzzz> hello </p>"
    .close
    msgbox "text=" & .body.outerText
end with

score 6 · Accepted Answer

Function StripHTML(cell As Range) As String
    ' Note: Range and cell.Text belong to the Excel object model.
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")

    Dim sInput As String
    Dim sOut As String
    sInput = cell.Text

    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = "<[^>]+>" 'Regular Expression for HTML Tags.
    End With

    sOut = RegEx.Replace(sInput, "")
    StripHTML = sOut
    Set RegEx = Nothing
End Function

This may help you. Good luck.

score 3 · Accepted Answer

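It depends on how complex the HTML structure is and how much data you want to get out of it.

Depending on the complexity, you may get away with a regular expression, but for complex markup, trying to parse data out of HTML with a regex is like trying to eat soup with a fork.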
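To make the regex caveat concrete, here is a minimal sketch that runs a simple tag pattern, `<[^>]+>`, against three inputs. The procedure name `RegexStripDemo` and the sample strings are made up for illustration. The first input is clean markup and loses only its tags. In the second, the script body is not a tag, so the JavaScript source survives into the "plain text". In the third, the stray `<` ... `>` pair in the text is treated as a tag and removed along with the words between them.

```vba
Sub RegexStripDemo()
    ' Hypothetical demo of a simple tag-matching pattern.
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "<[^>]+>"

    ' Simple markup: only the tags are removed.
    Debug.Print re.Replace("<p>foo <i>bar</i></p>", "")

    ' Script bodies are not tags, so the JavaScript source stays in the output.
    Debug.Print re.Replace("<script>var x = 1;</script>text", "")

    ' A stray "<" ... ">" pair in the text is matched as a tag and swallowed.
    Debug.Print re.Replace("<p>a < b and c > d</p>", "")
End Sub
```

Run it from the Immediate window to compare the three results. If your input can contain scripts, styles, or unescaped angle brackets, a DOM-based approach is likely the safer choice.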

You can use the htmlfile object to turn a flat file into an object you can interact with, for example:

Function ParseATable(url As String) As Variant 

    Dim htm As Object, table As Object 
    Dim data() As String, x As Long, y As Long 
    Set htm = CreateObject("HTMLfile") 
    With CreateObject("MSXML2.XMLHTTP") 
        .Open "GET", url, False 
        .send 
        htm.body.innerhtml = .responsetext 
    End With 

    With htm 
        Set table = .getelementsbytagname("table")(0) 
        Redim data(1 To table.Rows.Length, 1 To 10) 
        For x = 0 To table.Rows.Length - 1 
            For y = 0 To table.Rows(x).Cells.Length - 1 
                data(x + 1, y + 1) = table.Rows(x).Cells(y).InnerText 
            Next y 
        Next x 

        ParseATable = data 

    End With 
End Function

score 0 · Accepted Answer

Using early binding:

Public Function GetText(inputHtml As String) As String
    ' Early binding requires a reference to the Microsoft HTML Object Library.
    With New HTMLDocument
        .Open
        .write inputHtml
        .Close
        GetText = .body.outerText
    End With
End Function

score 0 · Accepted Answer

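An improvement on one of the answers above... It finds quotes and line breaks and replaces them with their non-HTML equivalents. In addition, the original function had a problem when a UNC reference was embedded (i.e. <\\server\share\folder\file.ext>). Because the reference starts with < and ends with >, the whole UNC string was removed. This function fixes that so the UNC is inserted into the string correctly: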

Function StripHTML(strString As String) As String
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")

    Dim sInput As String
    Dim sOut As String
    sInput = Replace(strString, "<\\", "\\")

    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = "<[^>]+>" 'Regular Expression for HTML Tags.
    End With

    sOut = RegEx.Replace(sInput, "")
    StripHTML = Replace(Replace(Replace(sOut, "&nbsp;", vbCrLf, 1, -1), "&quot;", "'", 1, -1), "\\", "<\\", 1, -1)
    Set RegEx = Nothing
End Function

score -1 · Accepted Answer

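I found a very simple solution. Because of system restrictions and shared-drive permissions, I currently run an Access database and use Excel forms to update the system. When I pull data from Access, I use: PlainText( YourStringHere ). This strips out all the HTML parts and leaves only the text.

Hope this works.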
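For anyone who wants to try this, a minimal sketch of calling it from VBA inside Access follows. `PlainText` is a method of the Access `Application` object, intended for the HTML subset that Rich Text fields store, so how well it copes with arbitrary web-page HTML is an assumption you should test against your own data. The procedure name `DemoPlainText` and the sample string are made up for illustration.

```vba
Public Sub DemoPlainText()
    ' Hypothetical sample input; replace it with your own HTML string.
    Dim html As String
    html = "<div><b>Hello</b> world</div>"

    ' Application.PlainText is only available when the code runs inside Access.
    Debug.Print Application.PlainText(html)
End Sub
```

The same function can also be used as an expression in an Access query, which fits the case above where the data is pulled from Access before it reaches Excel.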
