html - VB.NET function for taking a percentage of HTML

Question

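I have some articles saved in a database. On some pages I want to show a certain percentage of each article, depending on a setting: for example, 80% of the article.

The problem is that the content is HTML, not plain text, so taking a percentage of the string length breaks the formatting. Is there a function to which I can pass the string and a new length (smaller than the original length) and which returns the truncated HTML without disturbing the formatting? Here is what I have tried: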

Private Function HtmlSubstring(html As String, maxlength As Integer) As String
    'initialize regular expressions
    Dim htmltag As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
    Dim emptytags As String = "<(\w+)((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?></\1>"

    'match all html start and end tags, otherwise get each character one by one..
    Dim expression As Regex = New Regex(String.Format("({0})|(.?)", htmltag))
    Dim matches As MatchCollection = expression.Matches(html)

    Dim i As Integer = 0
    Dim content As New StringBuilder()
    For Each match As Match In matches
        If match.Value.Length = 1 AndAlso i < maxlength Then
            content.Append(match.Value)
            i += 1
        ElseIf match.Value.Length > 1 Then
            'the match contains a tag
            content.Append(match.Value)
        End If
    Next

    Return Regex.Replace(content.ToString(), emptytags, String.Empty)
End Function

However, it does not always work.

score 1 · Accepted Answer

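I'm quite sure there is no built-in .NET method that does what you want. However, consider the following approach:

Your HTML page is probably structured, i.e., it has paragraphs, headings, and so on: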

<h1>...</h1>
<p>...</p>
<h2>...</h2>
<p>...<more tags>...</more tags></p>
<h2>...</h2>
<p>...</p>
...

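What you could do is:

Use an HTML parser (the HTML Agility Pack is often mentioned in this context) and parse your HTML into a data structure.

Take the first 80% of the top-level tags. For example, if the root node of your HTML content has ten child nodes, take the first eight: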

<h1>...</h1>
<p>...</p>
<p>...</p>
<h2>...</h2>
<p>
   ...
   <more tags>
      ...
   </more tags>
   ...
</p>
<p>...</p>
<p>...<more tags>...</more tags>...</p>
<p>...</p>
---------------
<h2>...</h2>
<p>...</p>

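If your article is more or less evenly spaced (i.e., long and short paragraphs are evenly distributed over the course of the article), this will give you approximately 80% of the text without destroying any HTML formatting. As an added bonus, you won't split the text mid-line or mid-paragraph.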
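
The two steps above could be sketched roughly as follows. This is only an illustration, not tested against your data: it assumes the HtmlAgilityPack NuGet package is referenced, that the stored article is an HTML fragment (no surrounding html/body elements), and the module and function names (TopLevelTruncation, TakeTopLevelPercent) are made up for this example.

```vbnet
Imports System.Linq
Imports System.Text

Module TopLevelTruncation
    ' Hypothetical helper: keeps roughly the first `percent` percent of the
    ' top-level element nodes of an HTML fragment.
    Public Function TakeTopLevelPercent(html As String, percent As Integer) As String
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' Count only element nodes (h1, p, ...). Whitespace and loose text
        ' between the top-level blocks is ignored, and therefore dropped.
        Dim blocks = doc.DocumentNode.ChildNodes.
            Where(Function(n) n.NodeType = HtmlAgilityPack.HtmlNodeType.Element).
            ToList()

        ' Integer division rounds down, e.g. 80% of 10 blocks = 8 blocks.
        Dim keep As Integer = (blocks.Count * percent) \ 100

        Dim result As New StringBuilder()
        For Each node In blocks.Take(keep)
            ' OuterHtml includes the node's own tags and all nested markup,
            ' so every kept block stays well-formed.
            result.Append(node.OuterHtml)
        Next
        Return result.ToString()
    End Function
End Module
```

Calling TakeTopLevelPercent(articleHtml, 80) on an article with ten top-level blocks would return the markup of the first eight. Because the cut is made by block count rather than by text length, the result is only approximately 80% of the text.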

score 0 · Accepted Answer

In the end, the following worked well for me:

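Imports System.Text
Imports System.Text.RegularExpressions

Private Function HtmlSubstring(html As String, maxlength As Integer) As String
    ' Matches a single HTML start or end tag, including its attributes.
    Const htmltag As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
    ' Match whole tags where possible, otherwise one character at a time.
    ' Singleline lets "." match line breaks too, so they are not dropped from the output.
    Dim expression As New Regex(String.Format("({0})|(.?)", htmltag), RegexOptions.Singleline)
    Dim matches As MatchCollection = expression.Matches(html)
    Dim i As Integer = 0                ' number of text characters copied so far
    Dim isEndingSet As Boolean = False  ' True once the "...." marker has been appended
    Dim content As New StringBuilder()
    For Each match As Match In matches
        If match.Value.Length = 1 AndAlso i < maxlength Then
            ' A single text character: copy it while still under the limit.
            content.Append(match.Value)
            i += 1
        ElseIf match.Value.Length > 1 Then
            ' The match is a tag. Tags are always copied so the markup stays balanced,
            ' except <br> tags after the cut-off, which would only add empty lines.
            Dim tag As String = match.Value.ToLowerInvariant()
            If isEndingSet AndAlso (tag = "<br />" OrElse tag = "<br>") Then
                Continue For
            End If
            content.Append(match.Value)
        End If
        ' Append the ending marker once, as soon as the limit is reached.
        If i = maxlength AndAlso Not isEndingSet Then
            content.Append("....")
            isEndingSet = True
        End If
    Next

    Return content.ToString()
End Function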
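
Since the question asks for a percentage rather than a fixed length, one possible way to derive maxlength for HtmlSubstring is to count the characters outside of tags and scale that count. The sketch below is only an assumption about how the two could be combined: PercentLength and PercentToMaxLength are hypothetical names, and the simple tag-stripping pattern is an approximation (it treats comments and entities differently from a real parser), so the result is a rough budget rather than an exact one.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PercentLength
    ' Hypothetical helper: converts a percentage into a character budget
    ' that can be passed to HtmlSubstring as maxlength.
    Public Function PercentToMaxLength(html As String, percent As Integer) As Integer
        ' Crude tag removal: good enough for counting text characters,
        ' not suitable for sanitizing HTML.
        Dim textOnly As String = Regex.Replace(html, "<[^>]+>", String.Empty)
        ' Integer division rounds down.
        Return (textOnly.Length * percent) \ 100
    End Function

    Sub Main()
        Dim article As String = "<p>Hello <b>world</b>, this is a test.</p>"
        Dim maxLength As Integer = PercentToMaxLength(article, 80)
        Console.WriteLine(maxLength)
        ' With HtmlSubstring accessible from this code, the preview would then be:
        ' Dim preview As String = HtmlSubstring(article, maxLength)
    End Sub
End Module
```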
