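I have some articles stored in a database. On certain pages I want to display only a percentage of each article, depending on a setting, for example 80% of the article.

The problem is that the articles are HTML, not plain text, so if I simply take a percentage of the string length the formatting gets broken. Is there a function that takes a string and a new length (smaller than the original length) and returns the truncated HTML without breaking the formatting? This is what I have tried: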

' Requires: Imports System.Text and Imports System.Text.RegularExpressions
Private Function HtmlSubstring(html As String, maxlength As Integer) As String
    ' Regular expressions for a single HTML tag and for an empty element such as <p></p>
    Dim htmltag As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
    Dim emptytags As String = "<(\w+)((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?></\1>"

    ' Match every HTML start or end tag; anything else is matched one character at a time
    Dim expression As Regex = New Regex(String.Format("({0})|(.?)", htmltag))
    Dim matches As MatchCollection = expression.Matches(html)

    Dim i As Integer = 0
    Dim content As New StringBuilder()
    For Each match As Match In matches
        If match.Value.Length = 1 AndAlso i < maxlength Then
            ' A single text character: keep it while still under the limit
            content.Append(match.Value)
            i += 1
        ElseIf match.Value.Length > 1 Then
            ' The match contains a tag: always keep it so the markup stays balanced
            content.Append(match.Value)
        End If
    Next

    ' Remove elements that the truncation left empty
    Return Regex.Replace(content.ToString(), emptytags, String.Empty)
End Function

But it does not always work.

2 Answers

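I'm fairly sure there is no built-in .NET method that does what you are asking for. However, consider the following approach:

Your HTML is probably structured, i.e. it has paragraphs, headings and so on: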

<h1>...</h1>
<p>...</p>
<h2>...</h2>
<p>...<more tags>...</more tags></p>
<h2>...</h2>
<p>...</p>
...

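What you can do is:

  1. Use an HTML parser (the HTML Agility Pack is often mentioned in this context) to parse your HTML into a data structure.
  2. Take the first 80% of the top-level tags. For example, if the root node of your HTML content has ten child nodes, take the first eight: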

    <h1>...</h1>
    <p>...</p>
    <p>...</p>
    <h2>...</h2>
    <p>
       ...
       <more tags>
          ...
       </more tags>
       ...
    </p>
    <p>...</p>
    <p>...<more tags>...</more tags>...</p>
    <p>...</p>
    ---------------
    <h2>...</h2>
    <p>...</p>
    

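If your articles are roughly evenly spaced (i.e., long and short paragraphs are distributed evenly throughout the article), this gives you approximately 80% of the text without breaking any HTML formatting. As an added bonus, you never split the text in the middle of a line or a paragraph.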
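
The two steps above could be sketched roughly as follows. This is only a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and that the stored article is an HTML fragment (no surrounding <html> or <body> element), so that its headings and paragraphs are direct children of the document node. The module name TopLevelTruncation and the function name TakeTopLevelPercent are made up for illustration.

```vb
Imports System
Imports System.Linq
Imports System.Text

Module TopLevelTruncation
    ' Hypothetical helper: keeps the first `percent` percent of the top-level elements.
    Function TakeTopLevelPercent(html As String, percent As Integer) As String
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' Count only element nodes; whitespace and comments between tags are ignored.
        ' Assumption: any bare top-level text outside an element can be dropped.
        Dim elements = doc.DocumentNode.ChildNodes.
            Where(Function(n) n.NodeType = HtmlAgilityPack.HtmlNodeType.Element).
            ToList()

        ' Integer division rounds down: 10 nodes at 80 percent keeps 8.
        Dim keep As Integer = (elements.Count * percent) \ 100

        Dim result As New StringBuilder()
        For Each node In elements.Take(keep)
            ' OuterHtml includes the node's own tags and all nested markup.
            result.AppendLine(node.OuterHtml)
        Next
        Return result.ToString()
    End Function
End Module
```

Calling TakeTopLevelPercent(articleHtml, 80) on the ten-node example above would keep everything before the dashed line. Because whole nodes are kept or dropped, the amount of text returned is only as close to 80% as the paragraph lengths allow.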

answered 2013-03-01T07:02:24.450

In the end, the following worked well for me:

' Requires: Imports System.Text and Imports System.Text.RegularExpressions
Private Function HtmlSubstring(html As String, maxlength As Integer) As String
    ' Regular expression for a single HTML start or end tag
    Const htmltag As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
    ' Match every HTML start or end tag; anything else is matched one character at a time
    Dim expression As Regex = New Regex(String.Format("({0})|(.?)", htmltag))
    Dim matches As MatchCollection = expression.Matches(html)
    Dim i As Integer = 0
    Dim isEndingSet As Boolean = False
    Dim content As StringBuilder = New StringBuilder()
    For Each match As Match In matches
        If match.Value.Length = 1 AndAlso i < maxlength Then
            ' A single text character: keep it while still under the limit
            content.Append(match.Value)
            i += 1
        ElseIf match.Value.Length > 1 Then
            ' The match contains a tag; skip line breaks once the text has been cut off
            If (isEndingSet AndAlso (match.Value.ToLower() = "<br />" OrElse match.Value.ToLower() = "<br>")) Then
                Continue For
            End If
            content.Append(match.Value)
        End If
        ' Append the ending marker exactly once, as soon as the limit is reached
        If (i = maxlength AndAlso Not isEndingSet) Then
            content.Append("....")
            isEndingSet = True
        End If
    Next

    Return content.ToString()
End Function
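
Since maxlength counts text characters rather than a percentage, a small helper can turn a percentage setting into that limit. The following is only a rough sketch: the name PercentToMaxLength is made up for illustration, and the simple tag-stripping pattern is an assumption that does not match the htmltag expression above exactly, so the result is approximate.

```vb
Imports System.Text.RegularExpressions

Module HtmlSubstringUsage
    ' Hypothetical helper: converts a percentage into a character limit for HtmlSubstring.
    Function PercentToMaxLength(html As String, percent As Integer) As Integer
        ' Strip anything that looks like a tag, plus line feeds (HtmlSubstring does not count them).
        ' Assumption: this simple pattern is good enough; it ignores comments and scripts.
        Dim textOnly As String = Regex.Replace(html, "<[^>]+>|\n", String.Empty)
        ' Integer division rounds down.
        Return (textOnly.Length * percent) \ 100
    End Function
End Module
```

With that in place, a call such as HtmlSubstring(articleHtml, PercentToMaxLength(articleHtml, 80)) would show roughly the first 80% of the article text. Because HtmlSubstring is declared Private, the call has to be made from within the same class.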
answered 2013-03-08T07:22:53.063