
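Question: I need to use HtmlAgilityPack to inspect some HTML elements and combine their tag names. Is it possible to take every tag from the parent down to the innermost child and replace them with a single span whose class is named, for example, "strikeUEmStrong"? The class name will vary depending on the HTML elements.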

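The order of the class names does matter, as I found out through trial and error. Beyond that, it just needs to pick up all the elements and combine them. There will most likely be multiple text nodes with different levels of formatting.

This will affect multiple paragraphs.

For example, if I have this HTML: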

<p>
<strike><u><em><strong>four styles</strong></em></u></strike></p>

How can I convert it to:

<p>
<span class="strikeUEmStrong">four styles</span></p>

There may also be markup like this:

<p>
    <strike><u><em><strong>four styles</strong></em></u></strike>&nbsp; <strike><u><em>three styles</em></u></strike></p>
<p>
    <em><strong>two styles</strong></em></p>

The output should look like this:

<p>
<span class="strikeUEmStrong">four styles</span>&nbsp; <span class="strikeUEm">three styles</span></p><p><span class="emStrong">two styles</span></p>

Prototype:

'Retrieve the class name of each format node
Function GetClassName(ByVal n As HtmlNode) As String
    Dim ret As String = String.Empty

    If (n.Name <> "#text") And (n.Name <> "p") Then
        ret = n.Name + " "
    End If

    'Get the next node
    For Each child As HtmlNode In n.ChildNodes
        ret &= GetClassName(child)
    Next

    Return ret
End Function

'Create a list of class names
Function GetClassNameList(ByVal classNameList As String) As List(Of String)
    Dim ret As New List(Of String)
    Dim classArr() As String = classNameList.Split(" ")

    For Each className As String In classArr
        ret.Add(className)
    Next

    Return ret
End Function

'Sort a list of class names and return a merged class string
Function GetSortedClassNameString(ByVal classList As List(Of String)) As String

    Dim sortedMergedClass As String = String.Empty

    classList.Sort()

    For Each className As String In classList
        sortedMergedClass &= className
    Next

    Return sortedMergedClass
End Function

'Let's point to the body node
Dim bodyNode As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//body")

'Let's create some generic nodes
Dim currPNode As HtmlNode

Dim formatNodes As HtmlNodeCollection

Dim text As String = String.Empty
Dim textSize As Integer = 0

'Make sure the editor has something in it
If editorText <> "" Then

   'Send the text from the editor to the body node
    If bodyNode IsNot Nothing Then
       bodyNode.InnerHtml = editorText
    End If

    Dim pNode = bodyNode.SelectNodes("//p")

    Dim span As HtmlNode = htmlDoc.CreateElement("span")
    Dim tmpBody As HtmlNode = htmlDoc.CreateElement("body")
    Dim textNode As HtmlNode = htmlDoc.CreateTextNode

    Dim pCount As Integer = bodyNode.SelectNodes("//body/p").Count - 1

    For childCountP As Integer = 0 To pCount

        Dim paragraph = HtmlNode.CreateNode(htmlDoc.CreateElement("p").WriteTo)

        'Which paragraph I am at.
        currPNode = pNode.Item(childCountP)

        'For this paragraph get me the collection of html nodes
        formatNodes = currPNode.ChildNodes

        'Count how many Format nodes we have in a paragraph
        Dim formatCount As Integer = currPNode.ChildNodes.Count - 1

       'Go through each node and examine the elements. 
       'Then look at the markup to create classes and then group them under one span
       For child As Integer = 0 To formatCount

           'Iterate through the formatNodes: strike, em, strong, etc.
           Dim currFormatNode = HtmlNode.CreateNode(formatNodes(child).WriteTo)

           'TODO: Handle nested images and links? How do we know what to rip out?

           'First check for format nodes
           'Note, we can't let it use everything because it will change nested elements as well. I.E. span within span.
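           'Only the listed format tags match; a #text node can never equal any of them,
           'so the separate #text check (and its mixed And/Or precedence) is not needed.
           If (currFormatNode.Name = "strike") OrElse (currFormatNode.Name = "em") _
               OrElse (currFormatNode.Name = "strong") OrElse (currFormatNode.Name = "u") OrElse (currFormatNode.Name = "sub") _
               OrElse (currFormatNode.Name = "sup") OrElse (currFormatNode.Name = "b") Then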

              'strip all tags, just take the inner text
              span.InnerHtml = currFormatNode.InnerText

              'Create a text node with text from the lowest node
              textNode = htmlDoc.CreateTextNode(span.InnerText)

              'Recursively go through the format nodes
              'Create a list from the string
              'Then sort the list and return a string
              'Appending the class to the span
               span.SetAttributeValue("class", GetSortedClassNameString(GetClassNameList(GetClassName(currFormatNode).Trim())))

              'Attach the span before the current format node
              currFormatNode.ParentNode.InsertBefore(span, currFormatNode)

             'Remove the formatted children leaving the above node
             currFormatNode.ParentNode.ChildNodes.Remove(currFormatNode)

             'We need to build a paragraph here
             paragraph.InnerHtml &= span.OuterHtml

             'Let's output something for debugging
             childNodesTxt.InnerText &= span.OuterHtml

             Else 'handle #text and other nodes separately
                  'Build the text node first, then append it (not the previous span) to the paragraph
                  textNode = htmlDoc.CreateTextNode(currFormatNode.InnerHtml)
                  paragraph.InnerHtml &= textNode.OuterHtml

                  'Let's output something for debugging
                  childNodesTxt.InnerText &= textNode.OuterHtml
             End If

        Next
        'End of formats

        'Start adding the new paragraphs to the body node
        tmpBody.AppendChild(paragraph)
     Next
     'End of paragraphs

    'Clean out body first and replace with new elements
    htmlDoc.DocumentNode.SelectSingleNode("//body").Remove()

    'Update our body
    htmlDoc.DocumentNode.SelectSingleNode("//html").AppendChild(tmpBody)

 End If

 htmlDoc.Save(Server.MapPath("html\editor.html"))
 End If

Output:

<span class="strikeuemstrong">four styles</span>

After I fixed the ordering problem, I finally got the correct output. Thanks for your help.


1 Answer


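This is not a straightforward problem. I'll describe how I would write an algorithm to do this and include some pseudo-code to help.

  1. I'd get my parent tags. I'm assuming you want to do this for all "p" tags.
  2. I'd iterate over my child tags, take each tag name, and append it to the class name.
  3. I'd recurse through the children until every nested tag name has been appended.

Pseudo-code. Please forgive any typos, as I'm typing this quickly.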

public string GetClassName(Node n)
{
    var ret = n.TagName;

    // Append the tag names of all descendants, in document order
    foreach (var child in n.ChildNodes)
    {
        ret += GetClassName(child);
    }

    return ret;
}


foreach (var p in paragraphs)
{
    foreach (var child in p.ChildNodes)
    {
        var span = new Span();
        span.InnerText = child.InnerText; // strip all tags, just take the inner text

        span.ClassName = GetClassName(child);

        // Note: replacing nodes inside a foreach will throw in C#, because the
        // collection is modified while it is being iterated. Use a for loop if
        // you do "active" replacement like in this pseudo-code.
        child.ReplaceWith(span);
    }
}
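
For what it's worth, here is a rough sketch of how the pseudo-code above might translate to real HtmlAgilityPack calls in C#. Treat it as a starting point under a few assumptions, not a finished solution: the names FormatFlattener, Flatten and FormatTags are made up for illustration, the list of format tags is copied from the prototype in the question, and the camelCase joining is my guess at how "strikeUEmStrong" is meant to be built. It also assumes each format element wraps a single chain of nested format tags.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class FormatFlattener
{
    // Hypothetical tag list, taken from the prototype in the question.
    static readonly HashSet<string> FormatTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "strike", "u", "em", "strong", "b", "sub", "sup" };

    // Collects element names in document order (parent first, then descendants).
    // Text nodes and comments are skipped so "#text" never ends up in the class.
    static IEnumerable<string> TagNames(HtmlNode node)
    {
        if (node.NodeType != HtmlNodeType.Element) yield break;
        yield return node.Name;
        foreach (var child in node.ChildNodes)
            foreach (var name in TagNames(child))
                yield return name;
    }

    // Joins the names camelCase style: first name as is, the rest capitalized.
    public static string GetClassName(HtmlNode node)
    {
        return string.Concat(TagNames(node).Select((name, i) =>
            i == 0 ? name : char.ToUpperInvariant(name[0]) + name.Substring(1)));
    }

    public static void Flatten(HtmlDocument doc)
    {
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null) return; // SelectNodes returns null when nothing matches

        foreach (var p in paragraphs)
        {
            // Iterate over a snapshot so replacing children does not
            // modify the collection that is being enumerated.
            foreach (var child in p.ChildNodes.ToList())
            {
                if (!FormatTags.Contains(child.Name)) continue; // leave text and other nodes alone

                var span = doc.CreateElement("span");
                span.SetAttributeValue("class", GetClassName(child));
                span.AppendChild(doc.CreateTextNode(child.InnerText)); // strip all tags, keep the text
                p.ReplaceChild(span, child);
            }
        }
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<p><strike><u><em><strong>four styles</strong></em></u></strike>&nbsp; " +
                     "<em><strong>two styles</strong></em></p>");
        Flatten(doc);
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Because the names are collected parent first, the class keeps the nesting order of the tags, which matters here since the question says the order of the class names is significant. If you sort the names afterwards, that order is lost.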

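I'm happy to revise my answer once I have more context. Take a look at the suggestion, and if you need me to refine it, leave a comment with more details. If not, I hope this gets you what you need :)

answered 2012-11-08T20:34:49.810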