html - 将字符串修剪为忽略 HTML 的长度

Question

这个问题是一个具有挑战性的问题。我们的应用程序允许用户在主页上发布新闻。该新闻是通过允许 HTML 的富文本编辑器输入的。在主页上，我们只想显示新闻项目的截断摘要。

例如，这里是我们显示的全文，包括 HTML

为了在办公室和厨房里腾出更多空间，我把所有随机的杯子都拿出来放在午餐室的桌子上。除非您对 1992 年的 Cheyenne Courier 马克杯或 1997 年的 BC Tel Advanced Communications 马克杯的所有权有强烈的感觉，否则它们将被放入一个盒子并捐赠给比我们更需要马克杯的办公室。

我们希望将新闻项修剪为 250 个字符，但不包括 HTML。

我们目前用于修剪的方法包括 HTML，这会导致一些 HTML 重的新闻帖子被大大截断。

例如，如果上面的示例包含大量 HTML，它可能看起来像这样：

为了在办公室、厨房里腾出更多空间，我拉...

这不是我们想要的。

有没有人有办法标记 HTML 标签以保持字符串中的位置，对字符串执行长度检查和/或修剪，并在其旧位置恢复字符串内的 HTML？

score 10 · Accepted Answer

从帖子的第一个字符开始，遍历每个字符。每次你越过一个角色，增加一个计数器。当你找到一个 '<' 字符时，停止增加计数器直到你碰到一个 '>' 字符。当计数器到达 250 时，您的位置是您实际想要切断的位置。

请注意，当 HTML 标记在截止之前打开但未关闭时，您将不得不处理另一个问题。

score 2 · Accepted Answer

根据 2-state 有限机器的建议，我刚刚为此目的开发了一个简单的 HTML 解析器，用 Java 编写：

http://pastebin.com/jCRqiwNH

这里有一个测试用例：

http://pastebin.com/37gCS4tV

这里是Java代码：

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class HtmlShortener {

    private static final String TAGS_TO_SKIP = "br,hr,img,link";
    private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
    private static final int STATUS_READY = 0;

        private int cutPoint = -1;
    private String htmlString = "";

    final List<String> tags = new LinkedList<String>();

    StringBuilder sb = new StringBuilder("");
    StringBuilder tagSb = new StringBuilder("");

    int charCount = 0;
    int status = STATUS_READY;

    public HtmlShortener(String htmlString, int cutPoint){
        this.cutPoint = cutPoint;
        this.htmlString = htmlString;
    }

    public String cut(){

        // reset 
        tags.clear();
        sb = new StringBuilder("");
        tagSb = new StringBuilder("");
        charCount = 0;
        status = STATUS_READY;

        String tag = "";

        if (cutPoint < 0){
            return htmlString;
        }

        if (null != htmlString){

            if (cutPoint == 0){
                return "";
            }

            for (int i = 0; i < htmlString.length(); i++){

                String strC = htmlString.substring(i, i+1);


                if (strC.equals("<")){

                    // new tag or tag closure

                    // previous tag reset
                    tagSb = new StringBuilder("");
                    tag = "";

                    // find tag type and name
                    for (int k = i; k < htmlString.length(); k++){

                        String tagC = htmlString.substring(k, k+1);
                        tagSb.append(tagC);

                        if (tagC.equals(">")){
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")){

                                // closure
                                if (!isToSkip(tag)){
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove((tags.size() - 1));
                                }

                            } else {

                                // new tag
                                sb.append(tagSb.toString());

                                if (!isToSkip(tag)){
                                    tags.add(tag);  
                                }

                            }

                            i = k;
                            break;
                        }

                    }

                } else {

                    sb.append(strC);
                    charCount++;

                }

                // cut check
                if (charCount >= cutPoint){

                    // close previously open tags
                    Collections.reverse(tags);
                    for (String t : tags){
                        sb.append("</").append(t).append(">");
                    }
                    break;
                } 

            }

            return sb.toString();

        } else {
            return null;
        }

    }

    private boolean isToSkip(String tag) {

        if (tag.startsWith("/")){
            tag = tag.substring(1, tag.length());
        }

        for (String tagToSkip : tagsToSkip){
            if (tagToSkip.equals(tag)){
                return true;
            }
        }

        return false;
    }

    private String getTag(String tagString) {

        if (tagString.contains(" ")){
            // tag with attributes
            return tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(" "));
        } else {
            // simple tag
            return tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));
        }


    }

}

score 0 · Accepted Answer

这是我在 C# 中提出的实现：

public static string TrimToLength(string input, int length)
{
  if (string.IsNullOrEmpty(input))
    return string.Empty;

  if (input.Length <= length)
    return input;

  bool inTag = false;
  int targetLength = 0;

  for (int i = 0; i < input.Length; i++)
  {
    char c = input[i];

    if (c == '>')
    {
      inTag = false;
      continue;
    }

    if (c == '<')
    {
      inTag = true;
      continue;
    }

    if (inTag || char.IsWhiteSpace(c))
    {
      continue;
    }

    targetLength++;

    if (targetLength == length)
    {
      return ConvertToXhtml(input.Substring(0, i + 1));
    }
  }

  return input;
}

还有一些我通过 TDD 使用的单元测试：

[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
  Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
  Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                  "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                  "<br/>" +
                  "In an attempt to make a bit more space in the office, kitchen, I";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}

[Test]
public void Html_TrimWellFormedHtml()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
             "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
             "<br/>" +
             "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
             "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
             "</div>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                    "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                    "<br/>" +
                    "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}

[Test]
public void Html_TrimMalformedHtml()
{
  string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                         "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                         "<br/>" +
                         "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
                         "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
              "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
              "<br/>" +
              "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}

score 0 · Accepted Answer

如果我正确理解了这个问题，您希望保留 HTML 格式，但您不想将其计为您保留的字符串长度的一部分。

您可以使用实现简单有限状态机的代码来完成此操作。

2 种状态：InTag、OutOfTag
InTag：
- 如果>遇到字符，则转到 OutOfTag - 遇到
任何其他字符时转到自身
OutOfTag：
- 如果<遇到字符，则转到 InTag - 遇到
任何其他字符时转到自身

您的起始状态将是 OutOfTag。

您通过一次处理 1 个字符来实现有限状态机。每个角色的处理都将您带到一个新的状态。

当您通过有限状态机运行文本时，您还希望保留一个输出缓冲区和到目前为止遇到的长度变量（这样您就知道何时停止）。

每次您处于 OutOfTag 状态并处理另一个字符时，增加您的 Length 变量。如果您有空格字符，您可以选择不增加此变量。
当您没有更多字符或您具有＃1中提到的所需长度时，您将结束算法。
在您的输出缓冲区中，包括您遇到的字符，直到 #1 中提到的长度。
保留一堆未封闭的标签。当您达到长度时，为堆栈中的每个元素添加一个结束标记。在运行算法时，您可以通过保留 current_tag 变量来知道何时遇到标签。此 current_tag 变量在您进入 InTag 状态时启动，并在您进入 OutOfTag 状态时结束（或在 InTag 状态下遇到白色字符时）。如果您有开始标签，则将其放入堆栈中。如果您有结束标签，则将其从堆栈中弹出。

score 0 · Accepted Answer

我知道这比发布日期晚了很多，但我有一个类似的问题，这就是我最终解决它的方式。我担心的是正则表达式与通过数组交互的速度。

此外，如果您在 html 标记之前有一个空格，并且在此之后不能解决该问题

private string HtmlTrimmer(string input, int len)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;
    if (input.Length <= len)
        return input;

    // this is necissary because regex "^"  applies to the start of the string, not where you tell it to start from
    string inputCopy;
    string tag;

    string result = "";
    int strLen = 0;
    int strMarker = 0;
    int inputLength = input.Length;     

    Stack stack = new Stack(10);
    Regex text = new Regex("^[^<&]+");                
    Regex singleUseTag = new Regex("^<[^>]*?/>");            
    Regex specChar = new Regex("^&[^;]*?;");
    Regex htmlTag = new Regex("^<.*?>");

    while (strLen < len)
    {
        inputCopy = input.Substring(strMarker);
        //If the marker is at the end of the string OR 
        //the sum of the remaining characters and those analyzed is less then the maxlength
        if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
            break;

        //Match regular text
        result += text.Match(inputCopy,0,len-strLen);
        strLen += result.Length - strMarker;
        strMarker = result.Length;

        inputCopy = input.Substring(strMarker);
        if (singleUseTag.IsMatch(inputCopy))
            result += singleUseTag.Match(inputCopy);
        else if (specChar.IsMatch(inputCopy))
        {
            //think of &nbsp; as 1 character instead of 5
            result += specChar.Match(inputCopy);
            ++strLen;
        }
        else if (htmlTag.IsMatch(inputCopy))
        {
            tag = htmlTag.Match(inputCopy).ToString();
            //This only works if this is valid Markup...
            if(tag[1]=='/')         //Closing tag
                stack.Pop();
            else                    //not a closing tag
                stack.Push(tag);
            result += tag;
        }
        else    //Bad syntax
            result += input[strMarker];

        strMarker = result.Length;
    }

    while (stack.Count > 0)
    {
        tag = stack.Pop().ToString();
        result += tag.Insert(1, "/");
    }
    if (strLen == len)
        result += "...";
    return result;
}

score 0 · Accepted Answer

你可以试试下面的 npm 包

修剪-html

它切断了 html 标签内的足够文本，保存原始 html 限制，达到限制后删除 html 标签并关闭打开的标签。

score -1 · Accepted Answer

最快的方法不是使用jQuery的text()方法吗？

例如：

<ul>
  <li>One</li>
  <li>Two</li>
  <li>Three</li>
</ul>

var text = $('ul').text();

将在变量中给出值 OneTwoThree text。这将允许您在不包含 HTML 的情况下获得文本的实际长度。

html - 将字符串修剪为忽略 HTML 的长度

7 回答 7

Related

Reference