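A search procedure that uses full-text search (which means the matching is hard to reproduce outside the procedure) returns rows with the matched strings highlighted inline, for example: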

"i have been <em>match</em>ed"
"a <em>match</em> will happen in the word <em>match</em>"
"some random words including the word <em>match</em> here"

Now I need to take the first x characters of each string, but the HTML tags inside it are causing me trouble.

For example:

"i have been <em>mat</em>..." -> first 15 characters
"a <em>match</em> will happen in the word <em>m</em>..." -> first 33 characters
"some rando..." -> first 10 characters

I tried handling this with a few if/else branches, but I ended up with a big pile of spaghetti code.

Any tips?

2 Answers

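I suggest writing a simple parser with a few states: InText, InOpeningTag, and InClosingTag are the ones that come to mind.

Just iterate over the characters, work out whether you are InText, and count only those characters. Once you reach the limit, stop adding text, and if you are between an opening and a closing tag, just append the closing tag.

If you don't know what I'm talking about, take a look at the source code of the HTML Agility Pack (look for the Parse method).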
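
The state machine described above could be sketched roughly as follows. This is a minimal illustration, not code from the HTML Agility Pack: the names HtmlTruncator and Truncate are hypothetical, and it assumes simple, well-nested tags such as <em>, with no void or self-closing tags and no literal '<' inside the text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlTruncator
{
    // The three states suggested in the answer.
    private enum State { InText, InOpeningTag, InClosingTag }

    // Hypothetical helper: keeps at most `limit` visible characters,
    // copies tags through without counting them, and closes any tag
    // that is still open when the limit is reached.
    public static string Truncate(string html, int limit, string ellipsis = "...")
    {
        var output = new StringBuilder();
        var tagName = new StringBuilder();
        var openTags = new Stack<string>();
        var state = State.InText;
        int visible = 0;
        bool truncated = false;

        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (state == State.InText)
            {
                if (c == '<')
                {
                    // Peek one character ahead to tell "</em>" from "<em>".
                    bool closing = i + 1 < html.Length && html[i + 1] == '/';
                    state = closing ? State.InClosingTag : State.InOpeningTag;
                    tagName.Clear();
                    output.Append(c);
                }
                else if (visible == limit)
                {
                    // There is more text, but the budget is used up.
                    truncated = true;
                    break;
                }
                else
                {
                    output.Append(c);
                    visible++;
                }
            }
            else
            {
                // Tag characters are copied but never counted.
                output.Append(c);
                if (c == '>')
                {
                    if (state == State.InOpeningTag)
                        openTags.Push(tagName.ToString().Split(' ')[0]); // drop attributes
                    else if (openTags.Count > 0)
                        openTags.Pop();
                    state = State.InText;
                }
                else if (c != '/')
                {
                    tagName.Append(c);
                }
            }
        }

        // Close whatever was still open at the cut, innermost first.
        while (openTags.Count > 0)
            output.Append("</").Append(openTags.Pop()).Append('>');
        if (truncated)
            output.Append(ellipsis);
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Truncate("i have been <em>match</em>ed", 15));
        Console.WriteLine(Truncate("a <em>match</em> will happen in the word <em>match</em>", 33));
        Console.WriteLine(Truncate("some random words including the word <em>match</em> here", 10));
    }
}
```

With the three rows from the question this prints the three expected strings, including the trailing "...". One known limitation of the sketch: if an opening tag sits exactly at the cut point, the result contains an empty element such as <em></em>.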

answered 2012-06-22T19:22:48.353

This should do what you want, based only on the <em> tags.

using System;
using System.Collections.Generic;
using System.Text;

namespace Test
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var dbResults = GetMatches();
            var firstLine = HtmlSubstring(dbResults[0], 0, 15);
            Console.WriteLine(firstLine);
            var secondLine = HtmlSubstring(dbResults[1], 0, 33);
            Console.WriteLine(secondLine);
            var thirdLine = HtmlSubstring(dbResults[2], 0, 10);
            Console.WriteLine(thirdLine);

            Console.Read();
        }

        private static List<string> GetMatches()
        {
            return new List<string>
            {
                "i have been <em>match</em>ed",
                "a <em>match</em> will happen in the word <em>match</em>",
                "some random words including the word <em>match</em> here"
            };
        }

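        private static string HtmlSubstring(string mainString, int start, int length = int.MaxValue)
        {
            const string OpenTag = "<em>";
            const string CloseTag = "</em>";

            // Cut the visible text only. Clamp start and length so that the
            // int.MaxValue default (or any oversized value) does not throw.
            string plain = mainString.Replace(CloseTag, "").Replace(OpenTag, "");
            start = Math.Max(0, Math.Min(start, plain.Length));
            length = Math.Max(0, Math.Min(length, plain.Length - start));
            var substringResult = new StringBuilder(plain.Substring(start, length));

            int tagChars = 0;  // tag characters seen so far in mainString
            int inserted = 0;  // tag characters re-inserted so far into the result
            int openIndex = mainString.IndexOf(OpenTag, StringComparison.Ordinal);
            while (openIndex >= 0)  // >= 0 so a match at index 0 is not skipped
            {
                int closeIndex = mainString.IndexOf(CloseTag, openIndex, StringComparison.Ordinal);
                if (closeIndex < 0) break;  // unbalanced tag: stop highlighting

                // Convert tagged positions into positions in the plain text.
                int plainStart = openIndex - tagChars;
                int plainEnd = closeIndex - tagChars - OpenTag.Length;
                tagChars += OpenTag.Length + CloseTag.Length;

                // Clip the highlighted range to the requested window.
                int from = Math.Max(plainStart, start) - start;
                int to = Math.Min(plainEnd, start + length) - start;
                if (from < to)
                {
                    substringResult.Insert(from + inserted, OpenTag);
                    inserted += OpenTag.Length;
                    substringResult.Insert(to + inserted, CloseTag);
                    inserted += CloseTag.Length;
                }

                openIndex = mainString.IndexOf(OpenTag, closeIndex, StringComparison.Ordinal);
            }

            return substringResult.ToString();
        }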
    }
}
answered 2012-06-22T21:32:06.400