3

We are using dtSearch to index some external web pages. It grabs the entire HTML content of the page.

When a page shows up in a list of search results on our web site, we want to show an excerpt of the content with the search term highlighted in bold as part of the result (in other words, the same thing everyone is used to seeing under each Google result).

What is the best way to accomplish this? Do you have to parse and remove the HTML tags? If so, how do you do that effectively?

We have a proof of concept working that shows the excerpt with the search terms highlighted, but we either have to render the tags or try to strip them out (as we have tried), which leaves us with some garbage information that's not really content.

The fact that we are using dtSearch is incidental, I think. If an alternative search tool is capable of doing this type of thing on our behalf, we'd consider using that instead.

We are basically trying to decide if we need to author our own regular expressions to accomplish this or if it's a well-known problem that's already solved by some library or tool.

We happen to be using .NET/C#. I don't think it's central to the problem but might impact what libraries we can use.

3 Answers

3

Google uses meta description tags where present, and will also use rich snippet information where available.

Beyond that, you may need to perform custom parsing, but don't use regular expressions to perform the whole task. Rather, use a proper parser (such as HTML Agility Pack) and find tags which make semantic sense (perhaps headings, paragraphs, etc.). Once you have located such elements, you might use a regex to determine which of the matched tags would give you the best snippet, where to truncate it, etc.
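As a rough illustration of the parsing step, here is a minimal sketch using HTML Agility Pack (NuGet package `HtmlAgilityPack`). The class name `TextBlockExtractor`, the choice of tags (`h1` to `h3` and `p`), and the 40-character minimum are assumptions made for the example, not anything prescribed by the library.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class TextBlockExtractor
{
    // Returns the text of headings and paragraphs, in document order,
    // skipping blocks shorter than minLength (an arbitrary threshold).
    public static List<string> Extract(string html, int minLength = 40)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var blocks = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var candidates = doc.DocumentNode.SelectNodes("//h1|//h2|//h3|//p");
        if (candidates == null) return blocks;

        foreach (var node in candidates)
        {
            // InnerText drops inner tags such as <strong>;
            // DeEntitize turns entities such as &amp; back into characters.
            string text = HtmlEntity.DeEntitize(node.InnerText);
            text = Regex.Replace(text, @"\s+", " ").Trim();
            if (text.Length >= minLength) blocks.Add(text);
        }
        return blocks;
    }
}
```

Because the text is taken only from content elements, script bodies, navigation markup, and similar noise never reach the snippet stage, which is where regex-only stripping tends to produce garbage.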

A simple flow:

  1. parse the document and locate all elements with a significant amount of textual content.
  2. strip inner tags (e.g. strong inside of a p)
  3. prefer elements near the beginning of the document.
  4. run an algorithm (possibly using regex(es), and possibly culture-aware) to try to extract sentences.
  5. strongly prefer sentences with words matching one or more search terms (based on your stated requirements).
  6. prefer sentences with few noise words.
  7. (advanced) prefer sentences which have words occurring regularly in the document.
  8. (advanced) combine multiple, potentially useful sentences into a single description snippet.

It's not an exact science, even for Google.
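Steps 4 to 6 of the flow above could be sketched roughly as follows. This assumes the text blocks have already been extracted by a real HTML parser, and everything specific here is made up for illustration: the `SnippetBuilder` name, the tiny noise-word list, the naive sentence split, and the scoring weights.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class SnippetBuilder
{
    // Hypothetical, deliberately tiny noise-word list.
    private static readonly HashSet<string> NoiseWords = new HashSet<string>(
        new[] { "the", "a", "an", "of", "and", "to", "in", "is", "it" },
        StringComparer.OrdinalIgnoreCase);

    // Picks the best-scoring sentence and returns it HTML-encoded,
    // with matched search terms wrapped in <b>...</b>.
    public static string BestSnippet(IEnumerable<string> textBlocks, IEnumerable<string> searchTerms)
    {
        var terms = new HashSet<string>(
            searchTerms.Where(t => !string.IsNullOrWhiteSpace(t)),
            StringComparer.OrdinalIgnoreCase);
        string best = null;
        double bestScore = double.MinValue;
        int position = 0;

        foreach (string block in textBlocks)
        {
            // Naive sentence split: break after '.', '!' or '?' followed by whitespace.
            foreach (string raw in Regex.Split(block, @"(?<=[.!?])\s+"))
            {
                string sentence = raw.Trim();
                string[] words = Regex.Matches(sentence, @"\w+")
                    .Cast<Match>().Select(m => m.Value).ToArray();
                if (words.Length == 0) continue;

                int hits = words.Count(w => terms.Contains(w));
                double noiseRatio = words.Count(w => NoiseWords.Contains(w)) / (double)words.Length;

                // Term matches dominate; noise words and distance from the
                // start of the document cost a little (arbitrary weights).
                double score = hits * 10.0 - noiseRatio - position * 0.01;
                if (score > bestScore) { bestScore = score; best = sentence; }
                position++;
            }
        }

        if (best == null) return string.Empty;
        if (terms.Count == 0) return WebUtility.HtmlEncode(best);

        // Split around the matched terms (the capture group keeps them in the
        // result), encode every piece, and wrap only the matches in <b>.
        string pattern = @"\b(" + string.Join("|", terms.Select(Regex.Escape)) + @")\b";
        var parts = Regex.Split(best, pattern, RegexOptions.IgnoreCase)
            .Select(p => terms.Contains(p)
                ? "<b>" + WebUtility.HtmlEncode(p) + "</b>"
                : WebUtility.HtmlEncode(p));
        return string.Concat(parts);
    }
}
```

For example, `SnippetBuilder.BestSnippet(new[] { "Welcome to our site. We sell fast cars & bikes." }, new[] { "fast" })` returns `We sell <b>fast</b> cars &amp; bikes.`. Encoding each piece separately means the only markup in the result is the highlight tags added here.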

answered 2013-02-01T18:26:46.060
0

Here is what I use to generate the search summary for an item with dtSearch (using the text version of the document stored in the cache):

The key point for your problem is rj.OutputFormat = dtSearch.Engine.OutputFormats.itUTF8; which overrides the default HTML format. You should get a cleaned-up summary with bold highlighting.

Hope this helps.

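// Requires: using System; using System.Text.RegularExpressions; and a reference to dtSearch.Engine
public string GetSumary(String ItemEncoded)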
{
    using (var res = new dtSearch.Engine.SearchResults())
    {
        res.UrlDecodeItem(ItemEncoded);
        res.GetNthDoc(0);

        using (var rj = res.NewSearchReportJob())
        {
            // next line assumes you store the text version of your documents in the cache; remove it if not
            rj.Flags |= dtSearch.Engine.ReportFlags.dtsReportGetFromCache;
            rj.Flags |= dtSearch.Engine.ReportFlags.dtsReportByWordExact;
            rj.Flags |= dtSearch.Engine.ReportFlags.dtsReportLimitContiguousContext;
            rj.OutputToString = true;
            rj.OutputFormat = dtSearch.Engine.OutputFormats.itUTF8;
            rj.OutputStringMaxSize = 2000;
            rj.MaxContextBlocks = 1;
            rj.WordsOfContext = 12;

            rj.Header = "";
            rj.FileHeader = "";
            rj.ContextHeader = "";
            rj.BeforeHit = "<b>";
            rj.AfterHit = "</b>";
            rj.ContextFooter = "";
            rj.ContextSeparator = " ... ";
            rj.FileFooter = "";
            rj.Footer = "";

            rj.SelectItems(0, 0);
            rj.Execute();

            // some final clean-up: replace whitespace runs and repeated punctuation
            return new Regex(@"[\t\r\n]+|[\.;\,\*]{2,}").Replace(rj.OutputString, "&nbsp; &nbsp;");
        }
    }
}
answered 2013-02-05T12:50:55.207
0
Use the dtSearch ISearchStatusHandler interface and its OnFound method. OnFound is called each time a document is found.

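// Requires: using dtSearch.Engine;
public class HomeController : Controller, ISearchStatusHandler
{
    public void Search()
    {
        SearchJob sj = new SearchJob();
        sj.Request = "fast";
        sj.IndexesToSearch.Add(@"D:\R & D\Indexpath\aaa");
        // Combine flags with |; using & on two distinct bit flags leaves none set.
        sj.SearchFlags = SearchFlags.dtsSearchSynonyms |
                         SearchFlags.dtsSearchWordNetRelated;
        // Register this object so the callbacks below are invoked during the search.
        sj.StatusHandler = this;
        sj.Execute();
        SearchResults result = sj.Results;
    }

    // Called each time a document is found.
    public void OnFound(SearchResultsItem item)
    {
        int DocId = item.DocId;
        string FileName = item.Filename;
    }

    public void OnSearchingFile(string filename)
    {
        // Intentionally empty: throwing here would break the search once the handler is registered.
    }

    public void OnSearchingIndex(string index)
    {
        // Intentionally empty for the same reason.
    }
}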

This is a more organized and comprehensive way to work with the results of a search as they are obtained. The SearchJob object has a StatusHandler property that can be set to an object whose methods are called as the search progresses. This lets you process files as they are found, and running the search in its own thread keeps the UI responsive instead of hogging the UI thread. For a synchronous search: SJob1.StatusHandler = this; SJob1.Execute();
For a search in a separate thread: SJob1.StatusHandler = this; SJob1.ExecuteInThread(); The status handler calls OnFound each time a document is found, so if no documents are found, OnFound is never executed and adds no extra load.

answered 2016-07-15T17:10:25.983