We are using dtSearch to index some external web pages. It grabs the entire HTML content of the page.
When a page shows up in a list of search results on our web site, we want to show an excerpt of its content with the user's search term highlighted in bold (in other words, the same thing everyone is used to seeing under each Google result).
What is the best way to accomplish this? Do you have to parse and remove the HTML tags? If so, how do you do that effectively?
We have a proof of concept working that shows the excerpt with the search terms highlighted, but we have to either render the tags as-is or try to strip them out. We have tried stripping them, and we end up with garbage in the excerpt that's not really content.
The fact that we are using dtSearch is incidental, I think. If an alternative search tool is capable of doing this type of thing on our behalf, we'd consider using that instead.
We are basically trying to decide if we need to author our own regular expressions to accomplish this or if it's a well-known problem that's already solved by some library or tool.
We happen to be using .NET/C#. I don't think it's central to the problem, but it might affect which libraries we can use.
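To make the goal concrete, here is a rough sketch of the behaviour we're after. It is written in Python using only the standard library, purely because that keeps it short; it is not our actual code, and we would want the equivalent in .NET. The names `visible_text` and `excerpt`, the `radius` parameter, and the lists of tags to skip or treat as block-level are all made up for illustration. The idea is to use a real HTML parser instead of regular expressions to pull out the visible text, skip the contents of `script` and `style` elements, and then cut a window around the first match and bold the term.

```python
import html
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of non-content elements."""

    SKIP = {"script", "style", "noscript", "template"}  # assumed list
    BLOCK = {"p", "div", "br", "li", "tr", "td", "th",
             "h1", "h2", "h3", "h4", "h5", "h6"}        # assumed list

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")  # keep words in adjacent blocks apart

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def visible_text(page_html):
    """Return the page's visible text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(page_html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def excerpt(page_html, term, radius=60):
    """Return an HTML-safe snippet around the first match, with the term in <b>."""
    text = visible_text(page_html)
    match = re.search(re.escape(term), text, re.IGNORECASE) if term else None
    if match is None:
        return html.escape(text[: 2 * radius])
    start = max(0, match.start() - radius)
    end = min(len(text), match.end() + radius)
    # Split on the raw text first, then escape each piece, so that page text
    # cannot inject markup and the term is never matched inside an entity.
    pieces = re.split("(" + re.escape(term) + ")", text[start:end],
                      flags=re.IGNORECASE)
    body = "".join(
        "<b>" + html.escape(piece) + "</b>" if i % 2 else html.escape(piece)
        for i, piece in enumerate(pieces)
    )
    return ("..." if start > 0 else "") + body + ("..." if end < len(text) else "")
```

Calling `excerpt(page_html, "search")` on an indexed page would give us the string to show under that result. This sketch only handles a single literal term and does nothing about navigation menus, footers, or other boilerplate that is technically visible text, which is a large part of the "garbage" we are seeing, so we would still like to know whether a library or search tool already solves this properly.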