c# - In-memory full-text search

Question

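I have a social feature structured like a blog: posts and comments.

Posts have a field called Body, and so do comments. Posts and comments are stored in SharePoint lists, so direct SQL full-text queries are not available.

If someone types "power outage restoration efficiency in November", I really don't know how to correctly return a list of posts based on the content of each post and its attached comments.

The good news is that I will never need to search more than 50-100 posts at a time. Knowing that, the simplest way to solve this is to load the posts and comments into memory and search them in a loop.

Ideally, something like this would be the quickest solution: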

class Post
{
    public int Id;
    public string Body;
    public List<Comment> comments;
}
class Comment
{
    public int Id;
    public int ParentCommentId;
    public int PostId;
    public string Body;
}
public List<Post> allPosts;
public List<Comment> allComments;

public List<Post> postsToInclude (string SearchText)
{
    var returnList = new List<Post>();
    foreach(Post post in allPosts)
    {
        //check post.Body with bool isThisAMatch(SearchText, post.Body)
        //if post.Body is a good fit, returnList.add(post);
    }
    foreach(Comment comment in allComments)
    {
        //check comment.Body with bool isThisAMatch(SearchText, comment.Body)
        //if comment.Body is a good fit, returnList.add(post where post.Id == comment.PostId);
    }
    return returnList;
}

public bool isThisAMatch(string SearchText, string TextToSearch)
{
    //return yes or no if TextToSearch is a good match to SearchText
}

score 4 · Accepted Answer

This is not a trivial subject. Because a machine has no concept of "content", it is inherently difficult to retrieve articles that are about a certain topic. To make an educated guess about whether each article is relevant to your search terms, you have to use a proxy such as TF-IDF, which scores a document higher when it contains query terms that are rare in the rest of the collection.
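To make the TF-IDF idea concrete, here is a minimal sketch, assuming a naive tokenizer and one common smoothed IDF variant. The names `TfIdfSketch`, `Tokenize` and `Score` are made up for illustration and are not from any library; a real engine such as Lucene does considerably more (stemming, stop words, better ranking formulas).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdfSketch
{
    // Hypothetical tokenizer: lowercases, then splits on anything
    // that is not a letter or a digit.
    public static string[] Tokenize(string text) =>
        new string(text.ToLowerInvariant()
                       .Select(c => char.IsLetterOrDigit(c) ? c : ' ')
                       .ToArray())
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Returns one TF-IDF score per document, in the same order as `documents`.
    public static double[] Score(string query, IReadOnlyList<string> documents)
    {
        var tokenized = documents.Select(Tokenize).ToList();
        var queryTerms = Tokenize(query).Distinct().ToArray();
        var scores = new double[documents.Count];

        foreach (var term in queryTerms)
        {
            // Document frequency: how many documents contain the term at all.
            int df = tokenized.Count(tokens => tokens.Contains(term));
            if (df == 0) continue;

            // Smoothed inverse document frequency (one variant among several).
            double idf = Math.Log((1.0 + documents.Count) / (1.0 + df)) + 1.0;

            for (int i = 0; i < tokenized.Count; i++)
            {
                if (tokenized[i].Length == 0) continue;

                // Term frequency, normalized by document length.
                double tf = tokenized[i].Count(t => t == term)
                            / (double)tokenized[i].Length;
                scores[i] += tf * idf;
            }
        }
        return scores;
    }
}
```

Calling `TfIdfSketch.Score(searchText, bodies)` with the post or comment bodies gives a score per body; sorting by that score, highest first, yields a rough relevance ranking.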

Instead of implementing that yourself, I'd advise using an existing information retrieval library. There are a few really popular ones out there. Based on my own experience, I'd suggest taking a closer look at Apache Lucene. A look at their reference list shows their significance.

If you have never had anything to do with information retrieval, I promise a very steep learning curve. To ease into the whole area, I suggest you use Solr first. It runs pretty much "out of the box" and gives you a nice idea of what is possible. I had a breakthrough when I started to really look at the available filters and each step of the algorithm. After that I had a better understanding of what to change to get better results. Depending on the format of your content, the system might require serious tweaking.

I have spent a LOT of time with Lucene, Solr and some alternatives in my job. The results I got in the end were acceptable, but it was a difficult process. It took a lot of theory, testing and prototyping to get there.

score -1 · Answer

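Note: I have no experience doing this, but this is how I would approach the problem.

First, I would run the search in parallel. Doing so would greatly improve the performance of your search function.

Second, because you can enter multiple words, I would create a scoring system that scores each comment against the query. For example, a comment containing 2 of the query words would score higher than a comment containing only 1. Or, if a comment contains an exact match, it could get a very high score.

Either way, once all comments have been scored against the query in a parallel loop, show the user the top-scoring comments as their results. Also note that this works because the data set is small (50-100).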
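The approach described in this answer could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested solution: `Post` and `Comment` are trimmed copies of the classes in the question, while `SearchSketch`, `Score`, `Search` and the `ExactPhraseBonus` weight are hypothetical names and values. Matching is by plain case-insensitive substring, so a short word like "in" will also match inside "winter". With only 50-100 posts, the gain from `AsParallel()` is likely small and worth measuring before relying on it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Comment { public int Id; public int PostId; public string Body; }

public class Post
{
    public int Id;
    public string Body;
    public List<Comment> comments = new List<Comment>();
}

public static class SearchSketch
{
    // Arbitrary weight for a whole-phrase hit; an assumption to tune.
    const int ExactPhraseBonus = 10;

    // One point per distinct query word found in the text,
    // plus a bonus if the whole query appears verbatim.
    public static int Score(string searchText, string textToSearch)
    {
        if (string.IsNullOrWhiteSpace(searchText) || string.IsNullOrEmpty(textToSearch))
            return 0;

        var words = searchText
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Distinct(StringComparer.OrdinalIgnoreCase);

        int score = words.Count(w =>
            textToSearch.IndexOf(w, StringComparison.OrdinalIgnoreCase) >= 0);

        if (textToSearch.IndexOf(searchText.Trim(), StringComparison.OrdinalIgnoreCase) >= 0)
            score += ExactPhraseBonus;

        return score;
    }

    // Scores each post as its body score plus its best comment score,
    // evaluates posts in parallel, and returns matches best-first.
    public static List<Post> Search(string searchText, IEnumerable<Post> allPosts)
    {
        return allPosts
            .AsParallel()
            .Select(post => new
            {
                Post = post,
                Score = Score(searchText, post.Body)
                      + post.comments.Select(c => Score(searchText, c.Body))
                                     .DefaultIfEmpty(0)
                                     .Max()
            })
            .Where(x => x.Score > 0)
            .OrderByDescending(x => x.Score)
            .Select(x => x.Post)
            .ToList();
    }
}
```

Taking the best comment score rather than the sum keeps a post with many weakly matching comments from outranking a post with one strong match; summing is an equally plausible choice.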
