python - I'd like help optimizing code that removes matching models from a large queryset

Question

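I have some Django code that iterates over a queryset of models and removes any that match. The queryset has grown large, and this actually runs as a periodic task, so speed is becoming a problem.

Here's the code, if anyone is willing to help optimize it!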

# For the code below, "articles" are just Django model instances

def get_unique_articles(all_articles, newest_articles):
    # all_articles: a really large list of articles
    # newest_articles: some large list of new articles
    unique_articles = []
    for new_article in newest_articles:
        failed = False
        for old_article in all_articles:
            # is_similar is just a method which checks if two strings are
            # identical to a certain degree
            if (is_similar(new_article.blurb, old_article.blurb, 0.9)
                    and is_similar(new_article.title, old_article.title, 0.92)):
                failed = True
                break
        if not failed:
            unique_articles.append(new_article)
    return unique_articles

Thanks!

score 1 · Accepted Answer

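One way to approach this might be to use Haystack to maintain a Solr index of your content, search Solr for matches for each new article, and then feed only the top few matches per article to the is_similar function. Not having to scan the entire dataset to find similar articles should make a considerable difference in performance.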
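To make the idea concrete without a running Solr instance, here is a rough, framework-free sketch of the same two-stage pattern: a cheap index lookup picks a handful of candidates, and the expensive is_similar check runs only on those. Everything in it is an assumption for illustration: the word-overlap index stands in for Solr/Haystack, articles are plain dicts rather than models, is_similar is approximated with difflib, and the names build_index, top_candidates and unique_articles are made up. Note that a candidate filter like this can miss a duplicate that shares no words with the new article, so k and the indexing scheme would need tuning against real data.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def is_similar(a, b, threshold):
    """Stand-in for the question's helper (assumed): ratio-based fuzzy match."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def words_of(article):
    """Lowercased set of words in an article's title and blurb."""
    return set((article["title"] + " " + article["blurb"]).lower().split())


def build_index(articles):
    """Toy inverted index: word -> positions of the articles containing it."""
    index = defaultdict(set)
    for pos, article in enumerate(articles):
        for word in words_of(article):
            index[word].add(pos)
    return index


def top_candidates(index, article, k=5):
    """Positions of the k indexed articles sharing the most words with article."""
    counts = defaultdict(int)
    for word in words_of(article):
        for pos in index.get(word, ()):
            counts[pos] += 1
    return sorted(counts, key=counts.get, reverse=True)[:k]


def unique_articles(all_articles, newest_articles, k=5):
    """Keep new articles that have no fuzzy match among their top-k candidates."""
    index = build_index(all_articles)
    unique = []
    for new in newest_articles:
        candidates = (all_articles[pos] for pos in top_candidates(index, new, k))
        # Same criterion as the question: blurb and title must both match.
        if not any(
            is_similar(new["blurb"], old["blurb"], 0.9)
            and is_similar(new["title"], old["title"], 0.92)
            for old in candidates
        ):
            unique.append(new)
    return unique
```

With this shape, each new article costs at most k is_similar comparisons plus the index lookup, instead of one comparison per existing article.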

score 1 · Accepted Answer

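There doesn't seem to be any efficient way to implement a "fuzzy DISTINCT" at the SQL level, so I'd suggest going the precomputation route. I'm trying to guess your business logic from a small code snippet, so this may be off base, but it sounds like you only need to know whether each new article has an older dupe (as defined by the is_similar function). In that case, one workable approach might be to add an is_duplicate field to the Article model and recompute it in a background job whenever an article is saved. For example (using Celery):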

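from celery import shared_task
from django.db.models import signals

# Article is your model; is_similar is the helper from the question.

@shared_task
def recompute_similarity(article_id):
    article = Article.objects.get(id=article_id)
    is_duplicate = False
    # iterator() streams rows instead of caching the whole table in memory
    for other in Article.objects.exclude(id=article_id).iterator():
        # Note: the question requires both fields to match and passes
        # thresholds (0.9 / 0.92); adjust this condition to your needs
        if is_similar(article.title, other.title) or is_similar(article.blurb, other.blurb):
            is_duplicate = True
            break
    # update() does not send post_save, so this write does not re-queue the task
    Article.objects.filter(id=article_id).update(is_duplicate=is_duplicate)

def on_article_save(sender, instance, created, raw, **kwargs):
    # raw is True when the model is loaded from a fixture; skip those saves
    if not raw:
        recompute_similarity.delay(instance.id)

signals.post_save.connect(on_article_save, sender=Article)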

Your original routine would then reduce to just

Article.objects.filter(is_duplicate=False, ...recency condition)
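If it helps to see the trade-off in isolation, here is a small framework-free sketch of the same precompute-on-write idea. It is only an illustration under assumptions: ArticleStore is a hypothetical in-memory stand-in for the database, is_similar is approximated with difflib, the duplicate test uses the question's criterion (blurb and title both matching at 0.9 / 0.92), and each article is compared only against articles saved before it.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


def is_similar(a, b, threshold):
    """Stand-in for the question's helper (assumed): ratio-based fuzzy match."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


@dataclass
class Article:
    title: str
    blurb: str
    is_duplicate: bool = False


@dataclass
class ArticleStore:
    """Hypothetical in-memory stand-in for the Article table."""
    articles: list = field(default_factory=list)

    def save(self, article):
        # Pay the comparison cost once, at write time, against older articles.
        article.is_duplicate = any(
            is_similar(article.blurb, old.blurb, 0.9)
            and is_similar(article.title, old.title, 0.92)
            for old in self.articles
        )
        self.articles.append(article)

    def unique(self):
        # The read path is now a plain filter on the precomputed flag.
        return [a for a in self.articles if not a.is_duplicate]
```

The periodic task then only reads the flag; the quadratic comparison work is spread across individual saves instead of being repeated on every run.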
