mysql - Django: Using Annotate instead of Distinct()

Question

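I have read that the distinct() API call can sometimes cause performance problems. I would like to try rewriting a query in the ORM so that it avoids distinct (at least to analyze the difference).

My understanding is that values() performs a Group By behind the scenes. However, when I test the two approaches, the number of objects differs depending on whether I use distinct() or values()/annotate().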

   zip_codes = Location.objects.values('zip_code').annotate(zip_count=Count('zip_code')).exclude(zip_code=None).count()

vs.

  zip_codes = Location.objects.values_list('zip_code', flat=True).exclude(zip_code=None).distinct()

Any ideas about what is going wrong here?

Thanks!

score 2 · Accepted Answer

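I just did a quick check of your queries against a database I have with a similar query. The counts were identical, so I am not sure what in your data is causing the problem.

I would also be highly suspicious of the premise, though. DISTINCT is indeed a CPU-intensive query. But so is COUNT(*), and your values()/annotate() query will first run a count aggregate with a GROUP BY and then run a COUNT over the result. I would put my money on the single DISTINCT call being faster (I would also check with whatever database backend you are using). All of this has very little to do with Django's ORM and a great deal to do with your database backend.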
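To make the comparison concrete, here is a minimal sketch using Python's built-in sqlite3 module instead of Django. The `location` table, the sample rows, and both SQL strings are assumptions written by hand to approximate the two ORM queries from the question; the SQL Django actually emits depends on the Django version and the backend, and can be inspected with `print(queryset.query)`.

```python
import sqlite3

# Hypothetical stand-in for the Location model's table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (id INTEGER PRIMARY KEY, zip_code TEXT)")
rows = [("10001",), ("10001",), ("94105",), (None,), ("60601",), ("94105",)]
conn.executemany("INSERT INTO location (zip_code) VALUES (?)", rows)

# Rough shape of:
#   values('zip_code').annotate(zip_count=Count('zip_code'))
#       .exclude(zip_code=None).count()
# A grouped aggregate wrapped in an outer COUNT.
group_by_sql = """
    SELECT COUNT(*) FROM (
        SELECT zip_code, COUNT(zip_code) AS zip_count
        FROM location
        WHERE zip_code IS NOT NULL
        GROUP BY zip_code
    ) AS sub
"""

# Rough shape of:
#   values_list('zip_code', flat=True).exclude(zip_code=None).distinct()
distinct_sql = """
    SELECT DISTINCT zip_code
    FROM location
    WHERE zip_code IS NOT NULL
"""

group_by_count = conn.execute(group_by_sql).fetchone()[0]
distinct_values = [row[0] for row in conn.execute(distinct_sql)]
distinct_count = len(distinct_values)

print(group_by_count, distinct_count)
```

On this toy data both forms count the same set of zip codes. If the numbers differ on real data, one thing worth checking is ordering: the Django documentation notes that fields used in order_by() are added to the selected columns, which can change what distinct() considers a duplicate.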

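Also think about this. The distinct-based query is an order of magnitude clearer about what it accomplishes than the annotate-based query. Do you have evidence that DISTINCT is slow in your case, or better yet, that it is currently a bottleneck? If not, you are in premature-optimization territory and should reconsider your path.

Premature optimization.

Optimization matters only when it matters. When it matters, it matters a lot, but until you know that it matters, don't waste a lot of time doing it. Even if you know it matters, you need to know where it matters. Without performance data, you won't know what to optimize, and you may well optimize the wrong thing.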

The result will be obscure, hard to write, hard to debug, and hard to maintain code that doesn't solve your problem. Thus it has the dual disadvantage of (a) increasing software development and software maintenance costs, and (b) having no performance effect at all.

In other words, write your software clearly, and then when you find a problem, trace it to the source and fix it. Anything you do before that is counterproductive. Spend your time worrying about which indexes are going to matter on your db, and where to use select_related. Those are 10000% more effective than what you are worrying about here (unless you are counting zip codes all the time, in which case let me introduce you to caching).
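As a rough illustration of gathering performance data before optimizing, the sketch below times both query shapes with `timeit` against an in-memory SQLite database. Everything here is an assumption for demonstration: the `location` table, the synthetic rows, and the hand-written SQL. Timings from SQLite say nothing about MySQL, so the same measurement would need to be repeated against the real backend and data.

```python
import sqlite3
import timeit

# Hypothetical table with 20,000 rows spread over 500 distinct zip codes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (id INTEGER PRIMARY KEY, zip_code TEXT)")
conn.executemany(
    "INSERT INTO location (zip_code) VALUES (?)",
    [(f"{i % 500:05d}",) for i in range(20000)],
)


def count_distinct():
    """Count unique zip codes via SELECT DISTINCT."""
    sql = """
        SELECT COUNT(*) FROM (
            SELECT DISTINCT zip_code FROM location
            WHERE zip_code IS NOT NULL
        ) AS sub
    """
    return conn.execute(sql).fetchone()[0]


def count_group_by():
    """Count unique zip codes via a grouped aggregate."""
    sql = """
        SELECT COUNT(*) FROM (
            SELECT zip_code, COUNT(zip_code) AS zip_count FROM location
            WHERE zip_code IS NOT NULL
            GROUP BY zip_code
        ) AS sub
    """
    return conn.execute(sql).fetchone()[0]


# Run each query 20 times and report total wall-clock time.
distinct_seconds = timeit.timeit(count_distinct, number=20)
group_by_seconds = timeit.timeit(count_group_by, number=20)

print(f"DISTINCT: {distinct_seconds:.4f}s  GROUP BY: {group_by_seconds:.4f}s")
```

Only if a measurement like this, taken on the production backend, shows the query to be a real bottleneck is it worth trading the clearer distinct() form for something else.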
