2

我有一个很常见的(至少我认为)数据库结构:有新闻(News(id, source_id)),每个新闻都有一个来源(Source(id, url))。源通过 聚合到主题 ( Topic(id, title)) TopicSource(source_id, topic_id)。此外,还有用户 ( User(id, name)) 可以将新闻标记为已读NewsRead(news_id, user_id)。这是一个清晰的图表: 数据库图

我想为特定用户计算主题中的未读新闻。问题是News表很大(10^6 - 10^7 行)。幸运的是,我不需要知道确切的计数,在一个阈值返回这个阈值作为计数值之后停止计数是可以的。

在针对一个主题的这个答案之后,我提出了以下查询:

SELECT t.topic_id, count(1) as unread_count
FROM (
 SELECT 1, topic_id
 FROM news n
   JOIN topic_source t ON n.source_id = t.source_id
   -- join news_read to filter already read news
   LEFT JOIN news_read r
     ON (n.id = r.news_id AND r.user_id = 1)
 WHERE t.topic_id = 3 AND r.user_id IS NULL
 LIMIT 10 -- Threshold
) t GROUP BY t.topic_id;

查询计划 1)。此查询在测试数据库上大约需要 50 毫秒,这是可以接受的。

现在想为多个主题选择未读计数。我试图这样选择:

SELECT
  t.topic_id,
  (SELECT count(1)
   FROM (SELECT 1 FROM news n
          JOIN topic_source tt ON n.source_id = tt.source_id
          LEFT JOIN news_read r
            ON (n.id = r.news_id AND r.user_id = 1)
          WHERE tt.topic_id = t.topic_id AND r.user_id IS NULL
          LIMIT 10 -- Threshold
        ) t) AS unread_count
FROM topic_source t WHERE t.topic_id IN (1, 2) GROUP BY t.topic_id;

查询计划 2)。但是由于我不知道的原因,测试数据大约需要 1.5 秒,而单个查询的总和应该大约 0.2-0.3 秒。

我显然在这里遗漏了一些东西。第二个查询有错误吗?有没有更好(更快)的方法来选择未读新闻的计数?

附加信息:

桌子尺寸:

News - 10^6 - 10^7
User - 10^3
Source - 10^4
Topic - 10^3
TopicSource - 10^5
NewsRead - 10^6

UPD:查询计划清楚地表明我搞砸了第二个查询。任何线索表示赞赏。

UPD2:我使用横向连接尝试了这个查询,它应该简单地为每个查询运行第一个(最快的)查询topic_id

SELECT
  id,
  count(*)
FROM topic t
  LEFT JOIN LATERAL (
     SELECT ts.topic_id
     FROM news n
       LEFT JOIN news_read r
         ON (n.id = r.news_id AND r.user_id = 1)
       JOIN topic_source ts ON n.source_id = ts.source_id
     WHERE ts.topic_id = t.id AND r.user_id IS NULL
     LIMIT 10
) p ON TRUE
WHERE t.id IN (4, 10, 12, 16)
GROUP BY t.id;

查询计划 3)。但似乎 Pg planner 对此有不同的看法——它运行非常慢的 seq 扫描和哈希连接,而不是索引扫描和合并连接。

4

1 回答 1

0

经过一些基准测试后,我终于停止了简单的UNION ALL查询,它比我的数据横向连接快十倍:

SELECT
  p.topic_id,
  count(*)
FROM (
       SELECT *
       FROM (
              SELECT fs.topic_id
              FROM news n
                LEFT JOIN news_read r
                  ON (n.id = r.news_id AND r.user_id = 1)
                JOIN topic_source fs ON n.source_id = fs.source_id
              WHERE fs.topic_id = 4 AND r.user_id IS NULL
              LIMIT 100
            ) t1
       UNION ALL
       SELECT *
       FROM (
              SELECT fs.topic_id
              FROM news n
                LEFT JOIN news_read r
                  ON (n.id = r.news_id AND r.user_id = 1)
                JOIN topic_source fs ON n.source_id = fs.source_id
              WHERE fs.topic_id = 10 AND r.user_id IS NULL
              LIMIT 100
            ) t1
       UNION ALL
       SELECT *
       FROM (
              SELECT fs.topic_id
              FROM news n
                LEFT JOIN news_read r
                  ON (n.id = r.news_id AND r.user_id = 1)
                JOIN topic_source fs ON n.source_id = fs.source_id
              WHERE fs.topic_id = 12 AND r.user_id IS NULL
              LIMIT 100
            ) t1
       UNION ALL
       SELECT *
       FROM (
              SELECT fs.topic_id
              FROM news n
                LEFT JOIN news_read r
                  ON (n.id = r.news_id AND r.user_id = 1)
                JOIN topic_source fs ON n.source_id = fs.source_id
              WHERE fs.topic_id = 16 AND r.user_id IS NULL
              LIMIT 100
            ) t1
     ) p
GROUP BY p.topic_id;

执行计划

这里的直觉是,通过显式指定 topic_id 可以为 Pg 计划者提供足够的信息来构建有效的计划。

SQLAlchemy角度来看,它非常简单:

# topic_ids, user_id are defined elsewhere, e.g.
# topic_ids = [4, 10, 12, 16]
# user_id = 1
for topic_id in topic_ids:
    topic_query = (
        db.session.query(News.id, TopicSource.topic_id)
        .join(TopicSource, TopicSource.source_id == News.source_id)
        # LEFT JOIN NewsRead table to filter only unreads
        # (where News.user_id IS NULL)
        .outerjoin(NewsRead,
                   and_(NewsRead.news_id == News.id,
                        NewsRead.user_id == user_id))
        .filter(TopicSource.topic_id == topic_id,
                NewsRead.user_id.is_(None))
        .limit(100))
    topic_queries.append(topic_query)
# Unite queries with UNION ALL
union_query = topic_queries[0].union_all(*topic_queries[1:])
# Groups query by `topic_id` and count unreads
counts = (union_query
          # Using `with_entities(func.count())` to avoid
          # a subquery.  See link below for info:
          # https://gist.github.com/hest/8798884
          .with_entities(TopicSource.topic_id.label('topic_id'),
                         func.count().label('unread_count'))
          .group_by(TopicSource.topic_id))
result = counts.all()
于 2017-12-19T16:59:29.860 回答