python - Efficient SQLAlchemy subquery for the latest value

Question

The current value of an Entity's status attribute can be queried as the latest entry in an EntityHistory table for that entity, i.e.

Entities (id) <- EntityHistory (timestamp, entity_id, value)

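How can I write an efficient SQLAlchemy expression that eagerly loads the current value from the history table for all entities, without resulting in N+1 queries?

I tried writing a property for my model, but that generates one query per entity (N+1) when I iterate over the results. As far as I know, there is no way to solve this without a subquery, which still seems inefficient to me on the database side.

Example `EntityHistory` data: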

timestamp |entity_id| value
==========|=========|======
     15:00|        1|     x
     15:01|        1|     y
     15:02|        2|     x
     15:03|        2|     y
     15:04|        1|     z

So the current value of entity 1 is z, and the current value of entity 2 is y. The backing database is Postgres.
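To make the expected result concrete, here is a minimal pure-Python sketch of the "latest row per entity" rule applied to the sample rows above. The `latest_values` helper is a hypothetical name used only for illustration; it is not part of SQLAlchemy and is not how the database would compute this.

```python
# Rows copied from the sample table above: (timestamp, entity_id, value).
rows = [
    ("15:00", 1, "x"),
    ("15:01", 1, "y"),
    ("15:02", 2, "x"),
    ("15:03", 2, "y"),
    ("15:04", 1, "z"),
]


def latest_values(history):
    """Return {entity_id: value} taken from the newest row of each entity."""
    latest = {}
    for timestamp, entity_id, value in history:
        # Keep only the row with the greatest timestamp seen so far.
        if entity_id not in latest or timestamp > latest[entity_id][0]:
            latest[entity_id] = (timestamp, value)
    return {entity_id: value for entity_id, (_, value) in latest.items()}


current = latest_values(rows)
print(current)  # {1: 'z', 2: 'y'}
```

The goal of the question is to get this same mapping out of the database in a single query.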

score 5 · Accepted Answer

I think you could use a column_property to load the latest value as an attribute of an Entities instance, alongside the other column-mapped attributes:

from sqlalchemy import select
from sqlalchemy.orm import column_property

class Entities(Base):

    ...

    value = column_property(
        select([EntityHistory.value]).
        where(EntityHistory.entity_id == id).  # the id column from before
        order_by(EntityHistory.timestamp.desc()).
        limit(1).
        correlate_except(EntityHistory)
    )

The subquery could of course also be used directly in a query instead of a column_property.

query = session.query(
    Entities,
    session.query(EntityHistory.value).
        filter(EntityHistory.entity_id == Entities.id).
        order_by(EntityHistory.timestamp.desc()).
        limit(1).
        label('value')
)
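As a rough sketch of the kind of SQL such a correlated scalar subquery corresponds to, the snippet below runs a hand-written equivalent against an in-memory SQLite database using only the standard library. The table names `entities` and `entity_history`, and the use of SQLite in place of Postgres, are assumptions made for illustration; the exact SQL that SQLAlchemy emits will differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY);
    CREATE TABLE entity_history (
        timestamp TEXT NOT NULL,
        entity_id INTEGER NOT NULL REFERENCES entities (id),
        value TEXT
    );
    INSERT INTO entities (id) VALUES (1), (2);
    INSERT INTO entity_history (timestamp, entity_id, value) VALUES
        ('15:00', 1, 'x'), ('15:01', 1, 'y'), ('15:02', 2, 'x'),
        ('15:03', 2, 'y'), ('15:04', 1, 'z');
""")

# One scalar subquery per output row, correlated on the outer entity id.
latest_sql = """
    SELECT e.id,
           (SELECT h.value
              FROM entity_history AS h
             WHERE h.entity_id = e.id
             ORDER BY h.timestamp DESC
             LIMIT 1) AS value
      FROM entities AS e
     ORDER BY e.id
"""
result = conn.execute(latest_sql).fetchall()
print(result)  # [(1, 'z'), (2, 'y')]
```

All entities and their current values come back from one statement, which is what removes the N+1 round trips.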

Performance naturally depends on a suitable index being in place:

Index('entityhistory_entity_id_timestamp_idx',
      EntityHistory.entity_id,
      EntityHistory.timestamp.desc())
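One way to check that an index like this is actually picked up is to look at the query plan. The sketch below does so in SQLite via the standard library, purely as a stand-in: the table name `entity_history` is an assumption, and Postgres has its own EXPLAIN output and planner, so treat this as an illustration of the idea rather than a prediction of what Postgres will do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity_history (
        timestamp TEXT NOT NULL,
        entity_id INTEGER NOT NULL,
        value TEXT
    );
    -- Same column order as the Index(...) definition above.
    CREATE INDEX entityhistory_entity_id_timestamp_idx
        ON entity_history (entity_id, timestamp DESC);
""")

# Plan for the per-entity "latest row" lookup used by the subquery.
plan = conn.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT value
      FROM entity_history
     WHERE entity_id = ?
     ORDER BY timestamp DESC
     LIMIT 1
    """,
    (1,),
).fetchall()

# The last column of each plan row holds the human-readable detail.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

If the index is used, its name should appear in the printed plan; on Postgres the analogous check would be running EXPLAIN on the real query.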

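In a way this is still your dreaded N+1, since the query uses one subquery per row, but it is hidden inside a single round trip to the database.

On the other hand, if you do not need value as an attribute of Entities, in Postgresql you can join Entities with a DISTINCT ON ... ORDER BY query to fetch the latest values: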

# The same index from before speeds this up.
# Remember nullslast(), if timestamp can be NULL.
# (Comments are kept above the statement: a comment line placed after a
# backslash continuation would be a syntax error.)
values = session.query(EntityHistory.entity_id,
                       EntityHistory.value).\
    distinct(EntityHistory.entity_id).\
    order_by(EntityHistory.entity_id, EntityHistory.timestamp.desc()).\
    subquery()

query = session.query(Entities, values.c.value).\
    join(values, values.c.entity_id == Entities.id)

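In limited testing with dummy data, though, the subquery-as-output-column always beat the join by a notable margin when every entity had values. On the other hand, with millions of entities and a lot of missing history values, a LEFT JOIN was faster. I recommend testing on your own data to see which query suits it better. For random access to a single entity, assuming the index is in place, the correlated subquery is faster. For bulk fetches: test.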
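Apart from speed, the join variants also differ in which rows they return when some entities have no history. The sketch below illustrates that with the standard-library SQLite driver. SQLite has no DISTINCT ON, so a GROUP BY / MAX(timestamp) subquery is swapped in here to pick the latest row per entity; the table names and the extra entity 3 are assumptions for illustration, and this says nothing about relative performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY);
    CREATE TABLE entity_history (
        timestamp TEXT NOT NULL,
        entity_id INTEGER NOT NULL,
        value TEXT
    );
    -- Entity 3 is hypothetical and deliberately has no history rows.
    INSERT INTO entities (id) VALUES (1), (2), (3);
    INSERT INTO entity_history (timestamp, entity_id, value) VALUES
        ('15:00', 1, 'x'), ('15:01', 1, 'y'), ('15:02', 2, 'x'),
        ('15:03', 2, 'y'), ('15:04', 1, 'z');
""")

# Latest row per entity; stands in for the Postgres DISTINCT ON subquery.
latest = """
    SELECT h.entity_id, h.value
      FROM entity_history AS h
      JOIN (SELECT entity_id, MAX(timestamp) AS ts
              FROM entity_history
             GROUP BY entity_id) AS m
        ON m.entity_id = h.entity_id AND m.ts = h.timestamp
"""

inner_rows = conn.execute(
    f"SELECT e.id, v.value FROM entities AS e "
    f"JOIN ({latest}) AS v ON v.entity_id = e.id ORDER BY e.id"
).fetchall()
left_rows = conn.execute(
    f"SELECT e.id, v.value FROM entities AS e "
    f"LEFT JOIN ({latest}) AS v ON v.entity_id = e.id ORDER BY e.id"
).fetchall()

print(inner_rows)  # [(1, 'z'), (2, 'y')]
print(left_rows)   # [(1, 'z'), (2, 'y'), (3, None)]
```

With a plain join, entities without any history drop out of the result; a LEFT JOIN keeps them with a NULL value, as the correlated subquery also does.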
