
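I have a Postgres database with the TimescaleDB extension.

My primary index is a timestamp, and I want to select the latest row.

If I happen to know that the latest row occurred after a certain time, I can use a query such as: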

query = 'select * from prices where time > %(dt)s'

Here I specify a datetime and execute the query using psycopg2:

import datetime

import psycopg2

# 2018-01-10 11:15:00
dt = datetime.datetime(2018, 1, 10, 11, 15, 0)

with psycopg2.connect(**params) as conn:
    cur = conn.cursor()
    # start timing
    beg = datetime.datetime.now()
    # execute query
    cur.execute(query, {'dt':dt})
    rows = cur.fetchall()
    # stop timing
    end = datetime.datetime.now()

print('took {} ms'.format((end-beg).total_seconds() * 1e3))

Timing output:

took 2.296 ms

However, if I don't know what time to pass to the query above, I can use a query such as:

query = 'select * from prices order by time desc limit 1'

I execute the query in a similar way:

with psycopg2.connect(**params) as conn:
    cur = conn.cursor()
    # start timing
    beg = datetime.datetime.now()
    # execute query
    cur.execute(query)
    rows = cur.fetchall()
    # stop timing
    end = datetime.datetime.now()

print('took {} ms'.format((end-beg).total_seconds() * 1e3))

Timing output:

took 19.173 ms

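So this is more than 8 times slower.

I'm no SQL expert, but I would have thought the query planner would figure out that "limit 1" combined with "order by primary index" amounts to an O(1) operation.

Question:

Is there a more efficient way to select the last row in my table?

In case it is useful, here is the description of my table: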

# \d+ prices

                                           Table "public.prices"
 Column |            Type             | Collation | Nullable | Default | Storage | Stats target | Description 
--------+-----------------------------+-----------+----------+---------+---------+--------------+-------------
 time   | timestamp without time zone |           | not null |         | plain   |              | 
 AAPL   | double precision            |           |          |         | plain   |              | 
 GOOG   | double precision            |           |          |         | plain   |              | 
 MSFT   | double precision            |           |          |         | plain   |              | 
Indexes:
    "prices_time_idx" btree ("time" DESC)
Child tables: _timescaledb_internal._hyper_12_100_chunk,
              _timescaledb_internal._hyper_12_101_chunk,
              _timescaledb_internal._hyper_12_102_chunk,
              ...

4 Answers


An efficient way to get the last/first record in TimescaleDB:

First record:

SELECT <COLUMN>, time FROM <TABLE_NAME> ORDER BY time ASC LIMIT 1 ;

Last record:

SELECT <COLUMN>, time FROM <TABLE_NAME> ORDER BY time DESC LIMIT 1 ;

This question has already been answered, but I believe this may be useful to people who land here. Using first() and last() in TimescaleDB takes longer.

Answered 2018-09-16T14:21:15.727

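Your first query can exclude all chunks but the last one, whereas your second query has to look at every chunk, because there is no information that would help the planner exclude chunks. So it is not an O(1) operation but an O(n) operation, where n is the number of chunks in that hypertable.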
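The difference can be illustrated with a toy model in plain Python. This is only a sketch of the reasoning above, not of how TimescaleDB's planner is actually implemented: the `Chunk` class, both functions, and the integer timestamps are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for one hypertable chunk."""
    start: int   # inclusive lower bound of the chunk's time range
    end: int     # exclusive upper bound of the chunk's time range
    times: list  # timestamps stored in the chunk, sorted ascending


def latest_without_constraint(chunks):
    """Model of ORDER BY time DESC LIMIT 1 with no WHERE clause.

    Nothing rules a chunk out, so every chunk is probed once.
    """
    probed = 0
    best = None
    for chunk in chunks:
        probed += 1
        if chunk.times and (best is None or chunk.times[-1] > best):
            best = chunk.times[-1]
    return best, probed


def latest_with_constraint(chunks, lower_bound):
    """Model of the same query with WHERE time > lower_bound.

    Chunks whose range ends at or before the bound are skipped
    using their range metadata alone.
    """
    probed = 0
    best = None
    for chunk in chunks:
        if chunk.end <= lower_bound:
            continue  # excluded without looking at the chunk's contents
        probed += 1
        if chunk.times and chunk.times[-1] > lower_bound:
            if best is None or chunk.times[-1] > best:
                best = chunk.times[-1]
    return best, probed


# 100 chunks covering 10 time units each, fully populated.
chunks = [
    Chunk(start=d * 10, end=d * 10 + 10, times=list(range(d * 10, d * 10 + 10)))
    for d in range(100)
]
print(latest_without_constraint(chunks))
print(latest_with_constraint(chunks, lower_bound=995))
```

Both calls find the same newest timestamp; the second element of each result is the number of chunks that had to be probed, which grows with the number of chunks only in the unconstrained case.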

You can give the planner that information by writing your query in the following form:

select * from prices WHERE time > now() - interval '1day' order by time desc limit 1

You may have to choose a different interval depending on your chunk time interval.
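One way to avoid hard-coding the interval is to pass it as a parameter and widen it when the window turns out to be empty. The following is a minimal sketch under a few assumptions: `fetch_latest`, its defaults, and the doubling strategy are hypothetical choices for illustration, and `cur` is expected to be a psycopg2 cursor (psycopg2 adapts `datetime.timedelta` values to PostgreSQL intervals).

```python
import datetime

# Same query as above, with the lookback window as a parameter.
LATEST_QUERY = (
    "select * from prices "
    "where time > now() - %(lookback)s "
    "order by time desc limit 1"
)


def fetch_latest(cur,
                 lookback=datetime.timedelta(days=1),
                 max_lookback=datetime.timedelta(days=64)):
    """Return the newest row, or None if nothing is found within max_lookback.

    Starts with a narrow window so that few chunks are considered,
    and doubles the window each time it comes back empty.
    """
    while True:
        cur.execute(LATEST_QUERY, {"lookback": lookback})
        row = cur.fetchone()
        if row is not None or lookback >= max_lookback:
            return row
        lookback = min(lookback * 2, max_lookback)
```

If data arrives regularly, the first attempt should normally succeed, so the wider and more expensive windows are only tried after a gap in the data.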

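Starting with TimescaleDB 1.2, this is an O(1) operation if an entry can be found in the most recent chunk, and the explicit time constraint in the WHERE clause is no longer needed if you order by time and have a LIMIT.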
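To decide which form of the query to use, the installed extension version can be read from the `pg_extension` catalog. This is a small sketch; the helper names `installed_version` and `needs_time_constraint` are hypothetical, and the 1.2 threshold is simply the one stated above.

```python
import re

# pg_extension is the PostgreSQL catalog that lists installed extensions.
VERSION_QUERY = "select extversion from pg_extension where extname = 'timescaledb'"


def installed_version(cur):
    """Return the TimescaleDB version string, or None if it is not installed."""
    cur.execute(VERSION_QUERY)
    row = cur.fetchone()
    return row[0] if row else None


def needs_time_constraint(version):
    """True if the version predates 1.2, so the WHERE time > ... hint is still needed."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: {!r}".format(version))
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (1, 2)
```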

Answered 2018-07-30T05:20:22.867

I tried to solve this problem in several ways: using last(), and creating indexes to fetch the last items faster. In the end, I created another table that stores the first and the last item inserted into the hypertable, keyed by the WHERE condition, which in my case is a relationship.

  • The database writer updates this table as well when it is inserting entries to the hypertable

  • I get first and last item with a simple BTree lookup - no need to go to hypertable at all

Here is my SQLAlchemy code:

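import datetime

import sqlalchemy as sa
from sqlalchemy import orm, text
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session

# Base (the declarative base) and Pair (the trading pair model) are defined elsewhere in the application.


class PairState(Base):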
    """Cache the timespan endpoints for intervals we are generating with hypertable.

    Getting the first / last row (timestamp) from hypertable is very expensive:
    https://stackoverflow.com/questions/51575004/timescaledb-efficiently-select-last-row

    Here data is denormalised per trading pair, and being updated when data is written to the database.
    Save some resources by not using true NULL values.
    """

    __tablename__ = "pair_state"

    # This table has 1-to-1 relationship with Pair
    pair_id = sa.Column(sa.ForeignKey("pair.id"), nullable=False, primary_key=True, unique=True)
    pair = orm.relationship(Pair,
                        backref=orm.backref("pair_state",
                                        lazy="dynamic",
                                        cascade="all, delete-orphan",
                                        single_parent=True, ), )

    # First raw event in data stream
    first_event_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    # Last raw event in data stream
    last_event_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    # The last hypertable entry added
    last_interval_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    @staticmethod
    def create_first_event_if_not_exist(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Sets the first event value if not exist yet."""
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, first_event_at=ts).
            on_conflict_do_nothing()
        )

    @staticmethod
    def update_last_event(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Replaces the column last_event_at for a named pair."""
        # Based on the original example of https://stackoverflow.com/a/49917004/315168
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, last_event_at=ts).
            on_conflict_do_update(constraint=PairState.__table__.primary_key, set_={"last_event_at": ts})
        )

    @staticmethod
    def update_last_interval(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Replaces the column last_interval_at for a named pair."""
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, last_interval_at=ts).
            on_conflict_do_update(constraint=PairState.__table__.primary_key, set_={"last_interval_at": ts})
        )
Answered 2021-06-14T11:17:54.550

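Create a table in which you store the latest timestamp after every insert, and use that timestamp in your query. For me this was the most efficient approach.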

SELECT <COLUMN> FROM <TABLE_NAME>, <TABLE_WITH_TIMESTAMPS> WHERE <TABLE_NAME>.time = <TABLE_WITH_TIMESTAMPS>.time;
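The pattern can be sketched end to end as follows. To keep the example runnable without a server, it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL; the table `prices_latest`, its column `latest_time`, and both helper functions are hypothetical names chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (time TEXT NOT NULL, aapl REAL);
    -- Single-row side table holding the newest timestamp seen so far.
    CREATE TABLE prices_latest (
        id INTEGER PRIMARY KEY CHECK (id = 1),
        latest_time TEXT NOT NULL
    );
    INSERT INTO prices_latest (id, latest_time) VALUES (1, '');
""")


def insert_price(conn, ts, aapl):
    """Insert a row and keep the side table in step, in one transaction."""
    with conn:
        conn.execute("INSERT INTO prices (time, aapl) VALUES (?, ?)", (ts, aapl))
        # MAX() keeps the stored value when rows arrive out of order;
        # in PostgreSQL the equivalent scalar function is GREATEST().
        conn.execute(
            "UPDATE prices_latest SET latest_time = MAX(latest_time, ?) WHERE id = 1",
            (ts,),
        )


def latest_price(conn):
    """Look up the newest row by equality on the stored timestamp."""
    return conn.execute(
        "SELECT p.time, p.aapl FROM prices AS p "
        "JOIN prices_latest AS l ON p.time = l.latest_time"
    ).fetchone()
```

The trade-off is one extra write per insert in exchange for a read that is an equality lookup on a known timestamp rather than a search for the maximum.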
Answered 2021-11-15T12:37:11.890