
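I have some OHLCV data stored in TimescaleDB, with missing data in certain time ranges. The data needs to be resampled to a different time period (i.e. 1 day) and must contain contiguous, ordered time buckets.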

TimescaleDB provides the time_bucket_gapfill function for this. My current query is:

SELECT 
    time_bucket_gapfill(
        '1 day', 
        "timestamp",
        '2017-07-25 00:00', 
        '2018-01-01 00:00'
    ) as date,
    FIRST(open, "timestamp") as open,
    MAX(high) as high,
    MIN(low) as low,
    LAST(close, "timestamp") as close,
    SUM(volume) as volume
FROM ohlcv
WHERE "timestamp" > '2017-07-25'
GROUP BY date ORDER BY date ASC LIMIT 10

The result is:

date                    open        high        low         close       volume
2017-07-25 00:00:00+00                  
2017-07-26 00:00:00+00                  
2017-07-27 00:00:00+00  0.00992     0.010184    0.009679    0.010039    65553.5299999999
2017-07-28 00:00:00+00  0.00999     0.010059    0.009225    0.009248    43049.93
2017-07-29 00:00:00+00  
2017-07-30 00:00:00+00  0.009518    0.0098      0.009286    0.009457    40510.0599999999

...
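
To illustrate why the aggregate columns come back empty, here is a minimal pure-Python sketch of what bucket gap filling does conceptually. It is not TimescaleDB's implementation: the helper name `gapfill_daily` is hypothetical, and treating the finish bound as exclusive is an assumption of this sketch. Only the time axis is made contiguous; buckets with no source rows get `None` (NULL in SQL) for every other column.

```python
from datetime import date, timedelta


def gapfill_daily(rows, start, finish):
    """Return one (day, value) pair per day in [start, finish).

    Days that have no entry in `rows` get None, mirroring the NULL
    aggregates that appear in gap-filled buckets.
    """
    by_day = dict(rows)
    out = []
    day = start
    while day < finish:
        out.append((day, by_day.get(day)))
        day += timedelta(days=1)
    return out


# Sparse daily closes taken from the result shown above.
sparse = [
    (date(2017, 7, 27), 0.010039),
    (date(2017, 7, 28), 0.009248),
    (date(2017, 7, 30), 0.009457),
]
filled = gapfill_daily(sparse, date(2017, 7, 25), date(2017, 7, 31))
for day, close in filled:
    print(day, close)
```

The rows for 2017-07-25, 2017-07-26 and 2017-07-29 exist in the output but carry no value, which matches the blank cells in the result.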

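Problem: It looks like only the date column has been gap-filled. By modifying the SQL statement, is it possible to also gap-fill the open, high, low, close, and volume columns, so that we obtain the following result?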

date                    open        high        low         close       volume
2017-07-25 00:00:00+00  0           0           0           0           0               
2017-07-26 00:00:00+00  0           0           0           0           0               
2017-07-27 00:00:00+00  0.00992     0.010184    0.009679    0.010039    65553.5299999999
2017-07-28 00:00:00+00  0.00999     0.010059    0.009225    0.009248    43049.93
2017-07-29 00:00:00+00  0.009248    0.009248    0.009248    0.009248    0   
2017-07-30 00:00:00+00  0.009518    0.0098      0.009286    0.009457    40510.0599999999

...

Or is it advisable to perform this data imputation after receiving the query results, for example in Python/Nodejs?


Example of how it can be done using Python/pandas

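I would prefer to do this gap filling/imputation in TimescaleDB rather than in my Nodejs app, because doing it in Nodejs would be much slower, and I do not want to introduce Python into the application just for this processing.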

import pandas as pd

# Building the test dataset simulating missing values after time_bucket
data = [
    (pd.Timestamp('2020-01-01'), None, None, None, None, None),
    (pd.Timestamp('2020-01-02'), 100, 110, 90, 95, 3),
    (pd.Timestamp('2020-01-03'), None, None, None, None, None),
    (pd.Timestamp('2020-01-04'), 98, 150, 100, 100, 4),
]
df = pd.DataFrame(data, columns=['date', 'open' , 'high', 'low', 'close', 'volume']).set_index('date')

#              open   high    low  close  volume
# date                                          
# 2020-01-01    NaN    NaN    NaN    NaN     NaN
# 2020-01-02  100.0  110.0   90.0   95.0     3.0
# 2020-01-03    NaN    NaN    NaN    NaN     NaN
# 2020-01-04   98.0  150.0  100.0  100.0     4.0


# Perform gap filling
df['close'] = df['close'].ffill()               # carry the last known close forward
df['volume'] = df['volume'].fillna(0)           # fill missing volume with 0
df['open'] = df['open'].fillna(df['close'])     # fill missing open with the forward-filled close
df['high'] = df['high'].fillna(df['close'])     # fill missing high with the forward-filled close
df['low'] = df['low'].fillna(df['close'])       # fill missing low with the forward-filled close
df = df.fillna(0)                               # fill remaining OHLC gaps with 0 if no previous value is available

#               open   high    low  close  volume
# date                                          
# 2020-01-01    0.0    0.0    0.0    0.0     0.0
# 2020-01-02  100.0  110.0   90.0   95.0     3.0
# 2020-01-03   95.0   95.0   95.0   95.0     0.0
# 2020-01-04   98.0  150.0  100.0  100.0     4.0

2 Answers

SELECT "tickerId",
       "ts",
       coalesce("open", "close")  "open",
       coalesce("high", "close")  "high",
       coalesce("low", "close")   "low",
       coalesce("close", "close") "close",
       coalesce("volume", 0)      "volume",
       coalesce("count", 0)       "count"

FROM (
     SELECT "tickerId",
            time_bucket_gapfill('1 hour', at)   "ts",
            first(price, "eId")                 "open",
            MAX(price)                          "high",
            MIN(price)                          "low",
            locf(last(price, "eId"))            "close",
            SUM(volume)                         "volume",
            COUNT(1)                            "count"
     FROM "PublicTrades"
     WHERE at >= date_trunc('day', now() - INTERVAL '1 year')
       AND at < NOW()
     GROUP BY "tickerId", "ts"
     ORDER BY "tickerId", "ts" DESC
     LIMIT 100
 ) AS P

Note: eId is the exchange's public trade ID.
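
The per-row logic of the outer SELECT can be sketched in plain Python to make the fallback rules explicit. This is only an illustration of the semantics, not code from the answer: the helpers `coalesce` and `fill_row` are hypothetical names, and the input rows are assumed to be the inner query's output, where close has already been carried forward.

```python
def coalesce(*values):
    """Return the first argument that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def fill_row(row):
    """Mimic the outer SELECT for one bucket.

    Prices fall back to the (already carried-forward) close;
    volume and count fall back to 0.
    """
    close = row["close"]
    return {
        "open": coalesce(row["open"], close),
        "high": coalesce(row["high"], close),
        "low": coalesce(row["low"], close),
        "close": close,
        "volume": coalesce(row["volume"], 0),
        "count": coalesce(row["count"], 0),
    }


# A gap bucket that follows real data: only close is known.
gap = {"open": None, "high": None, "low": None,
       "close": 0.009248, "volume": None, "count": None}
# A gap bucket before any data exists: nothing to carry forward.
leading = {"open": None, "high": None, "low": None,
           "close": None, "volume": None, "count": None}

print(fill_row(gap))
print(fill_row(leading))
```

Note that a leading gap keeps empty prices, because there is no earlier close to fall back on. If zeros are wanted there, as in the question's expected output, one option is to add a final fallback such as `coalesce("open", "close", 0)` in the SQL.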

Answered 2020-04-30T22:53:03.077

You need to specify, for each column, how gap filling should be performed. My guess is that you probably want to use locf. See:

https://docs.timescale.com/latest/api#time_bucket_gapfill
https://docs.timescale.com/latest/api#locf
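
As a rough illustration of what "last observation carried forward" means, here is a small Python sketch. The function below is a hypothetical stand-in written for this example, not TimescaleDB's `locf`; it only shows the idea on a list of bucketed close prices taken from the question.

```python
def locf(values):
    """Carry the last non-None value forward.

    Leading None values stay None because there is nothing to carry yet.
    """
    out = []
    last = None
    for value in values:
        if value is not None:
            last = value
        out.append(last)
    return out


# Daily closes for 2017-07-25 .. 2017-07-30 from the question.
closes = [None, None, 0.010039, 0.009248, None, 0.009457]
carried = locf(closes)
print(carried)
```

The gap on 2017-07-29 takes the previous day's close, while the two leading days remain empty and would need a separate default (for example 0) if the question's expected output is the goal.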

Answered 2020-02-17T01:53:20.750