我们从交易所收到报价并将其存储在 KDB Ticker Plant。我们希望分析 RDB 和 HDB 中的卷,同时对这些数据库的性能影响最小,因为它们也被其他团队使用。
- 首先,我们如何创建一个函数,以 10 分钟为间隔分割一天,并为每个间隔创建一个包含数量的统计信息?我们需要使用哪些 KDB 函数?
- 其次,如何安全地做到这一点?我们应该在循环中逐部分提取记录还是一次查询一次?我们的数据库中每天有大约 1.5 亿条记录。
I'll make some assumptions about table and column names, which I'm sure you can extrapolate
We receive quotes from exchange and store them in KDB Ticker Plant
As a matter of definition, tickerplant
only stores data for a very small amount of time and then logs it to file and fires the data off to RDB (and other listeners).
with minimum impact on performance of these databases
It all depends on (a) your data volume (b) a most optimal where clause. It also depends on whether you have enough RAM on your machine to cope with the queries. The closer you get to the critical, the harder it is for the OS to allocate memory, and therefore the longer it takes to make the query (although memory allocation time pales in comparison to getting data off a disk - so disk speed is also a factor).
Firstly, how we may create a function which splits a day in 10 minutes intervals and for each interval create a stat with volume?
Your friend here is xbar: http://code.kx.com/q/ref/arith-integer/#xbar
getBy10MinsRDB:{[instrument;mkt]
select max volume, min volume, sum volume, avg volume by 10 xbar `minute$time from table where sym=instrument, market=mkt
};
For an HDB the most optimal query (for a date-parted database) is date
then sym
then time
. In your case you haven't asked for time, so I omit.
getBy10MinsHDB:{[dt;instrument;mkt]
select max volume, min volume, sum volume, avg volume by 10 xbar `minute$time from table where date=dt,sym=instrument, market=mkt
};
Should we extract records in a loop portion by portion or in one go with one query?
No, that's the absolute worst way of doing things in KDB :-) there's almost always a nice vector-ised solution.
We have around 150 million records for each day in our database.
Since KDB is a columnar database, the types of the columns you have are as important as the number of records; as that impacts memory.
because they are also used by other teams
If simple queries like above are causing issues, you need to consider splitting the table up by market, perhaps, to reduce query clash and load. If memory isn't an issue, consider -s
for HDB's for multithreaded queries (over multiple days). Consider negative port number on HDB for multithreaded input queue to minimise query clash (although it doesn't necessarily make things faster).