We're looking to migrate some data into InfluxDB. I'm working with InfluxDB 2.0 on a test server to determine the best way to store our data.
As of today, I have about 2.7 billion series to migrate to InfluxDB, and that number will only go up.
Here is the structure of the data I need to store:
- ClientId (332 values as of today, string of 7 characters)
- Driver (int, 45k values as of today, will increase)
- Vehicle (int, 28k values as of today, will increase)
- Channel (100 values, should not increase, string of 40 characters)
- Value of the channel (float, 1 value per channel/vehicle/driver/client at a given timestamp)
At first, I thought of storing my data this way:
- One bucket (as all data have the same data retention)
- Measurements = channels (so 100 distinct measurements are stored)
- Tag Keys = ClientId
- Fields = Driver, Vehicle, Value of channel
This gave me a cardinality of 1 * 100 * 332 * 3 = 99,600, according to this article.
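To make that concrete, here is a minimal sketch of what one point would look like under this first schema, using the official Python client (influxdb-client). The channel name `engine_rpm`, the bucket `telemetry`, the connection settings, and all IDs are made-up placeholders:

```python
from datetime import datetime, timezone
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection settings
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

point = (
    Point("engine_rpm")                 # measurement = channel (hypothetical channel name)
    .tag("ClientId", "CLT0001")         # the only tag, so cardinality stays low
    .field("Driver", 12345)             # field: not indexed, does not create new series
    .field("Vehicle", 67890)            # field: not indexed, does not create new series
    .field("value", 842.5)              # the channel value itself
    .time(datetime(2024, 1, 1, tzinfo=timezone.utc), WritePrecision.NS)
)
write_api.write(bucket="telemetry", record=point)
client.close()
```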
But then I realized that InfluxDB handles duplicate points based on "measurement name, tag set, and timestamp".
So this will not work for my data, as I need duplicates to be distinguished by ClientId, Channel, and Vehicle at a minimum.
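For example (same hypothetical names as above), two vehicles of the same client reporting the same channel at the same instant produce identical series keys, so the second point silently overwrites the fields of the first instead of being stored as a separate point:

```python
from datetime import datetime, timezone
from influxdb_client import Point, WritePrecision

ts = datetime(2024, 1, 1, tzinfo=timezone.utc)
p1 = (Point("engine_rpm").tag("ClientId", "CLT0001")
      .field("Vehicle", 11111).field("Driver", 12345).field("value", 842.5)
      .time(ts, WritePrecision.NS))
p2 = (Point("engine_rpm").tag("ClientId", "CLT0001")
      .field("Vehicle", 22222).field("Driver", 54321).field("value", 910.0)
      .time(ts, WritePrecision.NS))

# Both lines share the same measurement, tag set and timestamp,
# so writing p2 after p1 replaces p1's field values.
print(p1.to_line_protocol())
print(p2.to_line_protocol())
```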
But if I change my structure to store the data this way:
- One bucket (as all data have the same data retention)
- Measurements = channels (so 100 distinct measurements are stored)
- Tag Keys = ClientId, Vehicle
- Fields = Driver, Value of channel
then I'll get a cardinality of 2,788,800,000.
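Under this second schema the same point would look like the sketch below (again with made-up names). Because Vehicle is now a tag, every distinct (channel, ClientId, Vehicle) combination becomes its own series, which is where the multi-billion figure comes from:

```python
from datetime import datetime, timezone
from influxdb_client import Point, WritePrecision

point = (
    Point("engine_rpm")                 # measurement = channel
    .tag("ClientId", "CLT0001")         # tag
    .tag("Vehicle", "67890")            # tag: every vehicle value adds a new series per channel/client
    .field("Driver", 12345)             # field: still not indexed
    .field("value", 842.5)              # the channel value
    .time(datetime(2024, 1, 1, tzinfo=timezone.utc), WritePrecision.NS)
)
```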
I understand that I need to keep cardinality as low as possible. (And ideally I would also like to be able to search by driver as well as by vehicle.)
My questions are:
- If I split the data into different buckets (e.g. one bucket per ClientId), will that decrease my cardinality?
- What would be the best way to store data for such a large number of series?