I am building a Solr cluster in which each document holds the data for one company. For example, the following attributes are tracked:
1. name
2. size
3. location
4. awards
5. profit
My problem is that I also want to track historical data for the attributes that may change (such as size or awards). I know the easy way to do this is to create a Solr document for each time range, so if I wanted all companies that were under size 50 from 2012 to 2013, it would be a simple Solr query. However, I'm dealing with close to 20 million companies, and with that strategy every time a single attribute changes we duplicate the whole document, dramatically increasing the number of documents in the cluster.
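To make the "one document per time range" strategy concrete, here is a small sketch. The field names (`size_i`, `valid_from_dt`, `valid_to_dt`), the example documents, and the query-string construction are all hypothetical, not my actual schema:

```python
# Hypothetical per-time-range documents: one Solr doc per company per
# period during which the attribute values were stable.
docs = [
    {"id": "acme-2012", "name": "Acme", "size_i": 40,
     "valid_from_dt": "2012-01-01T00:00:00Z", "valid_to_dt": "2013-12-31T23:59:59Z"},
    {"id": "acme-2014", "name": "Acme", "size_i": 120,
     "valid_from_dt": "2014-01-01T00:00:00Z", "valid_to_dt": "2014-12-31T23:59:59Z"},
]

def under_size_in_range(max_size, start, end):
    """Build filter queries for: companies under `max_size` whose
    validity period overlaps [start, end]."""
    return [
        f"size_i:[* TO {max_size - 1}]",
        # Overlap test: the period started before the range ended...
        f"valid_from_dt:[* TO {end}]",
        # ...and ended after the range started.
        f"valid_to_dt:[{start} TO *]",
    ]

fqs = under_size_in_range(50, "2012-01-01T00:00:00Z", "2013-12-31T23:59:59Z")
```

The query is trivial here, which is exactly why this model is tempting; the cost is the document duplication described above.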
I am trying to think of a clever way to use fields in Solr so I can track the deprecated attribute values and their dates within the main company document, but I can't work out a good way to do it. I know that is partly because this isn't what Solr was designed for, and storing data this way means it isn't properly normalized. Still, I am just looking for a good way to avoid massively duplicating my data.
Key use case is to be able to execute queries like:
select all companies that were under size 50 from 2012 to 2013
So each attribute has to be linked to a value, a date-valid field, and a date-deprecated field, and both the value and the dates must be searchable.
I want to do something like this:
{
    "size": [
        {
            "date_deprecated": null,
            "date_valid": "2015-01-01",
            "value": "100"
        },
        {
            "date_deprecated": "2014-12-31",
            "date_valid": "2014-01-01",
            "value": "50"
        },
        {
            "date_deprecated": "2013-12-31",
            "date_valid": "2013-01-01",
            "value": "25"
        }
    ]
}
But obviously that doesn't fly in Solr. Also, the attributes (fields) are dynamic, since I use a dynamic Solr schema, so I don't necessarily know in advance what all the attributes are.
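For concreteness, here is the obvious flattening I keep coming back to and why it fails. The dynamic-field suffixes (`_is`, `_dts`) and the helper are hypothetical; the problem is that parallel multivalued fields lose the pairing between a value and its dates:

```python
# Nested history for one company (the structure I want; None stands in
# for a still-current value).
history = {
    "size": [
        {"date_deprecated": None, "date_valid": "2015-01-01", "value": 100},
        {"date_deprecated": "2014-12-31", "date_valid": "2014-01-01", "value": 50},
        {"date_deprecated": "2013-12-31", "date_valid": "2013-01-01", "value": 25},
    ]
}

def flatten(attr, entries):
    """Flatten one attribute's history into parallel multivalued fields
    (hypothetical dynamic-field names)."""
    return {
        f"{attr}_value_is": [e["value"] for e in entries],
        f"{attr}_date_valid_dts": [e["date_valid"] for e in entries],
        f"{attr}_date_deprecated_dts": [e["date_deprecated"] for e in entries],
    }

doc = flatten("size", history["size"])
# The lists are index-aligned in Python, but a query like
#   size_value_is:[* TO 49] AND size_date_valid_dts:[2012-01-01 TO 2013-12-31]
# matches when ANY value and ANY date qualify, even from different
# history entries, so the value/date linkage is lost at query time.
```

That cross-matching across multivalued fields is the core of what I can't get around.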
Any ideas?