I am building a SOLR cluster where each solr document corresponds to data about a company. For example, the following attributes are tracked:

1. name
2. size
3. location
4. awards
5. profit

My problem is that I also want to track historical data for the attributes that may change (such as size/awards). I know the easy way to do this is to have a document in SOLR for each time range, so that getting all companies that were under size 50 from 2012 to 2013 is a simple SOLR query. However, I'm dealing with close to 20 million companies, and with this strategy every attribute change duplicates the document, dramatically increasing the number of documents in the SOLR cluster.
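For reference, here is a minimal sketch of what that query could look like with one document per time range. It assumes a numeric `size` field, date fields named `date_valid` and `date_deprecated`, and that the current period simply omits `date_deprecated`; the helper name `period_query` is made up for illustration.

```python
def period_query(field, max_value, start, end):
    """Build a Solr query for documents whose validity period overlaps [start, end]."""
    # Value strictly below max_value: '[' is inclusive, '}' is exclusive.
    value_clause = f"{field}:[* TO {max_value}}}"
    # The period began on or before the end of the window...
    began = f"date_valid:[* TO {end}]"
    # ...and ended on or after its start, or has no end date yet (still current).
    ended = f"(date_deprecated:[{start} TO *] OR (*:* -date_deprecated:[* TO *]))"
    return " AND ".join([value_clause, began, ended])


q = period_query("size", 50, "2012-01-01T00:00:00Z", "2013-12-31T23:59:59Z")
print(q)
```

The overlap test is the usual one for intervals: a period matches if it starts before the window ends and ends after the window starts.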

I am trying to think of a clever way to use fields in SOLR so I can track the deprecated attributes and their dates within the main companies document. But I can't seem to work out a good way to do it. I know that is partially because this problem isn't what SOLR was designed for and storing data this way means it's not properly normalized. However, I am just looking for a good way to avoid massively duplicating my data.

Key use case is to be able to execute queries like:

select all companies that were under size 50 from 2012 to 2013

So each attribute has to be linked to a value, a date_valid field, and a date_deprecated field. The attribute value and both dates must also be searchable.

I want to do something like this:

{  
   "size":[  
      {  
         "date_deprecated":null,
         "date_valid":"2015-01-01",
         "value":"100"
      },
      {  
         "date_deprecated":"2014-12-31",
         "date_valid":"2014-01-01",
         "value":"50"
      },
      {  
         "date_deprecated":"2013-12-31",
         "date_valid":"2013-01-01",
         "value":"25"
      }
   ]
}

But obviously that doesn't fly in SOLR. Also, the attributes (fields) are dynamic, as I use a dynamic SOLR schema, so I don't necessarily know what all the attributes are.
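To pin down the semantics I am after, here is a plain-Python sketch (run outside SOLR) of the check the query should perform against a structure like the one above. The function name `was_under` is hypothetical, and it assumes values are integer strings and a missing `date_deprecated` means the value is still current.

```python
from datetime import date


def was_under(history, limit, start, end):
    """True if any period with value < limit overlaps the window [start, end]."""
    for period in history:
        valid_from = date.fromisoformat(period["date_valid"])
        # A missing end date means the value is still current.
        raw_end = period["date_deprecated"]
        valid_to = date.fromisoformat(raw_end) if raw_end else date.max
        overlaps = valid_from <= end and valid_to >= start
        if overlaps and int(period["value"]) < limit:
            return True
    return False


size_history = [
    {"date_deprecated": None, "date_valid": "2015-01-01", "value": "100"},
    {"date_deprecated": "2014-12-31", "date_valid": "2014-01-01", "value": "50"},
    {"date_deprecated": "2013-12-31", "date_valid": "2013-01-01", "value": "25"},
]

print(was_under(size_history, 50, date(2012, 1, 1), date(2013, 12, 31)))  # True
print(was_under(size_history, 50, date(2014, 1, 1), date(2015, 12, 31)))  # False
```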

Any ideas?

1 Answer

Duplicating the data may not be a big deal if you use Solr for search only and index the field content without storing it. An indexed value is stored once in the index even if it appears in 20 documents; the index then only lists the documents that contain it.

So, you could have your primary data source with all the fields somewhere else and use Solr for search.
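As a rough sketch of that split, assuming each per-period Solr document stores only a company identifier (called `company_id` here, a made-up field name) while the full record lives in some other store, the search results could be mapped back like this. The dictionaries below are placeholder data standing in for a real Solr JSON response and a real database.

```python
def hydrate(solr_response, primary_store, id_field="company_id"):
    """Map Solr hits back to full records held outside Solr."""
    seen = set()
    records = []
    for doc in solr_response["response"]["docs"]:
        company_id = doc[id_field]
        # Several period documents can belong to the same company.
        if company_id in seen:
            continue
        seen.add(company_id)
        records.append(primary_store[company_id])
    return records


# Hypothetical data standing in for a real Solr response and database.
solr_response = {"response": {"numFound": 3, "docs": [
    {"company_id": "c1"}, {"company_id": "c1"}, {"company_id": "c2"}]}}
primary_store = {
    "c1": {"name": "Acme", "size": 25},
    "c2": {"name": "Globex", "size": 40},
}
print(hydrate(solr_response, primary_store))
```

For this to work, the identifier field itself has to be retrievable from Solr, even if the other fields are indexed only.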

Answered 2015-02-11T05:04:58.607