I have a standalone Java application which operates on a large number of elements read from an input file, each element being associated with an identifier. For each element, I do the following (among other things, of course):
- Check that the element has not already been processed using its identifier.
- Map the element to a grid using some statistical method, each cell of the grid being responsible for tracking the unique elements that were assigned to it, along with some properties calculated for each element.
The number of elements can be quite large (several million), and so can the grid itself. Each cell is created on the fly as soon as an element is assigned to it, to avoid storing empty cells.
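To make the setup concrete, here is a stripped-down sketch of the purely in-memory approach I'm using today (the `Cell` class, the `process` method and the single `propertyValue` are simplifications of the real thing):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GridProcessor {

    /** Simplified per-cell accumulator: counts elements and sums one property. */
    static class Cell {
        long elementCount;
        double propertySum;

        void accumulate(double propertyValue) {
            elementCount++;
            propertySum += propertyValue;
        }
    }

    // Identifiers already seen, used for the duplicate check.
    private final Set<String> processedIds = new HashSet<>();

    // Sparse grid: cells are created lazily, keyed by their (x, y) index.
    private final Map<Long, Cell> grid = new HashMap<>();

    /** Processes one element; returns false if its identifier was already seen. */
    public boolean process(String id, int cellX, int cellY, double propertyValue) {
        if (!processedIds.add(id)) {
            return false; // duplicate, skip it
        }
        long cellKey = ((long) cellX << 32) | (cellY & 0xFFFFFFFFL);
        grid.computeIfAbsent(cellKey, k -> new Cell()).accumulate(propertyValue);
        return true;
    }
}
```

With several million identifiers plus one `Cell` per non-empty grid cell, both maps live entirely on the heap, which is where the memory pressure comes from.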
The question is: with such a large amount of data, memory issues naturally arise. What would be the best strategy for processing this volume of data while avoiding memory issues?
I have a couple of things in mind, but I'd like to know whether anyone has already had this kind of problem and, if so, could share their experience:
- An embedded lightweight SQL database (rough sketch of what I mean below)
- Caching solutions such as Ehcache or Apache JCS
- NoSQL key-value stores such as Cassandra
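For the first option, this is roughly what I have in mind for the duplicate check, assuming H2 as the embedded database (any other lightweight engine would do; the class, table and column names are placeholders, and a real version would inspect the SQL error code instead of catching `SQLException` blindly):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class DiskBackedDedup implements AutoCloseable {

    private final Connection conn;
    private final PreparedStatement insertId;

    public DiskBackedDedup(String dbPath) throws SQLException {
        // File-based H2 database (H2 driver on the classpath),
        // so the identifier set lives on disk instead of on the heap.
        conn = DriverManager.getConnection("jdbc:h2:" + dbPath);
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS processed (id VARCHAR(64) PRIMARY KEY)");
        }
        insertId = conn.prepareStatement("INSERT INTO processed (id) VALUES (?)");
    }

    /** Returns true if the identifier was new; false if it was already processed. */
    public boolean markProcessed(String id) throws SQLException {
        try {
            insertId.setString(1, id);
            insertId.executeUpdate();
            return true;
        } catch (SQLException e) {
            // Assumed here to be a primary-key violation, i.e. the id was already there.
            return false;
        }
    }

    @Override
    public void close() throws SQLException {
        insertId.close();
        conn.close();
    }
}
```

The grid cells could presumably be pushed to a table in the same way, but I'm not sure how the per-insert overhead would hold up over several million elements.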
Thoughts?