I am looking to track topic popularity on a very large number of documents. Furthermore, I would like to give recommendations to users based on topics, rather than the usual bag of words model. To extract the topics I use natural language processing techniques that are beyond the point of this post.
My question is how should I persist this data so that: I) I can quickly fetch trending data for each topic (in principle, every time a user opens a document, the topics in that document should go up in popularity) II) I can quickly compare documents to provide recommendations (here I am thinking of using clustering techniques)
More specifically, my questions are: 1) Should I go with the usual way of storing text mining data? meaning storing a topic occurrence vector for each document, so that I can later measure the euclidean distance between different documents. 2) Some other way?
I am looking for specific python ways to do this. I have looked into SQL and NoSQL databases, and also into pytables and h5py, but I am not sure how I would go about implementing a system like this. One of my concerns is how can I deal with an ever growing vocabulary of topics?
Thank you very much