I have 3 milion abstracts and I would like to extract 4-grams from them. I want to build a language model so I need to find the frequencies of these 4-grams.
My problem is that I can't extract all these 4-grams in memory. How can I implement a system that it can estimate all frequencies for these 4-grams?