I have a list of about 18,000 unique words scraped from a database of government transcripts that I would like to make searchable in a web app. The catch: the app must be entirely client-side (AJAX is permissible).
All the original transcripts are in neat text files on my server, so the index file lists, for each word, which files contain it and how many times, like so:
ADMINSTRATION {"16": 4, "11": 5, "29": 4, "14": 2}
ADMIRAL {"34": 12, "12": 2, "15": 9, "16": 71, "17": 104, "18": 37, "19": 23}
AMBASSADOR {"2": 15, "3": 10, "5": 37, "8": 5, "41": 10, "10": 2, "16": 6, "17": 6, "50": 4, "20": 5, "22": 17, "40": 10, "25": 14}
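For reference, an index in this shape can be produced server-side with a short Python script. The sample transcripts below are hypothetical stand-ins for the real text files:

```python
import json
import re
from collections import defaultdict

def build_index(transcripts):
    """Build a word -> {file_id: count} index from a {file_id: text} mapping."""
    index = defaultdict(lambda: defaultdict(int))
    for file_id, text in transcripts.items():
        # Uppercase to match the index format shown above.
        for word in re.findall(r"[A-Za-z']+", text.upper()):
            index[word][file_id] += 1
    return {word: dict(counts) for word, counts in index.items()}

# Hypothetical sample input standing in for the transcript files.
transcripts = {
    "16": "The admiral spoke to the ambassador. The admiral agreed.",
    "17": "The ambassador replied.",
}
index = build_index(transcripts)
print(json.dumps(index["ADMIRAL"]))  # {"16": 2}
```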
I have reduced this to a trie structure in its final form to save space and speed up retrieval, but even so, and even with stop words removed, the 18K words plus their locations come to about 5MB of data. And no one is reasonably going to search for out-of-context adjectives and subordinating conjunctions.
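As a minimal sketch of the trie idea, each word is stored character by character in nested dicts, with a hypothetical "$" end-of-word key holding the location counts:

```python
END = "$"  # hypothetical end-of-word marker holding the location counts

def trie_insert(trie, word, locations):
    """Insert a word into the trie, attaching its {file_id: count} map."""
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = locations

def trie_lookup(trie, word):
    """Return the location map for a word, or None if it is absent."""
    node = trie
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get(END)

trie = {}
trie_insert(trie, "ADMIRAL", {"34": 12, "12": 2})
trie_insert(trie, "AMBASSADOR", {"2": 15, "3": 10})
print(trie_lookup(trie, "ADMIRAL"))  # {'34': 12, '12': 2}
```

Shared prefixes (ADMIRAL/ADMINISTRATION both share "ADMI") are stored once, which is where the space savings come from.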
I realize this is as much a language question as a coding question, but I'm wondering whether there is a common NLP technique for reducing a text to the words that are meaningful out of context.
I tried running each word through the Python NLTK POS tagger, but, as one would expect, the error rate is high when the words stand by themselves.