
I have a list of about 18,000 unique words scraped from a database of government transcripts that I would like to make searchable in a web app. The catch: This web app must be client-side. (AJAX is permissible.)

All the original transcripts are in neat text files on my server, so the index file of words will list which files contain each word and how many times, like so:

ADMINSTRATION   {"16": 4, "11": 5, "29": 4, "14": 2}
ADMIRAL {"34": 12, "12": 2, "15": 9, "16": 71, "17": 104, "18": 37, "19": 23}
AMBASSADOR  {"2": 15, "3": 10, "5": 37, "8": 5, "41": 10, "10": 2, "16": 6, "17": 6, "50": 4, "20": 5, "22": 17, "40": 10, "25": 14}

In its final form I have this reduced to a trie structure to save space and speed up retrieval, but even so, 18K words with their locations come to about 5MB of data, even with stop words removed. And no one is realistically going to search for out-of-context adjectives and subordinating conjunctions.
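For context, a minimal sketch of how an index like this can be built; the transcripts/ directory, the file naming, and the stop-word list are assumptions for illustration only:

```python
# Sketch: build a word -> {doc_id: count} index from plain-text transcripts.
# Directory layout, file naming, and the stop-word list are placeholders.
import json
import re
from collections import Counter, defaultdict
from pathlib import Path

STOP_WORDS = {"THE", "AND", "OF", "TO", "A", "IN", "THAT"}  # trimmed example list

index = defaultdict(dict)
for path in Path("transcripts").glob("*.txt"):
    doc_id = path.stem                                      # e.g. "16" for 16.txt
    words = re.findall(r"[A-Z']+", path.read_text().upper())
    for word, count in Counter(words).items():
        if word not in STOP_WORDS:
            index[word][doc_id] = count

with open("index.json", "w") as f:
    json.dump(index, f)
```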

I realize this is as much a language question as a coding question, but I'm wondering if there is a common solution in NLP for reducing a text to the words that are meaningful out of context.

I tried running each word through the Python NLTK POS tagger, but there's a high error rate when the words stand by themselves, as one would expect.
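For reference, this is roughly what tagging isolated words looks like; with no sentence context the tagger can only fall back on each token's most likely tag (the example words and expected outputs here are illustrative):

```python
# Tagging words in isolation with NLTK: without sentence context the tagger
# can only guess each token's most common tag, hence the high error rate.
import nltk
# nltk.download("averaged_perceptron_tagger")  # one-time model download

print(nltk.pos_tag(["ADMIRAL"]))   # e.g. [('ADMIRAL', 'NNP')]
print(nltk.pos_tag(["returned"]))  # e.g. [('returned', 'VBD')], even where it's adjectival
```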


2 Answers


I wouldn't try to reduce the size of the dictionary (your 18K words), because it's hard to guess which words are "meaningful" to your application and your users.

Instead, I would try to reduce the number of words each document contributes to the index. For example, if 50% of your documents contain a given word W, it is probably useless to index it (of course I can't be sure without seeing your documents and your domain!).

If that's the case, you could compute the TF-IDF of each word in your documents and pick a threshold below which you don't bother indexing. You could even pick a maximum index size (say, 1MB) and find the threshold that satisfies that requirement.
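A minimal sketch of that kind of pruning over the word -> {doc_id: count} index shown in the question; the document count and both thresholds are placeholders to tune against the target index size:

```python
# Sketch: prune an inverted index by document frequency and TF-IDF score.
import math

# Toy stand-in for the question's word -> {doc_id: count} index.
index = {
    "ADMIRAL": {"34": 12, "12": 2, "16": 71},
    "THE":     {str(d): 100 for d in range(50)},   # appears in every document
}
num_docs = 50          # assumption: total number of transcripts
max_df = 0.5           # drop words that appear in more than half the documents
min_tfidf = 1.0        # drop (word, doc) postings scoring below this

pruned = {}
for word, postings in index.items():
    df = len(postings)
    if df / num_docs > max_df:
        continue       # too common across the corpus to be worth indexing
    idf = math.log(num_docs / df)
    kept = {doc: tf for doc, tf in postings.items() if tf * idf >= min_tfidf}
    if kept:
        pruned[word] = kept

print(pruned)          # "THE" is gone; "ADMIRAL" survives
```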

In any case, I would never try to use POS tagging here. To paraphrase a famous quote about regexes:

You have a simple indexing problem. You try to use POS-tagging to solve it. Now you have two problems.

answered 2013-07-17T17:24:46.050

NLP is my field, and I'm afraid there is only one way to do this reliably: first POS-tag every sentence in the transcripts, then extract statistics over the (word, pos-tag) tuples. That way you can distinguish, for example, instances of "returned" used as an adjective from instances where it is used as a verb. Finally, decide what to keep and what to discard (e.g., keep only nouns and verbs and throw away everything else).
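A minimal sketch of that pipeline with NLTK; the transcript path and the noun/verb tag whitelist are assumptions:

```python
# Sketch: POS-tag each sentence in context, count (word, tag) pairs, and keep
# only words whose tags are nouns or verbs.
from collections import Counter
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

KEEP_PREFIXES = ("NN", "VB")        # Penn Treebank noun and verb tag prefixes

tag_counts = Counter()
with open("transcripts/16.txt") as f:
    for sentence in nltk.sent_tokenize(f.read()):
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            tag_counts[(word.upper(), tag)] += 1

# A word survives if it is ever used as a noun or a verb in context;
# a stricter rule could require that to be its majority usage.
keep_words = {word for (word, tag), n in tag_counts.items()
              if tag.startswith(KEEP_PREFIXES)}
```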

answered 2013-07-17T16:50:48.487